Exploring trends and impact of scientific publications based on open access journals: an application in the archaeological research domain
Traditionally, analyses of scientific publications have followed a top-down approach, in which existing taxonomies, derived from journal and bibliometric classifications, are imposed onto one’s data. This approach has limitations when it comes to capturing interdisciplinary and emerging fields of study. We believe that, by leveraging new AI techniques, we can gain insights into impact and trends in a much richer and semantically-driven way. Another limitation of traditional approaches is their reliance on closed, proprietary data repositories. Nowadays, with the availability of new open databases (e.g. OpenAlex, OpenAIRE) and the growing number of open-access journals, analysing scientific publications on the basis of open resources has become much more feasible. The aim of this talk is to demonstrate both types of innovation by presenting a particular application in the archaeological domain.
Our use-case is an analysis of the articles published in Archeologia e Calcolatori (A&C), an international open-access journal specialized in computer applications in archaeology, whose repository acts as a data provider for OpenAIRE and Europeana. The end result is a knowledge map that allows one to access scientific content in a nuanced and meaningful way, as well as to understand the impact and specialization of the journal’s publications with respect to similar journals.
We use two main data sources covering the last 10 years: the A&C publications and the proceedings of the Computer Applications and Quantitative Methods in Archaeology (CAA) conferences, a thematically similar collection. This second set serves as a benchmark against which to compare A&C.
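To illustrate how such open data can be retrieved, the sketch below pages through a journal’s recent works via the OpenAlex API. The source ID is a placeholder rather than A&C’s actual identifier, and the real pipeline may draw on OpenAIRE instead; this is a minimal sketch, not the authors’ exact code.

```python
# Minimal sketch: retrieving a journal's recent publications from the
# OpenAlex API. SOURCE_ID is a placeholder, not A&C's real identifier.
import requests

OPENALEX_WORKS = "https://api.openalex.org/works"
SOURCE_ID = "S0000000000"  # hypothetical OpenAlex source ID for the journal

def fetch_works(source_id, from_year=2014):
    """Page through all works of a source published since `from_year`."""
    cursor = "*"
    while cursor:
        resp = requests.get(OPENALEX_WORKS, params={
            "filter": f"primary_location.source.id:{source_id},"
                      f"from_publication_date:{from_year}-01-01",
            "per-page": 200,
            "cursor": cursor,
        })
        resp.raise_for_status()
        data = resp.json()
        yield from data["results"]
        cursor = data["meta"].get("next_cursor")  # None when exhausted

works = list(fetch_works(SOURCE_ID))
print(f"Retrieved {len(works)} works")
```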
When it comes to high-level categorization, we applied both top-down and bottom-up techniques. First, we trained a multi-label classifier based on the Association for Computing Machinery (ACM) taxonomy. Next, we performed topic modelling on our dataset by clustering similar titles and abstracts. In both cases, we used the pre-trained model Specter2 [1], a language model specialized in scientific literature, to vectorize the texts.
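The sketch below illustrates the vectorization step and its two downstream uses, assuming the base Specter2 checkpoint on Hugging Face (allenai/specter2_base); the actual pipeline may use the task-specific adapter variants, and the sample records and cluster count are placeholders.

```python
# Minimal sketch: embedding titles+abstracts with SPECTER2, then clustering.
# Assumes the base checkpoint; the authors may have used adapter variants.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoModel.from_pretrained("allenai/specter2_base")

def embed(papers):
    """Embed title+abstract pairs; SPECTER uses the [CLS] token vector."""
    texts = [p["title"] + tokenizer.sep_token + (p.get("abstract") or "")
             for p in papers]
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[:, 0, :].numpy()

papers = [  # placeholder records; in practice, the works fetched above
    {"title": "Virtual reality for museum outreach",
     "abstract": "We present an immersive exhibit..."},
    {"title": "LiDAR survey of a Roman villa",
     "abstract": "Airborne laser scanning reveals..."},
]
X = embed(papers)

# Bottom-up: topic modelling by clustering the embeddings (the cluster
# count is an arbitrary placeholder; real data would need tuning).
topics = KMeans(n_clusters=2, n_init="auto").fit_predict(X)

# Top-down: a multi-label classifier over ACM categories would be trained
# on labelled examples, e.g.:
#   from sklearn.multiclass import OneVsRestClassifier
#   from sklearn.linear_model import LogisticRegression
#   clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
#   clf.fit(X_train, Y_train)  # Y_train: binary matrix over ACM labels
```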
Both of these strategies offer a coarse-grained view of the data, but one often needs a finer level of granularity to extract useful insights. To this end, automatically identifying important entities (e.g. places, artefacts) in a text can prove useful. In our study, we used a pre-trained named entity recognition model for archaeologically-relevant entities. Finally, when it comes to computer applications in archaeology, an important type of entity is technologies (e.g. LiDAR sensors, virtual reality). To match publications with technologies, we queried Wikidata to obtain a comprehensive list of technologies and then applied fuzzy matching against our texts, thereby obtaining the desired links.
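A minimal sketch of this last step follows, pulling technology labels from Wikidata and fuzzy-matching them against article text. Using Q11016 (“technology”) as the root class, querying only its direct subclasses, and the 90% rapidfuzz threshold are all illustrative assumptions, not the authors’ exact query or settings.

```python
# Minimal sketch: fetch technology labels from Wikidata, then fuzzy-match
# them against a text. Q11016 and the threshold are assumptions.
import requests
from rapidfuzz import fuzz

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
# Direct subclasses of Q11016 ("technology") keep the sketch fast; a real
# pipeline would traverse deeper (wdt:P279*) for better coverage.
QUERY = """
SELECT ?tech ?techLabel WHERE {
  ?tech wdt:P279 wd:Q11016 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(SPARQL_ENDPOINT,
                    params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "science-mapping-sketch/0.1"})
resp.raise_for_status()
labels = [b["techLabel"]["value"]
          for b in resp.json()["results"]["bindings"]]

def match_technologies(text, threshold=90):
    """Return technology labels that fuzzily occur in `text`."""
    lowered = text.lower()
    return [lab for lab in labels
            if len(lab) > 3 and fuzz.partial_ratio(lab.lower(), lowered) >= threshold]

print(match_technologies("The site was surveyed with a LiDAR sensor "
                         "and reconstructed in virtual reality."))
```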
By the end of the talk, the audience will have seen an end-to-end example of semantically-driven mapping of scientific publications, one that makes it possible to navigate content in a much richer way than is normally available. With this, we hope to demonstrate the feasibility of using open data for science mapping and to showcase tools and strategies that, being domain-agnostic, can be applied well beyond this particular field.