Large-scale deep learning techniques to monitor the openness of research data and software in France
Public policies on open science are increasingly promoting the sharing of research datasets and software. As a consequence, more and more countries and funders want to monitor the openness of the publications, research data and software produced. In this context, in addition to the monitoring of open access to publications available since 2019, France has implemented a large-scale analysis of scientific publications to track the openness of these important research products at national level.
Our method is based on state-of-the-art machine learning and document engineering techniques. It combines PDF full-text harvesting, document structuring, dataset and software mention detection, context characterization and corpus-level aggregation.
Applying text mining techniques to full texts brings many advantages in terms of coverage and adaptability. In particular, our method can be applied to any corpus of publications, whether defined by geographical region, institution or discipline. The accuracy of text mining has improved significantly in recent years thanks to deep learning. Combined with negative sampling techniques to address mention sparsity and false positives, our models reach F1 scores of 80% when evaluated on a realistic distribution of publications.
This method has been applied at the national scale to around 700k publications.
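To illustrate why evaluating on a realistic distribution matters, the short sketch below computes precision, recall and F1 from mention-level counts. All counts are invented for illustration and are not results from this work; the point is only that publications without any mention can add false positives but no true positives, which lowers precision and therefore F1.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Mention-level counts: true positives, false positives, false negatives."""
    tp: int
    fp: int
    fn: int

    def __add__(self, other: "Counts") -> "Counts":
        return Counts(self.tp + other.tp, self.fp + other.fp, self.fn + other.fn)


def precision_recall_f1(c: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a ratio is undefined."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical counts on documents that do contain mentions.
with_mentions = Counts(tp=80, fp=10, fn=20)
# Hypothetical counts on documents without any mention: only false positives are possible.
without_mentions = Counts(tp=0, fp=30, fn=0)

f1_dense = precision_recall_f1(with_mentions)[2]
f1_realistic = precision_recall_f1(with_mentions + without_mentions)[2]
print(f"F1 on mention-rich documents only: {f1_dense:.3f}")
print(f"F1 on the mixed distribution:      {f1_realistic:.3f}")
```

With these illustrative numbers, the score drops once mention-free documents are included, which is the effect that training with negative examples is meant to reduce.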
The annotations produced at the document level are then used by the French Open Science Monitor (https://frenchopensciencemonitor.esr.gouv.fr/) to track the openness of research datasets and software, in the context of the second National Plan for Open Science.
The source code and the data of the French Open Science Monitor, as well as all the associated tools and training datasets, are available under open licences.
Our demonstration will start with the visual annotation of a scholarly PDF by the software mention detection tool Softcite (https://cloud.science-miner.com/software) and the dataset mention detection tool DataStet. Then, we will present the results of an additional context classification of the mentions, which is used to track the proportion of publications that use, produce and share research datasets and software. Finally, we will show how these results are exploited to create indicators at scale in the “BSO” monitoring tool, based on the processing of around 700K research publications from France. The web interface (https://frenchopensciencemonitor.esr.gouv.fr/research-data/general) shows these indicators for each publication year since 2013 and for each discipline.
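As a rough sketch of the corpus-level aggregation step, the example below turns document-level annotations into a per-year indicator: the share of publications with at least one mention classified as shared. The record layout (`Mention`, `Publication`, and the fields `kind`, `used`, `created`, `shared`) is hypothetical and chosen for illustration; it is not the actual output schema of Softcite or DataStet, and the sample data are invented.

```python
from collections import defaultdict
from typing import NamedTuple


class Mention(NamedTuple):
    kind: str      # "dataset" or "software" (hypothetical labels)
    used: bool     # context classified as usage
    created: bool  # context classified as production
    shared: bool   # context classified as sharing


class Publication(NamedTuple):
    year: int
    mentions: list  # list of Mention


def share_rate_by_year(pubs: list, kind: str) -> dict:
    """Fraction of publications per year with at least one shared mention of `kind`."""
    totals = defaultdict(int)
    sharing = defaultdict(int)
    for pub in pubs:
        totals[pub.year] += 1
        if any(m.kind == kind and m.shared for m in pub.mentions):
            sharing[pub.year] += 1
    return {year: sharing[year] / totals[year] for year in sorted(totals)}


# Invented toy corpus: four publications from 2021, two from 2020.
corpus = [
    Publication(2021, [Mention("dataset", used=True, created=True, shared=True)]),
    Publication(2021, [Mention("dataset", used=True, created=False, shared=False)]),
    Publication(2021, [Mention("software", used=True, created=False, shared=False)]),
    Publication(2021, []),
    Publication(2020, [Mention("software", used=True, created=True, shared=True)]),
    Publication(2020, []),
]

dataset_rates = share_rate_by_year(corpus, "dataset")
software_rates = share_rate_by_year(corpus, "software")
print(dataset_rates)
print(software_rates)
```

A publication is counted once however many shared mentions it contains, so the indicator is a proportion of publications rather than a count of mentions. The same grouping could be keyed on discipline instead of year.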
Results show in particular that 22% of French publications published in 2021 mention the sharing of their data, and that 20% mention the sharing of their code or software.
The machine learning modules for software and dataset detection are open source, as is the rest of the code and data produced during this work. We encourage others to use and adapt these tools and this framework to monitor the production and openness of datasets and software.