Funding information extraction

Authors & Affiliation

Anastasios Giannakopoulos, Yannis Foufoulas and Harry Dimitropoulos - Athena RIC

Abstract

Funding information is an important type of metadata for any research information system and is useful for statistics and visual analytics on articles, authors, institutions and funding. It allows research administrators to discover and present trends over time, to track statistics for different funding bodies and calls, to evaluate research output and, when combined with other data, to assess the return on investment (in social or economic terms). Unfortunately, most current publishing systems do not include funding information in their metadata, and the few that do use proprietary formats.

Our main goal is to extract funding information from the unstructured text of publications. To this end, we use a database built from Crossref that contains names, aliases and other metadata for more than 10,000 funding organizations.

Our algorithm runs in three steps:
● In the first step, we extract acknowledgement statements from the text. To do this, we apply a high-pass filtering method to the text, so that we keep only the sections that contain acknowledgement statements. The threshold of the high-pass filter is the average density of phrases that are commonly used in acknowledgement statements (e.g. "Funded by", "Partially supported by"). Regions of the text that are dense in such phrases pass the filter; what remains after filtering is the set of acknowledgement statements.
● In the second step, we apply string matching. We match the acknowledgement statements against the names and aliases of the funding organizations, and for each publication we produce a list of its possible funders.
● The last step is disambiguation. We mainly use words or phrases that are adjacent to the string match, together with other information such as the affiliation name, the authors' nationality and the publication date.
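
The three steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cue-phrase list, the paragraph-level granularity, the use of the document-wide mean density as the threshold, the 40-character context window and the tiny funder dictionary are all assumptions made for illustration. All function names are hypothetical.

```python
import re

# Hypothetical cue phrases; the actual list used by the authors is not given.
CUE_PHRASES = ["funded by", "supported by", "financial support", "grant from"]


def cue_density(text, cues=CUE_PHRASES):
    """Number of cue-phrase occurrences per word in `text`."""
    words = text.split()
    if not words:
        return 0.0
    lowered = text.lower()
    return sum(lowered.count(cue) for cue in cues) / len(words)


def extract_acknowledgements(paragraphs, cues=CUE_PHRASES):
    """Step 1: keep paragraphs whose cue density exceeds the average density.

    Assumption: the threshold is the mean density over all paragraphs.
    """
    densities = [cue_density(p, cues) for p in paragraphs]
    if not densities:
        return []
    threshold = sum(densities) / len(densities)
    return [p for p, d in zip(paragraphs, densities) if d > threshold]


def match_funders(statements, funders, window=40, cues=CUE_PHRASES):
    """Steps 2 and 3: match names/aliases, then score each match by context.

    `funders` maps a canonical name to a list of aliases. The score counts
    cue phrases in the `window` characters before the match, a crude stand-in
    for the adjacent-phrase evidence used in disambiguation.
    """
    candidates = []
    for statement in statements:
        for name, aliases in funders.items():
            for alias in [name, *aliases]:
                pattern = r"\b" + re.escape(alias) + r"\b"
                m = re.search(pattern, statement, re.IGNORECASE)
                if m is None:
                    continue
                context = statement[max(0, m.start() - window):m.start()].lower()
                score = sum(cue in context for cue in cues)
                candidates.append((name, alias, score))
                break  # one hit per funder per statement is enough
    return candidates


# Toy input; a real run would use the Crossref-derived funder database.
paragraphs = [
    "We study protein folding using molecular dynamics simulations.",
    "This work was funded by the European Research Council and partially "
    "supported by the NSF.",
    "Results show improved accuracy over the baseline.",
]
funders = {
    "European Research Council": ["ERC"],
    "National Science Foundation": ["NSF"],
}

statements = extract_acknowledgements(paragraphs)
candidates = match_funders(statements, funders)
for name, alias, score in candidates:
    print(name, alias, score)
```

In a fuller version, the context score would be combined with the other signals mentioned above (affiliation name, authors' nationality, publication date) to choose among funders that share a name or alias.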

Tags: Open data funding information
