Funding information extraction

Authors & Affiliation

Anastasios Giannakopoulos, Yannis Foufoulas and Harry Dimitropoulos (Athena RIC)


Funding information is an important type of metadata for any research information system and is useful for statistics and visual analytics on articles, authors, institutions and funding. It allows research administrators to discover and present trends over time, to track statistics for different funding bodies and calls, to evaluate research output and, when combined with other data, to assess the return on investment (in social or economic terms). Unfortunately, most current publishing systems do not integrate funding information in their metadata, while the few that do use proprietary formats.

Our main goal is to extract funding information from the unstructured text of publications. To this end, we use a database created from Crossref that contains names, aliases and other metadata for more than 10 thousand funding organizations.

Our algorithm runs in three steps:
● In the first step, we extract acknowledgement statements from the text. To do so, we apply a high-pass filter to the text, so that only the sections containing acknowledgement statements are kept. The filter threshold is the average density of phrases that are commonly used in acknowledgement statements (e.g. "Funded by", "Partially supported by"). A region of the text that is dense in such phrases is not filtered out, and the regions that remain are the extracted acknowledgement statements.
● In the second step, we apply string matching. We match the acknowledgement statements against the names and aliases of funding organizations, and for each publication we produce a list of its possible funders.
● The last step is disambiguation. We mainly use words or phrases adjacent to the string match, together with other information such as the affiliation name, the authors' nationality and the publication date.
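
The three steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that text regions are paragraphs, that the threshold is the mean cue-phrase density over all paragraphs, and that disambiguation simply counts hint words near the match. The cue list, the two-entry registry, the "hints" field and all function names are hypothetical stand-ins for the Crossref-derived database and the real disambiguation signals.

```python
import re

# Hypothetical cue phrases typical of acknowledgement statements.
CUE_PHRASES = ["funded by", "supported by", "financial support", "grant"]

# Toy registry standing in for the Crossref-derived funder database.
# The "hints" field is an assumption made for this sketch only.
FUNDERS = [
    {"name": "European Commission", "aliases": ["EC"],
     "hints": ["horizon", "fp7", "europe"]},
    {"name": "Engineering Council", "aliases": ["EC"],
     "hints": ["uk", "engineering"]},
]


def cue_density(paragraph):
    """Number of cue-phrase occurrences per word in a paragraph."""
    words = paragraph.split()
    if not words:
        return 0.0
    lower = paragraph.lower()
    return sum(lower.count(cue) for cue in CUE_PHRASES) / len(words)


def extract_acknowledgements(text):
    """Step 1: keep paragraphs whose cue density exceeds the average."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    densities = [cue_density(p) for p in paragraphs]
    threshold = sum(densities) / len(densities)
    return [p for p, d in zip(paragraphs, densities) if d > threshold]


def match_funders(statement):
    """Step 2: find funders whose name or alias occurs in the statement."""
    candidates = []
    for funder in FUNDERS:
        for label in [funder["name"], *funder["aliases"]]:
            match = re.search(r"\b" + re.escape(label) + r"\b", statement)
            if match:
                candidates.append((funder, match.start(), match.end()))
                break
    return candidates


def disambiguate(statement, candidates, window=60):
    """Step 3: per matched span, keep the candidate with most nearby hints."""
    best_by_span = {}
    for funder, start, end in candidates:
        context = statement[max(0, start - window):end + window].lower()
        score = sum(hint in context for hint in funder["hints"])
        best = best_by_span.get((start, end))
        if best is None or score > best[0]:
            best_by_span[(start, end)] = (score, funder)
    return [funder["name"] for _, funder in best_by_span.values()]


def extract_funders(text):
    """Run the three steps and return the resolved funder names."""
    funders = []
    for statement in extract_acknowledgements(text):
        funders.extend(disambiguate(statement, match_funders(statement)))
    return funders


sample = (
    "We study funding metadata in scholarly publishing.\n\n"
    "Our method relies on string matching over full texts.\n\n"
    "This work was partially supported by the EC under the Horizon 2020 programme."
)
print(extract_funders(sample))  # ['European Commission']
```

In this toy example the alias "EC" matches both registry entries, and the nearby word "Horizon" tips the choice towards the European Commission. A real system would presumably also weigh the affiliation, the authors' nationality and the publication date, as described in the last step.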

Tags: Open data, funding information


