Funding information extraction

Authors & Affiliation

Anastasios Giannakopoulos, Yannis Foufoulas and Harry Dimitropoulos - Athena RIC

Abstract

Funding information is an important type of metadata for any research information system and is useful for article, author, institution and funding statistics and for visual analytics. It allows research administrators to discover and present trends over time, to track statistics for different funding bodies and calls, to evaluate research output and, when combined with other data, to assess the return on investment (in social or economic terms). Unfortunately, most current publishing systems do not include funding information in their metadata, and the few that do use proprietary formats.

Our main goal is to extract funding information from the unstructured text of publications. To this end, we use a database created from Crossref that contains names, aliases and other metadata for more than 10 thousand funding organizations.

Our algorithm runs in three steps:
● In the first step, we extract acknowledgement statements from the text. To do so, we apply a high-pass filtering method to the text, so that only the sections containing acknowledgement statements are kept. The filter threshold is the average density of phrases that are commonly used in acknowledgement statements (e.g. "Funded by", "Partially supported by"). A region of the text that is dense in such phrases is not filtered out; the regions that remain are the extracted acknowledgement statements.
● In the second step, we apply string matching. We match the acknowledgement statements against the names and aliases of funding organizations, and for each publication we produce a list of its possible funders.
● The last step is disambiguation. We mainly use words or phrases that are adjacent to the string match, together with other information such as the affiliation name, the authors' nationality and the publication date.
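
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used by the authors. The cue phrases beyond the two examples given above, the use of paragraphs as text regions, the toy funder registry (its entries are invented and are not taken from Crossref) and the choice of the affiliation country as the only disambiguation signal are all hypothetical.

```python
import re

# Hypothetical cue phrases. "supported by" also covers "partially supported by".
CUE_PHRASES = ["funded by", "supported by", "grant"]

# Toy registry with invented entries (NOT Crossref data). Two funders share
# the alias "NSF" on purpose, so that step 3 has something to resolve.
FUNDERS = [
    {"name": "Northern Science Fund",
     "aliases": ["Northern Science Fund", "NSF"], "country": "Northland"},
    {"name": "Nordia Science Foundation",
     "aliases": ["Nordia Science Foundation", "NSF"], "country": "Nordia"},
]


def cue_density(paragraph):
    """Number of cue-phrase occurrences per word in one paragraph."""
    words = paragraph.split()
    if not words:
        return 0.0
    lowered = paragraph.lower()
    return sum(lowered.count(cue) for cue in CUE_PHRASES) / len(words)


def extract_acknowledgements(text):
    """Step 1: keep paragraphs whose cue density exceeds the average density."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    densities = [cue_density(p) for p in paragraphs]
    threshold = sum(densities) / len(densities)  # average density as threshold
    return [p for p, d in zip(paragraphs, densities) if d > threshold]


def match_funders(statement):
    """Step 2: whole-word, case-sensitive match of names and aliases."""
    candidates = []
    for funder in FUNDERS:
        for alias in funder["aliases"]:
            if re.search(r"\b" + re.escape(alias) + r"\b", statement):
                candidates.append((funder, alias))
                break  # one matching alias per funder is enough
    return candidates


def disambiguate(candidates, affiliation_country):
    """Step 3: when several funders share a matched alias, prefer the ones
    located in the authors' affiliation country (one possible signal)."""
    by_alias = {}
    for funder, alias in candidates:
        by_alias.setdefault(alias, []).append(funder)
    resolved = []
    for funders in by_alias.values():
        if len(funders) > 1:
            same_country = [f for f in funders
                            if f["country"] == affiliation_country]
            funders = same_country or funders  # keep all if nothing matches
        resolved.extend(f["name"] for f in funders)
    return resolved


def extract_funders(text, affiliation_country):
    """Run the three steps on the full text of one publication."""
    funders = []
    for statement in extract_acknowledgements(text):
        for name in disambiguate(match_funders(statement), affiliation_country):
            if name not in funders:
                funders.append(name)
    return funders
```

In this sketch a call such as extract_funders(full_text, "Nordia") returns the list of funder names that survive disambiguation. Because the threshold is the average density over all paragraphs, a text consisting of a single paragraph never passes the filter; a real system would presumably use finer-grained regions and the richer context signals listed in the last step.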

Open data, funding information