
lightning talk

Keywords for data discovery

Sept 21, 11.30 CEST

Libraries, research administrators, Open Science Infrastructure providers, funders

Interdisciplinary collaborations: Networks, services, methods
Sustaining Open infrastructures, services and tools for research communities
Value added data products/services from open science
European Open Science Cloud (EOSC) and FAIR data 

Researchers, research Infrastructures and research communities, repository managers, publishers and content providers, libraries, research administrators, service providers and innovators

Finding research data is often described as difficult or challenging (Brickley, Burgess, & Noy, 2019; Chapman et al., 2020), especially in comparison to literature search (Kern & Mathiak, 2015). From observation (Krämer, Papenmeier, Carevic, Kern, & Mathiak, 2021) and surveys (Gregory, Groth, Scharnhorst, & Wyatt, 2020; Friedrich, 2020) we know that data discovery is a complex process that involves reviewing the literature, using data portals, reading documentation, and leveraging personal networks. The glue that holds all these steps together, however, is ordinary web search, e.g. via Google. Because there are no central, fully indexed repositories, individual data repositories are responsible for making their data visible to web search. In this paper we explore how research data is found via general web search: we retrieve the queries made to Google via the Google Search Console and analyze them using clustering techniques. The clustering is based on two features of each keyword: its probability in the queries and its Comparable Click Through Rate (CCTR). The latter is a normalized version of the click-through rate (CTR) that allows keywords to be compared with one another. We use the query logs of three data portals from the Social Sciences domain, run by two different institutions, together with a JSON file of dataset mentions in research papers taken from the Social Science Open Access Repository (SSOAR). The use case we are most interested in is known-item search, in which a dataset is retrieved by a name that has been communicated through the literature or personal communication. These names are often ambiguous, such as acronyms or common nouns, so researchers add further keywords to find the dataset’s website. The results of our analysis provide a set of keywords which, when systematically added in appropriate locations on research data landing pages, can help make those pages more discoverable.
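To make the two keyword features more concrete, the following is a minimal Python sketch of how they might be computed from query rows of the kind the Google Search Console reports (query text, clicks, impressions). The abstract does not give the CCTR formula, so the normalization used here (a keyword's CTR divided by the overall CTR of the portal) is purely an assumption for illustration and not the authors' definition. The function name `keyword_features`, the whitespace tokenization, and the sample rows are likewise hypothetical.

```python
from collections import defaultdict


def keyword_features(rows):
    """Compute (probability, assumed CCTR) for every keyword.

    rows: iterable of (query, clicks, impressions) tuples.
    Returns a dict mapping keyword -> (probability, cctr).
    """
    rows = list(rows)
    total_queries = len(rows)
    total_clicks = sum(clicks for _, clicks, _ in rows)
    total_impressions = sum(impr for _, _, impr in rows)
    overall_ctr = total_clicks / total_impressions if total_impressions else 0.0

    query_count = defaultdict(int)   # number of queries containing the keyword
    kw_clicks = defaultdict(int)     # clicks summed over those queries
    kw_impressions = defaultdict(int)

    for query, clicks, impressions in rows:
        # Naive tokenization: lowercase, split on whitespace, count once per query.
        for keyword in set(query.lower().split()):
            query_count[keyword] += 1
            kw_clicks[keyword] += clicks
            kw_impressions[keyword] += impressions

    features = {}
    for keyword, count in query_count.items():
        probability = count / total_queries
        ctr = kw_clicks[keyword] / kw_impressions[keyword] if kw_impressions[keyword] else 0.0
        # ASSUMPTION: "comparable" CTR taken as CTR relative to the overall CTR.
        # The source only says CCTR is a normalized CTR; the real formula may differ.
        cctr = ctr / overall_ctr if overall_ctr else 0.0
        features[keyword] = (probability, cctr)
    return features


# Invented sample rows, for illustration only.
sample_rows = [
    ("xyz data", 10, 100),
    ("xyz download", 30, 100),
    ("survey data", 0, 200),
]
features = keyword_features(sample_rows)
```

The resulting (probability, CCTR) pairs could then be passed to any standard clustering routine. Under the assumed normalization, a value above 1 would mean that queries containing the keyword are clicked more often than the portal's average query.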


Brigitte Mathiak, GESIS