The why, how and what of publishing research data
Each year, on the first Thursday of November, World Digital Preservation Day is celebrated. In honor of today, we share our thoughts on depositing research data in a data repository. Happy World Digital Preservation Day!
The why of data publication
Creating FAIR (Findable, Accessible, Interoperable, and Reusable) data is a team effort. One of the main tasks for you as a researcher is to prepare your research data for publication at the end of a (sub)project. This is also known as a data deposit.
Depositing your research data in a data repository helps to make your data FAIR, and it enhances your visibility and trustworthiness as a researcher. In addition, data publishing is good Open Science practice, because you make your scientific knowledge available to others, including societal partners. Within Leiden University, the reasoning behind making your data available is operationalised in the data management policies, both at central level and at faculty level (e.g. at the Faculty of Social and Behavioural Sciences).
We consider a published dataset an academic output in and of itself. It is therefore advisable to choose a repository specifically dedicated to publishing research data. We recommend not publishing the data as supplementary materials to a journal article, because in that case the data fall under the same conditions as the journal article, which can involve a tricky interplay between licenses and agreements with publishers (see our recent blog on this). Also, journals can change their policies or simply cease to exist, with no obligation to keep your data safe.
The what of data publication
It is important to note that not all data or files generated in a research project need to be part of a research data publication. We generally follow Sabine Leonelli in her definition of research data:
"a relational category applied to research outputs that are taken, at specific moments of inquiry, to provide evidence for knowledge claims of interest to the researchers involved. Data thus consist of a specific way of expressing and presenting information, which is produced and/or incorporated in research practices so as to be available as a source of evidence, and whose behaviour and scientific significance depend on the context in which it is used. In this view, data do not have truth-value in and of themselves, nor can they be seen as straightforward representations of given phenomena. Rather, data are essentially fungible objects, which are defined by their portability and their prospective usefulness as evidence." (Leonelli 2015: 811)
This means that, for example, administrative data, such as meeting minutes or consent forms, are generally not considered research data. At the end of a project or study, the different types of files need to be analysed: a selection process that should lead to a decision on which files or data go where after the project.
This decision depends on the file type, the file size, as well as the file’s content. Files with personal data, for example, may be suitable for depositing under restricted access in a data repository, but some (personal or otherwise sensitive) data may have to be stored on institution-internal systems only. Also, there may be domain-specific requirements for publishing research data such as the disciplinary guidelines for the Social Sciences that require creating a Publication Package.
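The selection process described above can be sketched as a small triage function. This is only an illustration of the reasoning, not an official procedure: the `ProjectFile` fields, the `internal_only` flag and the four destinations are hypothetical names introduced here, and in practice the decision is made together with a data steward and according to your faculty's data protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    # Hypothetical labels for the possible outcomes of the selection process.
    NOT_PUBLISHED = "not part of the data publication"
    OPEN = "data repository, open access"
    RESTRICTED = "data repository, restricted access"
    INTERNAL = "institution-internal storage only"


@dataclass
class ProjectFile:
    name: str
    is_research_data: bool        # False for e.g. meeting minutes or consent forms
    has_personal_data: bool = False
    internal_only: bool = False   # assumption: set by hand when data are too sensitive to deposit


def triage(f: ProjectFile) -> Destination:
    """Suggest where a file could go after the project (illustrative only)."""
    if not f.is_research_data:
        return Destination.NOT_PUBLISHED
    if f.internal_only:
        return Destination.INTERNAL
    if f.has_personal_data:
        return Destination.RESTRICTED
    return Destination.OPEN


if __name__ == "__main__":
    files = [
        ProjectFile("meeting_minutes.docx", is_research_data=False),
        ProjectFile("survey_anonymised.csv", is_research_data=True),
        ProjectFile("interviews_pseudonymised.txt", is_research_data=True, has_personal_data=True),
        ProjectFile("raw_audio.wav", is_research_data=True, has_personal_data=True, internal_only=True),
    ]
    for f in files:
        print(f"{f.name}: {triage(f).value}")
```

File size and file type are deliberately left out of this sketch, because the limits and preferred formats differ per repository.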
The how of data publication
By data publication we mean publishing research data, with appropriate metadata, in a data repository. In that sense, it contrasts with storing the research data on institution-internal storage facilities, which in our field of Research Data Management is often referred to as archiving.
Once it is clear which data will form part of the data deposit, it is time to select an appropriate data repository. There are many different data repositories, and it can be difficult to identify the one most suitable for your data. At Leiden University, some faculties express a strong preference in their data protocol. In other cases, the final choice is left to the individual researcher, as long as a trustworthy repository is used in which the data are handled according to the FAIR principles.
The Leiden University data repository
The Faculty of Social and Behavioural Sciences, the Faculty of Governance and Global Affairs and the Faculty of Humanities make use of the Leiden University data repository, an institutional instance of DataverseNL. If you are a researcher at one of these faculties, this data repository is highly recommended. If you have any questions about your data deposit, you can contact your data steward or contact us at datamanagement@library.leidenuniv.nl.
The Centre for Digital Scholarship (CDS) offers an overview of Research Data Services, including trustworthy data repositories. For a broader search of data repositories around the world, you can also consult the Registry of Research Data Repositories.
Criteria for assessing suitability of a data repository
1. Long-term preservation
Are your data relevant for future generations of researchers? If so, look for a repository that offers long-term preservation (more than 10 years, or permanently), including the series of managed activities necessary to ensure continued access to digital materials.
2. Trustworthiness
A trustworthy data repository usually provides a persistent identifier (e.g. a DOI) for your dataset. Often, the metadata are made openly available in a catalogue and shared via other platforms (e.g. OpenAIRE, ODISSEI). In most repositories, you can choose the license for your dataset from a fixed set, such as the Creative Commons licenses.
Trustworthiness is not clear-cut to define, but some data repositories have documented theirs in enough detail to obtain a certification such as the CoreTrustSeal, the nestor Seal, or ISO 16363.
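To illustrate why a persistent identifier matters: a DOI can always be turned into a stable link by placing it behind the https://doi.org/ resolver, regardless of where the dataset or article is hosted at that moment. The helper below is a minimal sketch; the function name `doi_to_url` and its simple validity check are our own illustration, not part of any repository's software.

```python
def doi_to_url(doi: str) -> str:
    """Turn a DOI written in a common notation into a resolvable https://doi.org/ link."""
    doi = doi.strip()
    # Strip notations that people often paste in front of the bare DOI.
    for prefix in ("https://doi.org/", "http://doi.org/",
                   "https://dx.doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Rough sanity check only: DOIs start with "10." and contain a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"this does not look like a DOI: {doi!r}")
    return "https://doi.org/" + doi


if __name__ == "__main__":
    # The DOI of the Leonelli (2015) article listed under References.
    print(doi_to_url("doi:10.1086/684083"))  # prints https://doi.org/10.1086/684083
```

Citing a dataset by such a link, rather than by the web address of the repository page, keeps the citation working even if the repository reorganises its website.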
3. General-purpose vs. domain-specific repository
A general-purpose repository (e.g. Zenodo) accepts datasets from all disciplines, and sometimes also outputs other than data. Consequently, the depositing guidelines are often not very strict. If you want to make your dataset more reusable, you may want to choose a domain-specific repository that uses discipline-specific vocabularies in its metadata (e.g. The Language Archive for linguistics).
4. Level of curation
Data curation can mean many things. At the Leiden University data repository, we offer curation in the form of help at the time of your data deposit. Once a dataset is deposited and published, the data curators are available for coordinating access requests or uploading new versions of the dataset. For repositories that offer long-term preservation, curation also includes converting files when their formats become obsolete.
References
Leonelli S. (2015) What Counts as Scientific Data? A Relational Framework. Philosophy of Science, 82(5): 810-821. https://doi.org/10.1086/684083
Acknowledgements
This blog was reviewed by Kristina Hettne and Pascal Flohr and edited by Pascal Flohr.