NIST SP 1500-18r2

NIST Research Data Framework (RDaF)

Version 2.0

Robert J. Hanisch

Office of Data and Informatics

Material Measurement Laboratory

Debra L. Kaiser

Office of Data and Informatics

Material Measurement Laboratory

Alda Yuan

Office of Data and Informatics

Material Measurement Laboratory

Andrea Medina-Smith

Office of Data and Informatics

Material Measurement Laboratory

Bonnie C. Carroll

Consultant

Eva M. Campo

Consultant

Campostella Research and Consulting

Alexandria, VA

This publication is available free of charge

https://doi.org/10.6028/NIST.SP.1500-18r2

February 2024

Abstract

The NIST Research Data Framework (RDaF) is a multifaceted and customizable tool that aims to help shape the future of open data access and research data management (RDM). The RDaF will allow organizations and individual researchers to develop their own RDM strategy. Though NIST is leading the RDaF, most of the content in the current version 2.0, which supersedes preliminary V1.0 and interim V1.5, was obtained via engagement with national and international leaders in the research data community. NIST held a series of three plenary and 15 stakeholder workshops from October 2021 to September 2023. Workshop attendees represented many stakeholder sectors: US government agencies, national laboratories, academia, industry, non-profit organizations, publishers, professional societies, trade organizations, and funders (public and private), including international organizations. The audience for the RDaF is the entire research data community in all disciplines—the biological, chemical, medical, social, and physical sciences and the humanities. The RDaF is applicable from the organization to the project level and encompasses a wide array of job roles involving RDM, from executives and Chief Data Officers to publishers, funders, and researchers. The RDaF is a map of the research data space that uses a lifecycle approach with six stages to organize key information concerning RDM and research data dissemination. Through a community-driven and in-depth process, NIST identified and defined specific, high-priority topics and subtopics for each lifecycle stage. The topics and subtopics are programmatic and operational activities, concepts, and other important factors relevant to RDM which form the foundation of the framework. This foundation enables organizations and individual researchers to use the RDaF for self-assessment of their RDM status. 
Each subtopic has several informative references—resources such as guidelines, standards, and policies—to help a user understand or implement that subtopic. As such, the RDaF may be considered a “best practices” document. Fourteen overarching themes—topic areas identified as pervasive throughout the framework—illustrate the connections among the six lifecycle stages. Finally, the RDaF includes eight sample profiles for common job functions or roles. Each profile contains topics and subtopics an individual in the given role needs to consider in fulfilling their RDM responsibilities. Individual researchers and organizations involved in the research data lifecycle will be able to tailor these sample profiles or generate entirely new profiles for their specific job function. The methodologies used to generate the content of this publication, RDaF V2.0, are described in detail. An interactive web application has been developed and released that provides an interface for all the components of the RDaF mentioned above and replicates this document. The web application is easy and intuitive to navigate and provides new functionality enabled by the interactive environment.

Disclaimer

Publications in the SP1500 subseries are intended to capture external perspectives related to NIST standards, measurement, and testing-related efforts. These external perspectives can come from industry, academia, government, and others. These reports are intended to document external perspectives and do not represent official NIST positions. The opinions, recommendations, findings, and conclusions in this publication do not necessarily reflect the views or policies of NIST or the United States Government.

Certain commercial entities, equipment, or materials may be identified in this document to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

NIST Technical Series Policies

Copyright, Fair Use, and Licensing Statements

NIST Technical Series Publication Identifier Syntax

Publication History

Approved by the NIST Editorial Review Board on 2023-12-21

Supersedes NIST Series 1500-18 version 1.5 (May 2023) https://doi.org/10.6028/NIST.SP.1500-18r1; NIST Series 1500-18 (February 2021) https://doi.org/10.6028/NIST.SP.1500-18

How to Cite this NIST Technical Series Publication

Hanisch, RJ; Kaiser, D; Yuan, A; Medina-Smith, A; Carroll, B; Campo, E (2023) NIST Research Data Framework (RDaF) Version 2.0. (National Institute of Standards and Technology, Gaithersburg, MD), NIST Special Publication (SP) 1500-18r2. https://doi.org/10.6028/NIST.SP.1500-18r2

NIST Author ORCID IDs

Robert Hanisch: 0000-0002-6853-4602

Debra Kaiser: 0000-0001-5114-7588

Alda Yuan: 0000-0001-9619-306X

Andrea Medina-Smith: 0000-0002-1217-701X

Bonnie Carroll: 0000-0001-8924-1000

Eva Campo: 0000-0002-9808-4112

Contact Information

rdaf@nist.gov

Foreword

Version 2.0 of the NIST Research Data Framework builds on the preliminary version 1.0 released in February 2021 and on the interim version 1.5 released in May 2023, and incorporates input from many stakeholders. Version 2.0 has more than twice as many topics and subtopics as V1.0 and includes new sections. The major new sections are overarching themes, which are terms prevalent in multiple lifecycle stages, and profiles, which provide a list of the most relevant topics and subtopics for a given job function or role within the research data management ecosystem. A Request for Information (RFI) based on interim V1.5 was posted in the Federal Register in early June 2023. All comments received in response to this RFI were considered and the RDaF V1.5 was revised as appropriate. A draft of this modified version was presented at a stakeholder workshop held in September 2023.

Author Contributions

Robert Hanisch: Conceptualization, Methodology, Supervision, Writing- review and editing; Debra Kaiser: Formal Analysis, Methodology, Writing- review and editing; Alda Yuan: Formal Analysis, Methodology, Project Administration, Writing- original draft, Writing- review and editing, Visualization; Andrea Medina-Smith: Data Curation, Formal Analysis, Visualization, Software, Writing- review and editing; Bonnie Carroll: Conceptualization, Supervision, Writing- review and editing; Eva M. Campo: Data Curation, Visualization, Writing- review and editing.

Acknowledgments

The completeness, relevance, and success of the NIST RDaF is wholly dependent on the input and participation of the broad research data community. NIST is grateful to all the workshop participants and others who have provided input to this effort. First and foremost, NIST thanks the members of the RDaF Steering Committee, past and present, who have given sound advice and shared their invaluable expertise since the inception of the RDaF in December 2019: Laura Biven, Cate Brinson, Bonnie Carroll (Chair), Mercè Crosas, Anita de Waard, Chris Erdmann, Joshua Greenberg, Martin Halbert, Hilary Hanahoe, Heather Joseph, Mark Leggott, Barend Mons, Sarah Nusser, Beth Plale, and Carly Strasser.

The RDaF team is also grateful to Susan Makar from the NIST Research Library for assistance with the informative references and to Angela Lee for development of the V2.0 interactive web application. Thanks to Eric Lin and James St. Pierre for their critical advice.

Thanks to the former members of the RDaF team including Breeze Dorsey, Laura Espinal, and Tamae Wong. Thanks as well to Campostella Research and Consulting for providing administrative support for the project and technical support for the natural language processing work. Our appreciation also goes to the NIST Material Measurement Laboratory (MML) leadership for their support and to all participants of the various workshops held to solicit community feedback, particularly those individuals who volunteered to serve as discussion leaders.

And finally, thanks to all involved with the NIST Cybersecurity Framework, which provided an initial model for development of the RDaF.

Keywords

Research data; research data ecosystem; research data framework; research data lifecycle; research data management; research data dissemination, use, and reuse; research data governance; research data sharing; research data stewardship; open data.

1 Introduction

NIST’s Research Data Framework (RDaF) is designed to help shape the future of research data management (RDM) and open data access. Research data are defined here as “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.”[1] The motivation for the RDaF as articulated in the first RDaF publication V1.0 [2]—that the research data ecosystem is complicated and requires a comprehensive approach to assist organizations and individuals in attaining their RDM goals—has not changed since the project was initiated in 2019. Developed through active involvement and input from national and international leaders in the research data community, the RDaF provides a customizable strategy for the management of research data. The audience for the RDaF is the entire research data community, including all organizations and individuals engaged in any activities concerned with RDM, from Chief Data Officers and researchers to publishers and funders. The RDaF builds upon previous data-focused frameworks but is distinct through its emphasis on research data, the community-driven nature of its formulation, and its broad applicability to all disciplines, including the social sciences and humanities.

The RDaF is a map of the research data space that uses a lifecycle approach with six high-level lifecycle stages to organize key information concerning RDM and research data dissemination. Through a community-driven and in-depth process, stakeholders identified topics and subtopics: programmatic and operational activities, concepts, and other important factors relevant to RDM. These topics and subtopics are nested under the six stages of the research data lifecycle. A partial example of this structure is illustrated in Fig. 1.

Table which shows the nested organizational structure of the Framework core where Topics, Subtopics, and Informative References fall under the broader heading of the Research Data Lifecycle Stage

Fig. 1 — Partial organizational structure of the framework foundation

The components of the RDaF foundation shown in Fig. 1—lifecycle stages and their associated topics and subtopics—are defined in this document. In addition, most subtopics have several informative references—resources such as guidelines, standards, and policies—that assist stakeholders in addressing that subtopic. Specific standards and protocols provided in the text or informative references may only be relevant for certain RDM situations. A link to the complete list of informative references is given in Appendix A.
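The nesting described above (lifecycle stage, then topic, then subtopic, then informative references) can be pictured as a small data model. The following is a minimal sketch and not part of the RDaF itself: the class names, field names, and the helper method are illustrative assumptions, and the reference string is a placeholder rather than an actual informative reference. The sample subtopics and definitions are taken from Table 1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subtopic:
    name: str
    definition: str
    # Informative references: guidelines, standards, policies (placeholder strings here).
    informative_references: List[str] = field(default_factory=list)


@dataclass
class Topic:
    name: str
    subtopics: List[Subtopic] = field(default_factory=list)


@dataclass
class LifecycleStage:
    name: str
    topics: List[Topic] = field(default_factory=list)

    def subtopic_names(self) -> List[str]:
        """Flatten the stage into a list of its subtopic names, in order."""
        return [s.name for t in self.topics for s in t.subtopics]


# Hypothetical, partial instance of the Envision stage.
envision = LifecycleStage(
    name="Envision",
    topics=[
        Topic(
            name="Data Governance – Strategic/Qualitative",
            subtopics=[
                Subtopic(
                    name="Data needs assessment",
                    definition=(
                        "An evaluation of the requirements of an organization "
                        "regarding research data."
                    ),
                    informative_references=["<placeholder reference>"],
                ),
                Subtopic(
                    name="Stewardship",
                    definition=(
                        "The application of rigorous analyses and oversight to "
                        "ensure that data assets meet the needs of users."
                    ),
                ),
            ],
        )
    ],
)

print(envision.subtopic_names())
```

A structure of this kind makes it straightforward to list, filter, or count subtopics per stage when an organization builds its own view of the framework.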

The RDaF is not prescriptive; it does not instruct stakeholders to take any specific approach or action. Rather, the RDaF provides stakeholders with a structure for understanding the various components of RDM and for selecting components relevant to their RDM goals. The RDaF also includes sample profiles, which contain topics and subtopics that an individual in a given job role or function is encouraged to consider in fulfilling their RDM responsibilities. Researchers and organizations involved in the research data lifecycle will be able to tailor these profiles using a supplementary document and online tools that will be available on the RDaF homepage. Entirely new profiles may be generated using a blank online template available in this supplementary document. Other uses of the RDaF include self-assessment and improvement of RDM infrastructure and practices for both organizations and individuals.

The RDaF was designed to be applicable to all stakeholders involved in research data. An organization seeking to review their data management policies may use the subtopics to create their own metrics for RDM assessment. Researchers who wish to ensure that their data are open access may use the framework to create a “checklist” of RDM considerations and tasks. A research project leader seeking guidance on how to assign data management roles may use the eight sample profiles as a starting point to create customized lists of responsibilities for individual researchers in their lab.
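As one illustration of the "checklist" use mentioned above, the sketch below turns a list of subtopics into a simple self-assessment checklist. It is a hypothetical example only: the function names and the choice of subtopics are assumptions made for illustration, and the RDaF does not prescribe any particular tooling.

```python
from typing import Dict, Iterable, List


def build_checklist(subtopics: Iterable[str]) -> Dict[str, bool]:
    """Create a checklist with every subtopic initially marked as not addressed."""
    return {name: False for name in subtopics}


def outstanding(checklist: Dict[str, bool]) -> List[str]:
    """Return the subtopics that have not yet been addressed, in checklist order."""
    return [name for name, done in checklist.items() if not done]


# Hypothetical selection of Envision subtopics chosen by a researcher.
checklist = build_checklist(["Privacy", "Ethics", "Risk assessment"])
checklist["Privacy"] = True  # mark one item as addressed

print(outstanding(checklist))
```

The same pattern extends to an organization-level assessment by replacing the boolean with a maturity score or a free-text note per subtopic.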

Since the first publication of the RDaF in 2021 (V1.0 [2]), NIST has expanded and enriched the framework through extensive engagement with stakeholders in the research data community. This publication, RDaF V2.0, includes updates to V1.0 and new features. Definitions and informative references for each subtopic have been added to improve the usability and applicability of the RDaF. In addition to profiles discussed in the previous paragraph, this document includes overarching themes that appear across multiple lifecycle stages and a list of many of the key organizations in the RDM space (see Appendix B). The methodology used to generate the content of V2.0 is described in detail in the following section.

Note that the terms “data,” “datasets,” “data assets,” “digital objects,” and “digital data objects” are used throughout the framework depending on the context. Data is the most general and frequently used term. Dataset means a specific collection of data having related content. A data asset is “any entity that is comprised of data which may be a system or application output file, database, document, and web page.”[3] Digital objects and digital data objects typically have a structure such that they can be understood without the need for separate documentation. In addition, the terms “organization” and “institution” used throughout the framework are synonymous and the terms “RDaF team” and “team” refer to the authors of this publication. Finally, a list that spells out the full names of acronyms and initialisms used throughout this document is provided in Appendix C.

2 Methodology

This section describes the approaches used to develop RDaF V2.0, including brief descriptions of activities since the inception of the project in 2019. Throughout the lifetime of the RDaF project, the Steering Committee members noted previously in the Acknowledgments section were consulted, took leadership roles as discussion leaders at workshops, and provided valuable input and feedback on all aspects of the project.

2.1 Framework Development Through Stakeholder Input

The RDaF is driven by the research data stakeholder community, which can use the framework for multiple purposes such as identifying best practices for research data management (RDM) and dissemination and changing the research data culture in an organization. To ensure that the RDaF is a consensus document, NIST held stakeholder engagement workshops as the primary mechanism to gather input on the framework. The workshops have taken place in three phases, each resulting in further examination and refinement of the framework.

2.1.1 Phase 1: Plenary Scoping Workshop and Publication of the Preliminary RDaF V1.0

In the plenary scoping workshop held in December 2019, a group of about 50 distinguished research data experts selected a research data lifecycle approach as the organizing principle of the RDaF. The RDaF team subsequently selected six lifecycle stages—Envision, Plan, Generate/Acquire, Process/Analyze, Share/Use/Reuse, and Preserve/Discard—from a larger pool of stages suggested by workshop break-out groups. Feedback from this workshop contributed to the publication of the RDaF V1.0, which provides a structured and customizable approach to developing a strategy for the management of research data. The framework core (subsequently renamed foundation in V2.0) consisting of these six lifecycle stages and their associated topics and subtopics is the main result of that publication.

2.1.2 Phase 2: Opening Plenary Workshops

The second phase of the RDaF development began with two virtual plenary workshops held in late 2021. Each workshop had approximately 70 attendees and focused on one of two cohorts. The university cohort (UC) workshop, co-hosted by the Association of American Universities, the Association of Public and Land-grant Universities, and the Association of Research Libraries, was a horizontal cut across various stakeholder roles in universities (e.g., vice presidents of research, deans, professors, and librarians), publishing organizations, data-based trade organizations, and professional societies. In contrast, the materials cohort (MC) workshop, held in cooperation with the Materials Research Data Alliance, was a vertical cut across stakeholder organizations engaged in materials science, including academia, government agencies, industry, publishers, and professional societies.

Prior to the workshops, the attendees selected, or were assigned to, one of six breakout sessions, each focused on a stage in the RDaF research data lifecycle. A NIST coordinator sent the attendees a link to the RDaF publication V1.0, a list of the participants, and definitions of the topics for that session’s lifecycle stage. The agenda for the two workshops included an overview talk by Robert Hanisch on the RDaF, a one-hour breakout session, and a plenary session with summaries presented by an attendee of each breakout and with closing remarks. During the breakout sessions, a discussion leader, recruited by the RDaF team, solicited input from the 10 to 12 participants on the following questions:

  1. What are the most important (two or three) topics and the least important one?

  2. Are there any missing topics?

  3. Should any topics be modified or moved to another lifecycle stage?

The identical questions were posed regarding the subtopics for each topic. Attendee input was captured in an audio recording and in notes taken by the session rapporteur and the NIST coordinator. After the two opening plenary workshops, the RDaF team revised the topics and subtopics for the lifecycle stages based on this input. All six of the lifecycle stages were then reviewed side-by-side for consistency and completeness.

The collective review revealed 14 overarching themes which appeared in multiple lifecycle stages. These themes include metadata and provenance, data quality, the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles, software tools, and cost implications. Section 4 of this document will address all overarching themes in detail.

2.1.3 Phase 3: Stakeholder Workshops

The next step in obtaining community input involved a series of two-hour stakeholder workshops focused on specific roles, equivalent to job functions or position titles. To secure a broad range of feedback, the RDaF team compiled a list of more than 200 invitees, including attendees of previous workshops and additional experts. These invitees were assigned to one of the following 15 roles:

  • Academic mid-level executive/head of research

  • AI expert

  • Budget/cost expert

  • Curator

  • Data/IT leader

  • Data/research governance leader

  • Funder

  • Institute/center/program director

  • Open data expert

  • Professional society/trade organization leader

  • Professor

  • Provider of data tools/services/infrastructure

  • Publisher

  • Researcher

  • Senior executive

Unlike the first two RDaF workshops, these role-focused workshops were composed of smaller groups. The goal of these workshops was to develop profiles, i.e., lists of topics and subtopics important for individuals in a specific role with respect to RDM. Though the target size of these two-hour workshops was 10 to 12 participants, the actual number ranged from four to 14. For each workshop, the RDaF team identified and invited an expert to serve as the discussion leader. Two members of the team were assigned to each workshop: a presenter and a rapporteur.

During the workshops, after a brief presentation covering the purpose and structure of the RDaF, participants selected the lifecycle stages most relevant to their assigned role. For each lifecycle stage, participants reviewed the topics and subtopics, and discussed any that were missing, misplaced, or unclear. Depending on the length of the discussion, each workshop covered two to four of the lifecycle stages. In addition to requesting input on the topics and subtopics, the NIST coordinators asked participants to consider which topics and subtopics had the greatest influence on their role and those over which they had the greatest influence.

2.2 Framework Revisions per Stakeholder Workshop Input

Most of the input from participants at the Stakeholder Workshops concerned the topics and subtopics, and this input was used to revise them.

2.2.1 Stakeholder Workshop Note Aggregation

After the Stakeholder Workshops, the RDaF team designed a common methodology for collecting and analyzing the feedback, using a template to record the input from each workshop. This template contained the following:

  1. A column for topics and subtopics in a lifecycle stage that were missing, misplaced, or unclear

  2. A column for topics and subtopics relevant to, or missing from, the profile for a role

  3. A section on feedback that addressed the definition of the role

  4. A section on “takeaways” regarding the framework as a whole

  5. A section on proposed new overarching themes

To analyze the feedback from each stakeholder workshop, selected RDaF team members first reviewed the rapporteur’s notes to familiarize themselves with the discussion. Then these team members viewed the recording of the workshop, read through any written comments provided in the workshop chat, and noted every comment in the appropriate section of the template. After the first draft of the template notes was completed, the team members viewed the recording a second time, added any missing comments, and converted each comment and suggestion concerning a topic or subtopic into a potential change for review. Finally, the entire RDaF team considered each potential change and generated an updated interim V1.5 of the framework foundation.

2.2.2 Input for Profile Development

After updating the framework foundation based on the stakeholder feedback, the next step involved the generation of a sample profile for each role addressed by a workshop. As the feedback from the stakeholder workshops concerning profiles was limited and varied in form and specificity, more data were needed to develop these profiles.

The updated topics and subtopics were used to develop blank checklists, in spreadsheet form, of topics and subtopics for the lifecycle stages discussed at each of the 15 stakeholder workshops. The appropriate spreadsheet was sent to the participants of a given workshop with instructions to mark those topics and subtopics that were most relevant to the role addressed at that workshop. About 60 participants submitted a spreadsheet with their responses for the workshop they attended.

The responses were analyzed for similarities and several roles were modified. For example, professors and researchers were grouped together to form one role as professors are typically involved in their groups’ research. After consideration of the participants’ responses, the RDaF team selected eight common job roles for the generation of sample profiles. These roles are AI expert, curator, budget/cost expert, data and IT expert, provider of data tools, publisher, research organization leader, and researcher.

For each sample profile, the RDaF team first calculated the percentage of responses that labeled a subtopic as relevant. When 50% or more of the respondents considered a subtopic to be relevant, it was presumptively deemed relevant for the sample profile. Next, the team considered all comments received with the profile responses as well as all the notes from the Stakeholder Workshop to further flesh out the sample profile. Lastly, the RDaF team consulted with experts in these roles to finalize the profiles.
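The first step above, the 50% relevance threshold, can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the team's actual analysis: the function name, the representation of each response as a set of marked subtopics, and the sample responses are all invented for the example. The later steps (review of comments, workshop notes, and expert consultation) are qualitative and are not modeled here.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def presumptively_relevant(
    responses: Iterable[Set[str]], threshold: float = 0.5
) -> Dict[str, float]:
    """Return subtopics marked relevant by at least `threshold` of respondents.

    Each element of `responses` is the set of subtopics one respondent marked
    as relevant for the role. The result maps subtopic -> fraction of
    respondents who marked it.
    """
    responses = list(responses)
    if not responses:
        return {}
    total = len(responses)
    # Count, for each subtopic, how many respondents marked it.
    counts = Counter(subtopic for marked in responses for subtopic in marked)
    return {
        subtopic: count / total
        for subtopic, count in counts.items()
        if count / total >= threshold
    }


# Invented responses from four hypothetical respondents for one role.
sample_responses = [
    {"Privacy", "Ethics"},
    {"Privacy"},
    {"Privacy", "Inventory"},
    {"Ethics", "Risk assessment"},
]

selected = presumptively_relevant(sample_responses)
print(selected)
```

In this invented sample, "Privacy" is marked by three of four respondents and "Ethics" by two of four, so both meet the 50% threshold; "Inventory" and "Risk assessment" are each marked by one of four and do not.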

2.2.3 Request for Information on Interim Version 1.5

Interim V1.5 of the RDaF was published in May 2023 [4]. This publication included the entire list of topics and subtopics for the six lifecycle stages, definitions, informative references for most of the subtopics, 14 overarching themes, and eight sample profiles.

The RDaF team developed a Request for Information (RFI) that was posted in the Federal Register on June 6, 2023, to communicate updates to the RDaF and receive additional feedback on V1.5. The public had 30 days after release of the RFI to comment on any aspect of the RDaF. The RDaF team reviewed and distilled the comments into almost 70 possible action items which were considered individually within the context of the intent of the framework. All comments received were considered in generating V2.0 of the framework.

2.3 Development of an Interactive Web Application

A web application has been developed and released that presents an interface to the RDaF components—lifecycle stages, topics, subtopics, definitions, informative references, overarching themes, and sample profiles—and thus replicates this RDaF V2.0 document in an interactive environment. In addition to providing an easy means of navigating through the various components and the relationships among them, the web application has new functionality such as the capability to link subtopics to their corresponding informative references and to direct a user to the original source of any reference.

The web application runs on a variety of platforms including Windows, macOS, and Linux. Development of the software (database design, Entity Framework Core, web application framework, search strategies, and user interface) is the subject of a separate publication in preparation.

3 Framework Foundation – Lifecycle Stages, Topics, and Subtopics

The foundation of the RDaF consists of lifecycle stages, topics, and subtopics selected by the RDaF team using a vast amount of stakeholder input as described in Section 2. The RDaF research data lifecycle graphic depicted in Fig. 2 is cyclical rather than linear and has six stages defined below. Each stage is connected to all other stages, i.e., a stage can lead into any other stage. An organization or individual may initially approach the lifecycle from any stage and subsequently address any other stage. It is likely that an organization or individual will be involved in all lifecycle stages simultaneously, though with different levels of intensity or capacity.

Envision – This lifecycle stage encompasses a review of the overall strategies and drivers of an organization’s research data program. In this lifecycle stage, choices and decisions are made that together chart a high-level course of action to achieve desired organizational goals, including how the research data program is incorporated into an organization’s data governance strategy.

Plan – This lifecycle stage encompasses the activities associated with preparing for data acquisition, selection of data formats and storage solutions, and anticipation of data sharing and dissemination strategies and policies, including how a research data program is incorporated into an organization’s data management plan.

Generate/Acquire – This lifecycle stage covers the generation of raw research data, both experimentally and computationally, within an organization or by an individual, and the collection or acquisition of research data produced outside of an organization.

Process/Analyze – This lifecycle stage concerns the actions performed on generated or externally acquired research data to yield processed research data, typically using software, from which observations and conclusions can be made.

Share/Use/Reuse – This lifecycle stage outlines how raw and processed research data are disseminated, used, and reused within an organization or by an individual and any constraints or encouragements to use/reuse such data. This stage also includes the dissemination, use, and reuse of raw and processed research data outside an organization.

Preserve/Discard – This lifecycle stage delineates the end-of-use and end-of-life provisions for research data by an organization or individual and includes records management, archiving, and safe disposal.

A depiction of the six research data lifecycle stages which are envision, plan, generate/acquire, process/analyze, share/use/reuse, and preserve/discard. The lifecycle stages are arranged in a circle to represent their cyclic and interrelated nature

Fig. 2 — Research data framework lifecycle stages
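The statement that any stage can lead into any other stage can be made concrete with a short sketch. This is an illustrative model only: the `Stage` enumeration and the `next_stages` helper are assumptions introduced for the example and do not appear in the RDaF.

```python
from enum import Enum
from itertools import permutations
from typing import List


class Stage(Enum):
    ENVISION = "Envision"
    PLAN = "Plan"
    GENERATE_ACQUIRE = "Generate/Acquire"
    PROCESS_ANALYZE = "Process/Analyze"
    SHARE_USE_REUSE = "Share/Use/Reuse"
    PRESERVE_DISCARD = "Preserve/Discard"


# Every ordered pair of distinct stages is a possible transition,
# reflecting a lifecycle that is cyclical rather than linear.
TRANSITIONS = set(permutations(Stage, 2))


def next_stages(current: Stage) -> List[Stage]:
    """Return all stages that may follow `current` (every stage except itself)."""
    return [stage for stage in Stage if stage is not current]


print(len(TRANSITIONS))
print([stage.value for stage in next_stages(Stage.PLAN)])
```

With six stages there are 6 × 5 = 30 ordered transitions, which is why the lifecycle is drawn as a fully connected cycle rather than a fixed sequence.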

Tables 1-6 presented below each cover one research data lifecycle stage and its associated topics and subtopics. The goal of the framework is to be comprehensive while remaining flexible. An organization or individual may find that not every topic and subtopic in a lifecycle stage is relevant to their work. The selection of subtopics to generate a profile for a job or function will be described in Section 5.

Many lexicons are used in the research data management space. Though the RDaF does not intend to introduce an entirely new vocabulary, it is important to be precise with the use of key terms. For each topic and subtopic, the RDaF provides definitions to assist users in understanding what tasks and responsibilities are associated with that topic or subtopic. To derive these definitions, the RDaF team performed a search of common data lexicons such as CODATA’s Research Data Management Terminology and Techopedia [5, 6]. Additionally, the team searched more broadly for common and research data management-specific definitions, including ones for the informative references that provide guidance in the implementation of the RDaF. Some definitions are general or commonly understood and as such have no references. The definitions were checked for consistency with stakeholder feedback. Individual researchers and organizations should keep in mind that these definitions are not prescriptive and consider their own context when determining whether the definitions provided are appropriate.

Table 1. Envision lifecycle stage

Envision: Topic

Subtopic

Definition

Data Governance – Strategic/Qualitative

Identification of goals and roles

An exercise to define the objectives of, and responsible individuals for, various aspects of research data management (RDM).

The policies, procedures, and processes pertaining to authority, control, and shared decision-making (planning, monitoring, and enforcement) over the management of data assets. [9, 10]

Vision and/or policy

Vision is an aspirational state an organization wishes to achieve with respect to RDM.
Policy is a set of recommended and sometimes mandatory high-level principles that establish a guiding framework for RDM. [7, 8]

Data management organization

An RDM infrastructure (RDMI) of human and capital resources that supports data-related activities, e.g., policies, planning, and sharing, as well as practices and projects, e.g., data acquisition, control, and protection. Groups or individuals managing data across multiple platforms will need to ensure alignment and interoperability across the infrastructure. [11]

Organizational values, including DEIA

A set of core beliefs that function as guides to what is seen as good and important in an organization and the guiding principles that provide an organization with purpose and direction. Values ideally include diversity, equity, and inclusion, and accessibility. [12, 13]

Data management value proposition

A clear statement that indicates exactly what benefits an organization will derive from an RDM program. [14]

Data needs assessment

An evaluation of the requirements of an organization regarding research data, e.g., storage and technical support for data-related activities.

Purpose and value of data

A clear statement of the need for, use of, and benefit derived from, research data.

Organization intent regarding FAIR data

The extent to which an organization supports the internal adoption and use of the FAIR data principles.

End-use support

Components of the RDMI within an organization that enable data to be prepared and processed for its ultimate application, including reuse.

Stewardship

The application of rigorous analyses and oversight to ensure that data assets meet the needs of users. [15]

Data Governance – Legal and Regulatory Compliance

Privacy

The practice of protecting and properly handling sensitive data, including personal, proprietary, and confidential data. [16]

The policies, procedures, and processes to manage and monitor an organization’s regulatory and legal responsibilities and risks pertaining to data. [10]

Ethics

Moral principles pertaining to data practices, e.g., analysis and dissemination, that have the potential to adversely impact people and society. For example, principles that promote minimizing bias and maintaining the privacy of personal data. See also the Global Data Ethics Project. [17-19]

Safety and security assurance

The practice of protecting data assets from unauthorized access, theft, or corruption throughout their lifecycles. [20]

Inventory

A function that provides organizational capabilities for archiving data management such that data products can be grouped, searched, and identified for retrieval, statistics and reorganization. Also, a list of available items stored and/or controlled in a storage warehouse system. [15]

Risk assessment

A systematic process for the identification and evaluation of potential threats to, and vulnerabilities of, an organization’s data assets, e.g., unauthorized access to sensitive data. [22]

Risk mitigation and management

A process for the development and implementation of appropriate strategies to control, reduce, or eliminate potential threats to, and vulnerabilities of, an organization’s data assets as identified by a risk assessment. [23]

Sharing/licensing

Data sharing agreement: a formal contract that details what data are being shared and the appropriate use for the data.
Licensing agreement: a formal contract that states the purpose and duration of access being provided to the recipient licensee along with the restrictions and security protocols the recipient licensee of the data must follow. For intellectual property (IP), any agreement must include an assessment of what IP rights subsist in the data, who owns them, what exceptions or limitations apply, and any contractual rights or policies related to IP that should be considered within the data governance framework, including acquired and generated data as well as “background” (i.e., pre-existing) and “foreground” (i.e., from original research) IP. [24-27]

Social license for use and reuse

An unwritten agreement whereby a group of public stakeholders accept that certain datasets may be applied for purposes other than those for which the data were originally intended, e.g., healthcare data. [28]

Jurisdiction for sharing and reuse

Legal requirements as set by an authoritative entity (e.g., local and national regions) concerning the dissemination of data by an organization and subsequent use of the data by other organizations. [29]

Data Culture and Reward Structure

Roles and responsibilities

The job functions and obligations that enable the establishment of a desired data culture and reward structure.

The collective beliefs and behaviors of the people in an organization concerning the value and management of research data. Practices designed to recognize the advantages and accomplishments of sharing data. [30]

Recognition of data management

Processes and practices that provide acknowledgement and rewards for good RDM at all levels in an organization.

Value of data workers

Recognition of the benefits that staff performing data-centric jobs or functions provide to an organization.

Promotion and tenure

Career advancements that are linked to good research processes, practices, and outcomes.

Integrity of research and data

For research: The condition resulting from adherence to professional values and practices when conducting, reporting, applying, and disseminating results of the work. [31]
For data: The accuracy, completeness, and quality of data as they are maintained over time and across formats. [32]

FAIR data principles

Guidelines that allow digital objects (e.g., data, algorithms, and workflows) to be Findable, Accessible, Interoperable, and Reusable. [33]

Maintenance of FAIR data

Ongoing infrastructural support to sustain FAIR data principles and practices.

Incentives and impact for sharing and reuse

Staff recognition and rewards for widespread dissemination and application of research data and the beneficial effects of such dissemination.

Disincentives for sharing and reuse

Barriers that limit dissemination of data, e.g., misinterpretation and misuse of data by others, lack of recognition, and the effort required for sharing.

CARE and ethics

The CARE (Collective benefit, Authority to control, Responsibility, and Ethics) Principles for Indigenous Data Governance are people- and purpose-oriented, reflecting the crucial role of data in advancing Indigenous innovation and self-determination. (These principles complement the existing FAIR principles for Indigenous data governance.)
Ethics concerns moral principles pertaining to data practices, e.g., analysis and dissemination, that have the potential to adversely impact people and society. For example, principles that promote minimizing bias and maintaining the privacy of personal data. [17, 34]

Education and Workforce Development

Workforce skills inventory

A catalog of an organization’s capabilities in essential data processes.

Training to provide staff with the necessary skills and expertise for data-related activities and RDM. Includes leadership support and formal and informal training.

Workforce preparedness in new and advanced technologies

Assessment of needs for, and provision of, training in the skills and expertise of an organization’s staff pertinent to novel and leading-edge areas of research, e.g., AI.

Data management training

In-classroom, on-line, and/or hands-on instruction for staff to attain the skills and expertise required to manage data across a lifecycle.

HR’s supporting role in workforce development and training

Involvement of an organization’s Human Resources (HR) department in establishing and implementing instructional courses for staff to expand their skill sets and expertise in research data programs and RDM.

Promotional paths and career development

Documented approaches for recruitment, advancement, and retention of staff in data-centric jobs in an organization and expansion of data-related skills and expertise for all technical jobs.

Resources—Allocation and Sustainability

Sources of funding

Entities that provide financial support for research data programs and RDM infrastructure (e.g., capital and human resources).

The distribution and longevity of funding to attain and maintain robust research data programs and RDM infrastructure.

Long-term funding

Sustained financial support for research data activities and RDM infrastructure.

Staffing

Provision of sufficient resources to support RDM staff and researchers engaged in RDM activities.

Community Engagement

Stakeholder communities

Individuals, groups, and organizations that have an interest or stake in RDM or research data in general, and in particular domains. [35]

Outreach and interactions among organizations or individuals with shared goals or interests concerning research data activities or RDM.

Modes of communication

Ways by which information about research data and data management are shared and discussed.

Partners/partnerships

Partner: Two or more organizations or individuals that share responsibility and control of ideas, processes, and outcomes of research data activities.
Partnership: An agreement between organizations and individuals to collaborate on such activities. [36]

Engagement across knowledge domains and sectors

Interactions among groups or individuals having expertise in different specific, specialized disciplines or fields, or expertise in different technology areas. [37]

Inclusivity in interactions

The practice of including all types of people or ideas and treating them all fairly and equally. [38]

Data services and the beneficiaries

Solutions for data tasks (e.g., data transfer, storage, and analytics) and the organizations or individuals deriving value from such solutions. [39]

Table 2. Plan lifecycle stage

Plan: Topic

Subtopic

Definition

Chain of Custody

Roles and responsibilities

The job functions and obligations for tracking data assets.

A complete, fully documented step-by-step history of a data asset in an organization, i.e., who has possession of a data asset, at what time, and for what purpose, at all times throughout the lifecycle of the data asset. [40]

Implementation authority

Person empowered to grant access to data assets, e.g., a Chief Data Officer.

Centralized inventory of services, groups, and resources

An organization-wide catalog of items supporting data-related activities at various levels of an organization, including capital (e.g., HPC), virtual (e.g., domain repositories), and human (e.g., Data Steward and AI interest group) components.

Provenance

The historical, attributed, and documented record of a data asset that contains details on its origin—where, when, how, and by whom it was generated/acquired/processed—and on all alterations to the data asset. [15]
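
The chain-of-custody and provenance definitions above both describe a time-ordered record of who held a data asset, when, and for what purpose. A minimal sketch of such a record follows. The class names, field names, and the example identifier are hypothetical and chosen only for illustration; the RDaF does not prescribe any particular structure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class CustodyEvent:
    """One entry in the history of a data asset (hypothetical fields)."""
    holder: str          # who has possession of the asset
    purpose: str         # for what purpose
    timestamp: datetime  # at what time


@dataclass
class DataAsset:
    """A data asset with an append-only custody history (illustrative only)."""
    identifier: str
    history: List[CustodyEvent] = field(default_factory=list)

    def transfer(self, holder: str, purpose: str,
                 when: Optional[datetime] = None) -> CustodyEvent:
        """Record a change of possession; events must be in time order."""
        when = when or datetime.now(timezone.utc)
        if self.history and when < self.history[-1].timestamp:
            raise ValueError("custody events must be recorded in time order")
        event = CustodyEvent(holder, purpose, when)
        self.history.append(event)
        return event

    def current_holder(self) -> Optional[str]:
        """Return the most recent holder, or None if nothing is recorded."""
        return self.history[-1].holder if self.history else None


asset = DataAsset("dataset-001")  # hypothetical identifier
asset.transfer("instrument lab", "data generation",
               datetime(2024, 1, 10, tzinfo=timezone.utc))
asset.transfer("data steward", "curation and deposit",
               datetime(2024, 2, 1, tzinfo=timezone.utc))
print(asset.current_holder())  # data steward
```

Events are only appended, never edited, so the history remains a complete step-by-step record; a production system would also need persistent storage and access control, which this sketch omits.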

Financial Aspects of Planning

Funding models for provisioning resources

Approaches for providing financial support for data-related activities and infrastructure, including direct (e.g., grants, contracts, and institutional), overhead, or mixed. [42]

Factors to consider in estimating or assessing the costs associated with all research data and RDM activities over the data lifecycle.

Funding sources

Entities that provide financial support for research data activities and infrastructure (e.g., capital and human resources).

Decision-making tools to assess costs

Methods to determine the financial requirements of various data activities and infrastructure, e.g., cost-benefit analysis, market analysis, and decision trees.

Cost-benefit analysis

A systematic approach to estimating the strengths and weaknesses of alternative actions to determine options which provide the best approach to achieving benefits while preserving savings. [43]

Cost breakdown by lifecycle stage

Identification of funds required for each data activity in a project (e.g., hardware, software, and staffing for data generation), or for an RDM infrastructure (e.g., centralized data services).

Downstream lifecycle costs

Funds required after establishment of an RDM infrastructure (e.g., technology refresh and maintenance) or for later-stage data activities (e.g., long-term preservation).

Staffing and training

Costs incurred in assuring that new staff with appropriate skills and expertise are hired for specific data activities and that existing staff attain new and advanced skills through instructional courses.

Data Management Planning

Written data management plans (DMPs)

Also known as Data Management and Sharing Plans (DMSPs), these documents provide information on the following topics: Administrative Data, Data Collection, Documentation and Metadata, Ethics and Legal Compliance, Storage and Backup, Selection and Preservation, Data Sharing, and Responsibilities and Resources. DMPs are living documents that should be updated as projects change and mature. [44, 45]

The process of organizing and specifying objectives and activities throughout the research data lifecycle.

Purpose/intent of research study and context of anticipated data use

Clear articulation of research objectives in terms of data products that are essential to address specific research and/or technical requirements.

Specification of data entities and actions throughout the lifecycle

Detailed descriptions of all information, processes, software, and hardware required from conception to completion of a research data project.

Machine-readable DMPs

Data management plan documents in a form that can be used and understood by a computer. DMPs may also be machine-actionable or in a form such that computers can be programmed against the structure. [46]
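
As an illustration of a DMP that a computer can read and be programmed against, the sketch below stores a plan as JSON and checks it for the topics listed in the definition of written DMPs. The key spellings and all values are assumptions made for this example and do not represent an official DMP schema.

```python
import json

# Section names follow the DMP topics listed in the definition of written
# DMPs; the key spellings and all values are hypothetical, not an official schema.
DMP_SECTIONS = (
    "administrative_data",
    "data_collection",
    "documentation_and_metadata",
    "ethics_and_legal_compliance",
    "storage_and_backup",
    "selection_and_preservation",
    "data_sharing",
    "responsibilities_and_resources",
)


def missing_sections(plan: dict) -> list:
    """Return the expected sections that are absent or empty in a plan."""
    return [name for name in DMP_SECTIONS if not plan.get(name)]


# Build a placeholder plan, then fill in one section with example values.
plan = {name: {"summary": "to be completed"} for name in DMP_SECTIONS}
plan["administrative_data"] = {"project_title": "Example study", "version": 1}

serialized = json.dumps(plan, indent=2)  # machine-readable form
restored = json.loads(serialized)        # read back by another program
print(missing_sections(restored))        # []
```

Because the plan has a predictable structure, a funder or repository could run a check such as `missing_sections` automatically each time the living document is updated.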

Linkage of DMPs to administrative records

Interconnection of a research data management plan to operational data, e.g., agreements and transactions.

Data organization to facilitate future access

The practice of categorizing, classifying, and storing data with sufficient detail and specificity such that the data are readily discoverable and usable by others. Examples include databases and repositories. [47]

Data management expertise and training

In-class, on-line, and/or hands-on instruction for staff to attain the skills and knowledge required to manage data in a research study.

Data Object

Quantitative and qualitative

Quantitative data are numerical data, e.g., measurements and some controlled observations and questionnaires. Qualitative data are defined as non-numerical data, e.g., text, videos, photographs, or audio recordings. [48]

An entity that, together with associated metadata, is produced or used in a research study. [15, pg 13]

Measurement

A quantity in various formats, including numerical, visual, and auditory.

Observation

A fact or occurrence often involving measurement with instruments. [49]

Survey

A list of questions aimed at extracting specific data from a particular group of people. [50]

Software

A computer-based application that converts inputs into outputs to support the user in one or more research tasks. [51]

Model

A representation, pattern, or mathematical description that can help scientists replicate a system, process, or research result. [52]

Documentation (text)

Comprehensive information that accompanies a dataset, including all associated metadata, a data dictionary, descriptions of methods, instruments and software used to generate/collect and process the data, and other supporting data (e.g., duplicate sample results, replicate analyses). [53]

Specimen (physical sample)

A tangible object that may be observed or tested to determine its properties or characteristics.

Presentation

Material assembled to explain and describe research results or processes to an audience.

FAIR

Organizational support for making data more FAIR

Institutional resources to improve the extent of "FAIRness" of data. (FAIRness is used herein to denote a continuum state ranging from no FAIR aspects to fully FAIR.)

Findability, Accessibility, Interoperability, Reusability: a set of guiding principles to support the reusability of data that are beneficial to all scholarly digital research objects. [33, 54]

Identification of methods/guidelines vis-à-vis FAIR principles

An exercise to locate techniques and recommended procedures related to FAIRness.

Data/Metadata Considerations

Criteria for selection of data/metadata

Requirements and needs by which decisions are made regarding what information to generate, collect, and document in a research study.

Factors to take into account prior to conducting a research study.

Nature of data/metadata required

Specification of the requisite types and characteristics of selected information.

Intended extent of FAIRness

The degree to which data and metadata are meant to comply with the FAIR data principles.

Methods to capture and store data/metadata

Techniques or means by which data/metadata are collected, recorded, and preserved.

Metadata schema

The overall structure of data about the data. Two examples of general-purpose metadata schema are Dublin Core and MODS (Metadata Object Description Schema). [55, 56]
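
To make the idea of a metadata schema concrete, the sketch below describes a dataset using element names from the 15-element Dublin Core Metadata Element Set and flags any keys that fall outside that set. The record values and the helper name `unknown_elements` are hypothetical, and the check covers element names only, not the content rules a full schema validation would apply.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def unknown_elements(record: dict) -> list:
    """Return, in sorted order, the keys that are not Dublin Core elements."""
    return sorted(key for key in record if key not in DUBLIN_CORE_ELEMENTS)


# Hypothetical dataset description; every value is invented for illustration.
record = {
    "title": "Example thermal conductivity measurements",
    "creator": "Example Researcher",
    "date": "2024-02-01",
    "type": "Dataset",
    "format": "text/csv",
    "identifier": "example-dataset-001",
    "rights": "to be determined",
}

print(unknown_elements(record))                                  # []
print(unknown_elements({**record, "instrument": "hypothetical"}))  # ['instrument']
```

A general-purpose schema such as this one has no element for domain-specific details like the instrument used, which is one reason projects often pair it with a community or discipline-specific schema.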

Data Architecture

Design

A set of principles that are formulated from specific strategies, rules, models, and guidelines for the management and flow of a dataset throughout its lifecycle.

The fundamental structure of an organization's research data management (RDM) system embodied in its components, their relationships to each other and to the environment, and the principles guiding its design and evolution. Includes, for example, system interfaces, authentication mechanisms, data brokers, and monitoring platforms. [60, 61]

Processing operations

Methodology for translating raw data into usable information. Specific methods include data preparation, validation, sorting, aggregation, analysis, and reporting.

Workflow

The process of managing data in a structured manner. It involves collecting, organizing, and processing data so that they can be used for various purposes. [57]

Model

A detailed description or scaled representation of the relationships and data flow between different components of an RDM system, typically in the form of a diagram or flowchart. [58]

LIMS

A laboratory information management system (LIMS) is a software system developed to support laboratory operations (e.g., track specimens and workflows and aggregate datasets). [59]

Hosting and storage, cloud storage

Methods whereby, and locations wherein, data are saved and from which data can be retrieved.

Configuration management

The actions of tracking and controlling changes in the hardware and software components, e.g., updates and version control. [62]

Interoperability among different architectures

The capability to communicate, execute programs, or transfer data among different RDM systems in a useful and meaningful manner that requires the user to have little or no knowledge of the unique characteristics of those systems. [63]

Security

Features of the architecture that protect data from unauthorized access, denial of access, corruption, or theft throughout their entire lifecycles. [20]

Existing standards

Standards relevant to data architecture, including schema (e.g., based on SQL and JSON), format (e.g., JSON and XML), and APIs (e.g., Google Search for the web).

Hardware and Software Infrastructure

Organizational research needs

Essential resources required to accomplish the objectives of research projects and RDM (e.g., centralized infrastructure, appropriate training, and support staff).

The physical and non-physical functional components that collectively form a foundation for conducting research and RDM.

Tools to support data-related processes

Items, e.g., instruments, methods, utility software, and APIs, that enable research.

Models that connect infrastructure to data processes and workflow

A detailed description or scaled representation of the relationships between data tasks and movement and the hardware and software components in an RDMI. [58]

Interoperability

The capability to seamlessly communicate, execute programs, or transfer data among various functional components in a manner that requires the user to have little or no knowledge of the unique characteristics of those components. [63]

Persistent instrument identifiers

Globally unique, persistent, and resolvable identifiers of operational scientific instruments enable research data to be persistently associated with such crucial metadata, helping to set data into context. The Research Data Alliance’s Persistent Identification of Instruments Working Group (PIDINST) developed a metadata schema, prototyped an implementation of the schema, and demonstrated the viability of the proposed solution in practice. [64]

Sustainability of data vis-à-vis obsolete infrastructure

Concerns regarding the ability to reproduce and reuse data if the hardware and software components become outdated or non-functional.

Security and privacy considerations

Security: the degree of protection of data from unauthorized access, denial of access, corruption, or theft provided by the hardware and software. Privacy: the practice of protecting and properly handling sensitive data, including personal, proprietary, and confidential data. [20]

Staff expertise and support staff

Personnel with the appropriate skills and knowledge to maintain and update the hardware and software infrastructure as needed, and personnel to interface with researchers using the infrastructure.

Research Data Standards

Requirements and needs

Criteria by which decisions are made regarding the type of research standard, i.e., broadly applicable or limited to a particular field of research.

Documents, including codes, specifications, recommended practices, classifications, test methods and guides, that describe how data should be stored or exchanged for the consistent collection and interoperability of that data across different systems, sources, and users. [65, 67]

Sources of standards/guidelines for data/metadata

Origins of accepted practices consisting of discrete, reusable components, e.g., data types, identifiers, schemas, and formats. Examples include the Dublin Core Metadata Initiative and Schema.org. [65]

Quality standards

Guidelines that provide sufficient information to allow all users to readily evaluate the degree of “fitness for purpose” of the data. Key data quality components include completeness, accuracy, integrity, consistency, and timeliness. [15, pg 26, 57]

Community-based standards/conventions

Community-based data and metadata standards are typically long-term endeavors with many different players and types of efforts. Such standards facilitate reuse of data, integrative analysis and comparison with other datasets, and linkage of data with other research products, such as scholarly material, algorithms, and software. [68]

Assessment

Goals/definition of success

Statement of project objectives; list of accomplishments demonstrating that these objectives were met.

Evaluation of the success of a research project against expectations set before the project has started.

Metrics for tracking use and impact measures, including reuse

Quantitative and qualitative indicators of positive influence or outcomes, e.g., number of citations of a dataset and anecdotal evidence of reuse of a dataset. [69]

Communication and Outreach

Methods to share and reuse data/metadata

Approaches to disseminate data/metadata and to facilitate reusability of data/metadata, e.g., use of open repositories and maximizing the FAIRness of data.

Engagement and interactions among groups and individuals working in similar research areas.

Allocation of credit to project team members

Properly documenting and recognizing each team member's contributions to a project. [70]

Promotion of data to communities of interest

Modes to communicate the existence and location of datasets to targeted groups, e.g., special-topic data publications and presentations at topical workshops.

Cross-institution cooperation

The process of working with other institutions or organizations on a shared activity (e.g., informal collaborations, formal partnerships, and agreements).

Requests for additional data from the research community

Solicitations of data contributions from partners and stakeholders on areas of mutual interest.

Access Control Associated with Data Sensitivity

Identification of responsible parties for access management

A determination of those individuals authorized to both prohibit and permit access to sensitive data.

Methods and requirements to limit the individuals or groups permitted to view or use protected data.

Ease of maintenance and implementation of records

The extent to which sensitive data can be kept up to date and made accessible to authorized individuals and groups.

Regulatory compliance

Efforts by organizations to ensure that they are aware of, and take steps to conform to, relevant laws, policies, and regulations concerning sensitive data (e.g., medical records). [71]

Sensitive data/PII

Data that need to be controlled due to certain risks. Personally Identifiable Information (PII) is any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. [72]

Limited disclosure, IP

Restricting release of data to specific legal circumstances and often requiring notification to the data provider. Intellectual Property (IP) refers to certain exclusive rights granted by law to the owner of, e.g., a novel data product. For IP, any agreement must include an assessment of what IP rights subsist in the data, who owns them, what exceptions or limitations apply, and any contractual rights or policies related to IP that should be considered within the data governance framework, including acquired and generated data as well as “background” (i.e., pre-existing) and “foreground” (i.e., from original research) IP. [26, 27, 73]

Licensing for reuse

Legal agreement that allows one party to use another party's data subject to certain conditions.

Table 3. Generate/Acquire lifecycle stage

Generate/Acquire: Topic

Subtopic

Definition

Data Types

Measurement

A quantity in various formats, including numerical, visual, and auditory.

Classifications or categories of data. [74]

Text file

A type of digital, non-executable file that contains letters, numbers, symbols and/or a combination of these without any special formatting (e.g., ASCII, EBCDIC). [75]

Computation, simulation

Computation: an act, process, or method of computing. Simulation: any research or development project wherein a model of some authentic phenomenon is created to mimic outcomes that happen in the natural world. [76, 77]

Source code

A set of instructions and statements written by a programmer using a computer programming language. This code is later translated into machine language by a compiler. [78]

Observation

A fact or occurrence often involving measurement with instruments. [49]

Survey

A list of questions aimed at extracting specific data from a particular group of people. [50]

Transaction

Data that describe an exchange or transfer of goods, services, or funds. [79]

Social media

Interactive technologies that facilitate the creation and sharing of information (i.e., data) through virtual communities and networks. [80]

Data Sources

In-house generation by researchers

Data created by researchers within an organization and at a physical location internal to the organization.

Description of circumstances whereby data are produced. Origin of data.

Remote generation by researchers

Data created by researchers within an organization through control of an instrument or device at a location other than the organization.

In-field generation by researchers

Data created by researchers within an organization at a physical location external to the organization, which may be a natural environment.

User facility generation by/for researchers

Data created by researchers or facility staff at a federally sponsored research facility available for external use to advance scientific or technical knowledge. [81]

Historical

Data generated or collected in the past, which may have uncertainties due to, e.g., age and loss of metadata.

Human-annotated

The process by which a person adds metadata or other information in different formats to data, such as labels or tags that describe the content or context of images, or that classify or extract relevant information from text. Such annotation allows AI and ML models to categorize the data and approve the execution of relevant tasks. [82]

Generated Experimental Data

Source of objects/subjects

Origin of items used in an experiment.

Data produced by automation or active intervention by a researcher to induce and measure changes or to create differences when a variable is altered. [83]

Characteristics of objects/subjects

Distinct features of items used in an experiment, e.g., appearance and properties.

Conditions of research study

Description of the external physical environment in which data were collected (e.g., temperature, atmosphere). Such conditions are types of metadata.

Specification of instruments and tools

Identification and documentation of measurement equipment and other items, e.g., software, methods, and materials, used in an experimental research study. Includes descriptions of the technical details and requirements of each item.

Parameters for instruments and tools

Variables or settings on an instrument or tool that are maintained and controlled during an experiment (e.g., laser intensity, gas flow rate, and rate of data collection).

Methods, protocols, and calibration

Techniques and procedures used in the generation of data.

Data/metadata capture methods

Techniques and procedures for collecting and recording information, for both short-term and long-term storage.

Provenance and capture methods

Techniques and procedures for collecting and recording the historical, attributed, and documented record of a data asset that contains details on its origin—where, when, how, and by whom it was generated/acquired/processed—and on all alterations to the data asset. [21, 41]

Reproducibility

The ability to replicate data using identical tools (e.g., documented metadata, code, methods, and instruments) employed previously by the original researchers or by other researchers, without the need for any additional information or communication with the original researchers. [84, 85]

Generated Computational Data

Input data/metadata

Information of any type that is entered manually or via an automated process into an instrument, computer, or other device.

Data produced by using calculations, models, simulations, or other methods. Can be produced manually or using a computer or other type of system or device. [76, 77]

Output data/metadata

Electronic data produced by an instrument, processor, computer, or other device.

Hardware

The physical components that make up a computer or electronic system and everything else involved that is physically tangible, including monitors, hard drives, memory, and the CPU. [86]

Parameters and conditions for computation

Hardware or software system requirements or configurations that are necessary for a hardware or software application to run smoothly and efficiently, e.g., operating system dependencies, compilers, and memory requirements. [87]

Versioning

The process of numbering different releases of entities, e.g., software, hardware, and documents, for the purposes of tracking and recording changes. This provides the ability to revert to a previous revision, which is critical for data traceability and data re-creation, tracking edits, and correcting errors. [88, 89]
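
The sketch below illustrates the two capabilities named in the versioning definition: numbering successive releases and reverting to an earlier one. The class `VersionedText` and its methods are hypothetical names used for illustration; real projects would normally rely on an established version control system rather than code like this.

```python
from typing import List


class VersionedText:
    """Keeps every numbered revision of a text so earlier ones can be restored."""

    def __init__(self, initial: str) -> None:
        self._revisions: List[str] = [initial]  # version 1 is stored at index 0

    @property
    def version(self) -> int:
        """Number of the latest revision."""
        return len(self._revisions)

    @property
    def content(self) -> str:
        """Content of the latest revision."""
        return self._revisions[-1]

    def commit(self, new_content: str) -> int:
        """Record a new revision and return its version number."""
        self._revisions.append(new_content)
        return self.version

    def revert_to(self, version: int) -> int:
        """Restore an earlier revision by recording it as a new revision."""
        if not 1 <= version <= self.version:
            raise ValueError(f"unknown version {version}")
        return self.commit(self._revisions[version - 1])


doc = VersionedText("temperature_K,value")
doc.commit("temperature_C,value")  # version 2, an erroneous edit
doc.revert_to(1)                   # version 3 restores the content of version 1
print(doc.version, doc.content)    # 3 temperature_K,value
```

Reverting adds a new revision instead of deleting the erroneous one, so the full sequence of changes stays available for traceability.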

Data/metadata capture methods

Techniques and procedures by which information is collected and recorded.

Provenance and capture methods

Techniques and procedures for collecting and recording the historical, attributed, and documented record of a data asset that contains details on its origin—where, when, how, and by whom it was generated/acquired/processed—and on all alterations to the data asset. [15, pg 24, 31]

Verification/validation of output data

Verification: the process of determining that a computational model accurately represents the underlying mathematical model and its solution. Validation: the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model. [90]

Qualitative Data

Nature of objects/subjects

Types and characteristics of entities which are being studied.

Data that are descriptive and concern phenomena which can be observed but not measured.

Methods and protocols

Techniques, standard operating procedures, sets of rules, and guidelines.

Metadata

Data about data, i.e., data that define and describe the characteristics of other data. Using a survey as an example, metadata include the questions in, and location of, the survey. [91]

Paradata

Data about the process by which data were collected. Formalized data on methodologies, processes, and quality associated with the production and assembly of statistical data. Using a survey as an example, paradata include the mode of the survey and responders' response times. Note that paradata are typically associated with social science disciplines; in physical and medical science disciplines, paradata would be included in metadata. [92, 93]

Data/metadata/paradata capture methods

Techniques and procedures for collecting and recording any type of data, either manually or via an automated process using an instrument, computer, or other device.

Acquired Data

From collaborators

Originating from other individuals or other organizations partnering with researchers in an organization.

Data used in a research study that were not generated by the researchers conducting the study.

From repositories

Originating from a destination designated for data storage. Operations of a repository include preservation, management, and provision of access for digital materials that may have different types and formats. [94]

From the literature

Originating from a publication.

Aggregated datasets from multiple sources

Data compiled from disparate studies that are organized and summarized so that conclusions can be drawn, and decisions made, from such data-rich collections.

Provenance

The historical, attributed, and documented record of a data asset that contains details on its origin—where, when, how, and by whom it was generated/acquired/processed—and on all alterations to the data asset. [21, 41]

Restrictions, fees, and usage agreements

Mechanisms that may limit the use of acquired data.

Critically Evaluated (CE) Data

Infrastructure to assure the greatest data integrity

A foundation composed of practices, processes, and procedures designed to produce data that are clean, traceable, and fit for purpose. NIST and KRISS are two institutions that produce critically evaluated data named Standard Reference Data. [95]

Numerical data that have undergone rigorous review and critique such that the integrity, reasonableness, and usability are optimized. [96]

Single researcher dataset

A group of data that originates from an individual researcher.

Aggregation of data evaluated by experts

The process by which data from disparate sources are compiled, reviewed, critiqued, and summarized by subject matter experts.

Reproducibility and uncertainty quantification

Reproducibility: the ability to replicate data using identical tools (e.g., documented metadata, code, methods, and instruments) employed previously by the original researchers or by other researchers without the need for any additional information or communication with the original researchers. Uncertainty quantification: assignment of a numerical value to a non-negative parameter characterizing the dispersion of the quantity values being attributed to a measurand. Critically evaluated data have high reproducibility and low uncertainty. [84, 85]

Intellectual property rights

Legally enforceable claims for owners of original ideas, inventions, and creative expressions. For intellectual property (IP), any agreement must include an assessment of what IP rights subsist in the data, who owns them, what exceptions or limitations apply, and any contractual rights or policies related to IP that should be considered within the data governance framework, including acquired and generated data as well as “background” (i.e., pre-existing) and “foreground” (i.e., from original research) IP. [26, 27, 97]

FAIR Principles

Data born FAIR

Data objects that comply with the FAIR principles when first generated or produced.

Findability, Accessibility, Interoperability, Reusability: four concise and measurable guidelines designed and broadly endorsed to support the reusability of data. Standards may be created that align with the FAIR principles but are not recognized standards.

Data made FAIR

Data objects that are transformed or changed in some manner so that they comply with the FAIR principles.

FAIR digital objects

Standardized, autonomous, and persistent entities which contain the information needed about different kinds of digital objects (e.g., data, metadata, documents, software, and semantic assertions), to enable both humans and machines to Find, Access, Interoperate, and Reuse (FAIR) these digital objects in highly efficient and cost-effective ways. [98]

FAIR on a continuous scale

Recognition that there is a degree of FAIRness that ranges from fully FAIR to not FAIR, that may be represented on a numerical scale.

Guidelines/methodologies for each aspect: F, A, I, R

Means, e.g., standards, best practices, protocols, and software, by which the findability, accessibility, interoperability, and reusability of data may be improved.

Tools to capture FAIR provenance

Techniques and procedures for collecting and recording the collective information on the FAIRness of a data asset, from its origin to the present.

FAIR instruments and tools

Equipment, devices, methods, standards, and other tools that enable the findability, accessibility, interoperability, and reusability of data (e.g., SmartAPI). [99]

Not FAIR data

Data that are not findable, accessible, interoperable, and reusable to any degree for various reasons, e.g., obtained using old or obsolete instruments or software.

Community-Based Standards

General vs. domain-specific

Broadly applicable as opposed to limited to a particular field or area.

Documents, including codes, specifications, recommended practices, classifications, test methods, and guides, that are developed by a group with common interests.

Standards development organizations vs. community consensus

Formal, recognized, standards bodies (e.g., ISO and ASTM International), as opposed to informal, self-assembled groups of individuals or institutions with shared interests (e.g., professional societies).

Data format and file structure

Data format: the organization of data according to preset specifications. File structure: the manner by which data and code are organized within a file with the goal of reusability. In the context of standards, the syntax, encoding, and file format or media type for storing or transmitting data (e.g., CSV and JSON). [65, 100–102]

Metadata format and file structure

Metadata format: the organization of metadata according to preset specifications. File structure: the manner by which metadata are organized within a file. In the context of standards, a metadata standard is a high-level document which establishes a common way of structuring and understanding data and includes principles and implementation guidance for utilizing the standard. See the RDA Metadata Standards Catalog. [100, 101, 103, 104]

Vocabulary and ontology

Vocabulary: a compendium of standardized terms with consistent semantic definitions. Ontology: a description of data structure (e.g., classes, properties, and relationships) in a domain of knowledge. [65, 105]

Interoperability

The capability to seamlessly communicate, execute programs, or transfer data among various functional components that requires the user to have little or no knowledge of the unique characteristics of those components. Interoperability standards enable the operational processes underlying exchange and sharing of information between different systems to ensure all digital research outputs are Findable, Accessible, Interoperable and Reusable, according to the FAIR principles. [63, 106]

Acquisition Software

Computer programs that enable the collection and procurement of data.

Open source vs. proprietary

Programs freely distributed with the source code that researchers can modify and subsequently redistribute modified versions thereof vs. programs that are copyrighted and bear limits against use, distribution and modification that are imposed by their publisher, vendor, or developer. Such programs remain the property of their owner/creator and are used by end-users/organizations under predefined conditions. [107, 108]

LIMS

A laboratory information management system (LIMS) is a software system developed to support laboratory operations (e.g., track specimens and workflows, and collect, annotate, and aggregate datasets). [59]

Instrument control

Software for configuring the operating parameters of an instrument.

Electronic laboratory notebook

A software tool that digitally replicates paper laboratory notebooks traditionally used in the sciences to record information on observational, experimental, and computational studies. [109]

Audio and video recording

A digital record used to store and preserve the audible and/or visual components of an event.

Table 4. Process/Analyze lifecycle stage

Process/Analyze: Topic

Subtopic

Definition

Types of Processed Data

Tables, spreadsheets

Tables: numerical and textual information arranged in rows and columns. Spreadsheets: computer programs that can capture, display and manipulate data arranged in rows and columns.

Classifications or categories of data. [74]

Charts, graphs

Visual representations of datasets, e.g., diagrams, pictures, and graphs. Graphical charts show mathematical relationships between varied groups of data. [110]

Maps, vectors, images

Representations of the relationships between variables, i.e., quantities, phenomena, or entities. Maps: diagrammatic depictions of the association of two or three variables. Vectors: linear depictions of two independent variables. Images: visual representations of an object in two or three dimensions.

Instrument outputs

Raw electronic data generated by a piece of equipment, device, or other tool before any human action on the data and before any processing of the data. [111]

Dynamic data

Data which are changing frequently and at asynchronous moments. Data that may change after they are recorded and have to be continually updated. [112, 113]

Datasets from models and simulations

Organized collections of data generated by models (i.e., representations, patterns, or mathematical descriptions that can help scientists replicate a system, process, or research result) and simulations (i.e., creation of a model of some authentic phenomenon to mimic outcomes that happen in the natural world). [52, 76, 114, 115]

Structured data

Data whose elements have been organized (e.g., hierarchical) into a consistent format and data structure within a defined data model such that the elements can be easily addressed, organized, and accessed in various combinations to make better use of the information (e.g., a relational database). [116]

Preparation and Pre-Processing Methods

Data cleaning

The process of detecting and correcting corrupt or inaccurate records from a dataset. This process involves identifying, replacing, modifying, or deleting incomplete, incorrect, inaccurate, inconsistent, irrelevant, and improperly formatted data. [117]

Techniques by which raw data are transformed into complete datasets with consistent formatting such that data analysis can subsequently be performed. [119]

De-identification, anonymization

A process by which personal data are irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party. [118]

Amputation and imputation

Amputation: a process whereby some valid data points are selectively deleted from a complete dataset. Imputation: a process used to determine and assign replacement values for missing, invalid, or inconsistent data. [120, 121]
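
As an illustration of the two terms defined above, the following minimal Python sketch amputates one point from a complete list and then imputes it. Mean imputation is only one of many possible strategies and is chosen here for brevity; the function names `amputate` and `impute_mean`, and the use of `None` to mark a missing value, are assumptions made for this example and do not come from the framework.

```python
from statistics import mean


def amputate(values, drop_indices):
    """Selectively delete valid data points by marking them as missing (None)."""
    drop = set(drop_indices)
    return [None if i in drop else v for i, v in enumerate(values)]


def impute_mean(values):
    """Assign the mean of the observed entries to every missing (None) entry."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


complete = [2.0, 4.0, 6.0, 8.0]
with_gaps = amputate(complete, [1])   # the value at index 1 is removed
filled = impute_mean(with_gaps)       # the gap is filled with the mean of 2, 6, 8
```

Comparing `filled` with `complete` shows how far the imputed value lies from the value that was deleted, which is a common reason for amputating a complete dataset in the first place.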

Aggregation

A process used to combine datasets, typically taken collectively or in the form of a summary. Integration of data by aggregation requires data interoperability, harmonization, and mapping. [122]

Validation and verification

Validation: the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model. Verification: the process of determining that a computational model accurately represents the underlying mathematical model and its solution. [90, 123]

Curation

The ongoing processing and maintenance of data throughout their lifecycle to ensure long-term accessibility, sharing, and preservation. Data curation is composed of research data management and digital preservation and involves processes such as the addition of metadata to make data more findable and understandable, ingestion of data into a repository, validation of file checksums and file fixity checks, and other tasks for organizing, cleaning, describing, enhancing, storing, and preserving data. [124]
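
The checksum validation and fixity checks mentioned above can be sketched as follows. This is a minimal example that assumes SHA-256 as the checksum algorithm (the definition does not prescribe one); the helper names `sha256_of` and `fixity_ok` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded at ingest."""
    return sha256_of(path) == recorded_checksum.strip().lower()
```

In practice the checksum would be recorded when a file is ingested into a repository and recomputed periodically; a mismatch indicates that the file has changed or become corrupted.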

Normalization of metadata

The adjustment of metadata elements into standard formats. [125]

Analysis Methods

Manual

Collection, organization, and transformation of data by a human without using a machine or any other tool. [126]

Statistical and/or logical techniques that are systematically applied to describe and illustrate, condense and recap, and evaluate and interpret data, with the goal of producing new, meaningful information. [74]

Exploratory

Techniques that typically use visual tools to, e.g., determine the main characteristics of datasets, find relationships among datasets or variables that may have been unknown or overlooked, and discern trends or differences among datasets. [126, 127]

Descriptive

Techniques for answering the question, "What happened?", e.g., identifying trends and relationships using current and historical (past) data. [128]

Diagnostic

Techniques for answering the question, "Why did this happen?", e.g., determining the causes of trends and correlations among datasets or variables. [129]

Evaluative

Techniques for a systematic determination of merit, worth, value, or significance of datasets, e.g., relevance to the project objectives. [130]

Predictive

Techniques for answering the question, "What might happen in the future?", e.g., making assumptions about the future using historical data, either manually or with machine-learning algorithms. [131]

Prescriptive

Techniques for answering the question, "What should we do next?", e.g., informing an optimal course of action, decisions and strategies, often via machine learning. [132]

Correlational

Techniques that provide a statistical measure indicating how strongly two variables are related and whether that relationship is positive (e.g., when one variable increases, the other also increases) or negative (e.g., when one variable increases, the other decreases). [133–135]
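
One widely used measure of this kind is the Pearson correlation coefficient, sketched below in plain Python. The choice of Pearson's coefficient is an assumption made for illustration (other measures, such as rank correlations, also exist), and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(spread_x * spread_y)


positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # variables rise together
negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # one rises as the other falls
```

The sign of the result gives the direction of the relationship and its magnitude, between 0 and 1, gives the strength.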

Statistical

Techniques whereby data are interpreted to uncover patterns and trends. The five basic statistical techniques are mean, standard deviation, regression, hypothesis testing, and sample size determination. [136, 137]

Automated, autonomous

Techniques that require no human guidance or direct intervention and are based solely on machines, e.g., self-driving vehicles. [138]

Modeling

Visualization

Techniques for the representation of data (e.g., graphs, images, and diagrams). Transformation of numerical data into a visual or pictorial context in order to assist users in better understanding what the data mean. [122, 139]

A class of computational methods whereby a representation, pattern, or mathematical description is used to replicate a system, process, or research result. [52]

ML, AI

Machine learning (ML) is a methodology that uses statistics and mathematical models to detect patterns in historical data and learning algorithms to make predictions about new data. Artificial intelligence (AI) is a field of study in which computerized systems can learn, solve problems, and autonomously achieve goals under varying (and sometimes uncertain) conditions. ML is a subset of AI strategies. [140, 141]

Iterative model fitting

A technique whereby the parameters of a model are adjusted in repeated cycles to improve accuracy of the computation. [142]
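
The repeated-cycle adjustment described above can be sketched with gradient descent on a straight-line model. Gradient descent, the squared-error objective, and the names and default values below are all assumptions made for this example; the definition does not prescribe a particular fitting algorithm.

```python
def fit_line(xs, ys, learning_rate=0.05, cycles=2000):
    """Fit y = slope * x + intercept by adjusting both parameters in repeated cycles."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(cycles):
        # Residuals of the current model on every data point.
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to each parameter.
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2.0 / n * sum(residuals)
        # Move each parameter a small step against its gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Points generated from y = 2x + 1, used here as a check on the fit.
slope, intercept = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

The learning rate and number of cycles are tuned to this small dataset; other data may need different values or a stopping rule based on the change in error between cycles.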

Integrated development environment

An application that facilitates application development, typically via a graphical user interface (GUI)-based workbench designed to build software applications in combination with all the required tools, e.g., Jupyter and RStudio. Common features include, e.g., debugging, version control, and data structure browsing. [143]

Metadata

Types of metadata

The three main categories or classifications of metadata are descriptive, structural, and administrative. [144]

Data about data, i.e., data that define and describe the characteristics of other data. [91]

Responsible parties

Individuals whose duties or job functions include the management of metadata, e.g., data owner or metadata steward. [145]

Specification of metadata standards

Identification and description of those metadata standards categorized as four types: format/technical interchange, structure, content, and value. Standards include recommended practices, classifications, test methods, and guides. [146]

Linked data structure

A deliberate design for the organization of data (structure) wherein information (metadata) is brought together from different sources (linked) to create a new, richer dataset. [147]

Persistent identifiers

A unique and long-lasting reference that allows for continued access to an entity (e.g., document, dataset, instrument, webpage, contributor, and organization). A persistent identifier (PID) may be connected to a set of metadata describing an object rather than to the object itself. Examples of PIDs include DOI, ORCID, ARK, ROR, PIDINST, and Handles. [148, 149]
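
As a small illustration for one of the PID types listed above, the sketch below turns a DOI into a resolvable link through the doi.org resolver. The helper name `doi_url` and the set of prefixes it accepts are assumptions made for this example; the DOI used is the one assigned to this publication.

```python
def doi_url(doi):
    """Return the https://doi.org/ link for a DOI given with or without a prefix."""
    cleaned = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # DOI names begin with the directory indicator "10.".
    if not cleaned.startswith("10."):
        raise ValueError("not a DOI: " + doi)
    return "https://doi.org/" + cleaned


link = doi_url("doi:10.6028/NIST.SP.1500-18r2")
```

Storing the bare DOI and generating the link on demand keeps the identifier independent of any particular URL form.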

Provenance

Original authoritative copy

The single, distinct, absolute version of a dataset from the originating source that is unique, identifiable, and unalterable without detection. It should be sufficient to allow a third party to reproduce the results of the research. [150]

The historical, attributed, and documented record of a data asset that contains details on its origin—where, when, how, and by whom it was generated/acquired/processed—and on all alterations to the data asset. [21, 41]

Version identification

For a specific time, definitive determination of a previous dataset made possible by comprehensive information (e.g., raw data, computer code, software, and documentation) on that dataset. Such an ability to revert to a previous version is critical for data traceability, tracking edits, and correcting mistakes. [88]

Derivative product

Any data, publication, illustration or visualization, or other work that rearranges, presents, or otherwise makes use of an existing dataset. [151]

Aggregation

A process used to combine datasets, typically resulting in a collection or summary. [122]

Subset

A portion of a dataset that is referentially intact. [152]

Timestamp

Temporal information regarding an event that is recorded by a computer and then stored as a log or metadata. [153]
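
A minimal sketch of recording a timestamp alongside an event is shown below. It assumes UTC and the ISO 8601 text format, which are common but not required by the definition; `utc_timestamp` and `log_event` are hypothetical helper names.

```python
from datetime import datetime, timezone


def utc_timestamp():
    """Return the current time in UTC as an ISO 8601 string with second precision."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def log_event(log, event):
    """Append an event with its timestamp to a log (a plain list) and return the entry."""
    entry = {"event": event, "timestamp": utc_timestamp()}
    log.append(entry)
    return entry


processing_log = []
entry = log_event(processing_log, "dataset normalized")
```

Recording the time zone explicitly avoids ambiguity when logs from different sites are later combined.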

CRediT taxonomy

Contributor Roles Taxonomy (CRediT) consists of a high-level taxonomy, including 14 roles, that can be used to represent the roles typically played by contributors to research outputs. [154]

Software

Commercial vs. custom

Commercial software is any software or program designed and developed for licensing or sale to end-users or for serving a commercial purpose (e.g., off-the-shelf programs and games). Custom software is made for an individual or organization and performs tasks specific to their needs. [155, 156]

A set of instructions, data, or programs used to operate computers and execute specific tasks. [157]

Open source vs. proprietary

Open source typically refers to software that is freely distributed with source code that can be modified by users, and modified versions may be redistributed. Proprietary typically refers to software that is copyrighted and bears limits against use, distribution, and modification that are imposed by its publisher, vendor, or developer. The software remains the property of its owner/creator and is used by end-users/organizations under predefined conditions. [107, 108]

Aggregation tools

Software or programs that enable the combination of datasets. [122]

Surveying tools

Software or programs that aid in the gathering of responses to questions aimed at extracting specific data from a particular group. [50]

Statistical tools

Software or programs used in statistics, i.e., the collection, organization, analysis, interpretation, and presentation of masses of data. [158]

Calculation and analysis tools

Software or programs that produce knowledge from organized data to draw conclusions, highlight useful information, and support decision-making.

APIs

An Application Programming Interface (API) is a set of protocols, routines, functions and/or commands that programmers use to facilitate interactions between distinct software services. [159]

Database management tools

Software or programs that aggregate diverse data into a database or other consistent resource, handle different types of queries, provide security, and perform other functions. [160]

Testing and validation tools

Methods to determine if software or programs perform the function for which they were designed. Software or programs that help ensure that the data sent to connected applications are complete, accurate, secure, and consistent. [161]

Documentation

Written information that describes the software product to the people who develop, deploy and use it, including technical manuals and online material, such as online versions of manuals and help capabilities. The term is sometimes used to refer to source information about the product discussed in design documentation, code comments, white papers and session notes. [162]

Reproducibility and uncertainty quantification

Reproducibility: the ability to replicate data using identical tools (e.g., documented metadata, code, methods, and instruments) employed previously by the original researchers or by other researchers without the need for any additional information or communication with the original researchers. Uncertainty quantification: assignment of a numerical value to a non-negative parameter characterizing the dispersion of the quantity values being attributed to a measurand. [84, 85]
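
To make "a non-negative parameter characterizing the dispersion" concrete, the sketch below computes one common such parameter: the standard uncertainty of the mean of repeated readings, i.e., the sample standard deviation divided by the square root of the number of readings. It assumes independent repeated readings of the same measurand and covers only this statistical component of uncertainty; the function name and the readings are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_standard_uncertainty(readings):
    """Return (mean, standard uncertainty of the mean) for repeated readings."""
    if len(readings) < 2:
        raise ValueError("at least two readings are needed")
    # Sample standard deviation divided by sqrt(n) gives the uncertainty of the mean.
    return mean(readings), stdev(readings) / sqrt(len(readings))


# Hypothetical repeated readings of the same quantity.
value, uncertainty = mean_with_standard_uncertainty([10.0, 10.2, 9.8, 10.0])
```

A full uncertainty budget would also include components that cannot be estimated from repeated readings, such as calibration or instrument resolution.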

Versioning and maintenance

The process of numbering different releases of a software product based on the date of release for both internal use and release designation. This process allows programmers to know when changes have been made and track changes enforced in the software. At the same time, it enables potential customers to be acquainted with new releases and recognize the updated versions. [89]

Systems resilience and adaptability

Resilience: the ability of a software system to continue to operate under adverse conditions while maintaining essential operational capabilities, and to recover to an effective operational state in an acceptable time frame. Adaptability: the ability of a software system to tolerate changes in its environment without external intervention. [163, 164]

Source code repository

A storage location for source code (the fundamental component of a computer program) that holds code, makes code available for use, and organizes code in a logical manner. [165, 166]

Security and software updates

Patch, upgrade, or other modification to code that corrects security and/or functionality problems in software. [167]

Standards, protocols, and interfaces

Standards: codes, programs, and associated documentation that describe how data should be stored or exchanged for the consistent collection and interoperability of that data across different systems, sources, and users. Protocols: sets of rules and guidelines. Interfaces: programs that allow a user to interact with computers in person or over a network, or the controls used in a program that allow the user to interact with the program. [168–170]

Workflow and Middleware

LIMS

A laboratory information management system (LIMS) is a software system developed to support laboratory operation (e.g., track specimens, collect and annotate data and workflows, and aggregate datasets). [59]

Workflow is a depiction of a sequence of connected operations or "steps" that illustrates how data flow through an RDMI. A workflow includes tasks, people involved, tools, input, and output for each step. Middleware is a software layer or "glue" situated between applications and operating systems that makes it easier for software developers to perform communication and input/output, so they can focus on the specific purpose of their application. [173–175]

Laboratory notebook

A complete, detailed record of the hardware, software, procedures, materials, observations, and relevant thought processes for the research which would enable the work and resulting data to be reproducible. This typically includes an explanation of why the research was done, including any necessary background and references, how the research was performed, the actual data (raw and processed), and where the data are stored. Laboratory notebooks may be paper or electronic. [171]

Tools for automated metadata capture

Software, hardware, and methods used to collect and record data about data without the need for manual instruction.

Anomaly detection and correction tools

Software, hardware, and methods used to identify items (e.g., operations, observations, events, and results) that do not conform to the expected pattern or result (i.e., anomaly detection) and to restore such items to the expected pattern or result (i.e., anomaly correction). [172]

Collaboration tools

Software and/or software systems that enable communication and sharing of documents, data, analyses, and/or visualizations amongst individuals who are not co-located.

Decisions regarding the need for additional data

Conclusions by researchers that more data are needed to accomplish project goals.

Process monitoring and evaluation

Periodic tracking of the operation and results of a workflow component by systematically gathering and analyzing data to assure that the component is functioning properly. [176]

Containerization

Operating system-level virtualization or application-level virtualization over multiple network resources so that software applications can run in isolated user spaces called containers in any cloud or non-cloud environment, regardless of type or vendor. [177]

Reusable workflow component

A discrete piece of software that can be compiled and packaged as an activity and reused in multiple processes, thereby reducing duplication and enabling sharing of the software with others. [178]

Microservices

An approach to software development in which a large application is built from modular software components (i.e., microservices), each of which does one defined job (e.g., messaging). [179]

Distributed workflow across sites

Computerized information system that is responsible for scheduling and synchronizing the various tasks within the workflow across physical or virtual locations, in accordance with specified task dependencies, and for sending each task to the respective processing entity. [180]

Comprehensive report generation

The production of a single document which includes all the information needed to reproduce a dataset, including, e.g., methods, format standards, and software versions.

Hardware

Compute requirements

Specifications of the raw processing power of a computer to meet the needs for activities, applications, or workloads. Such power may be characterized as the rate at which operations are performed, e.g., million instructions per second (MIPS). [181, 182]

The physical components that make up a computer or electronic system and everything else involved that is physically tangible such as peripheral devices. [86]

Storage requirements

Specifications and needs for devices and components that store data on a long-term basis for later uses and access (e.g., hard disks and network-attached storage devices). In contrast to storage, memory is the short-term location for temporary data storage. [183]

Network requirements

Network capability is characterized by stability of the signal, throughput (transfer rate of data from a source system to a destination system), and bandwidth (the amount of data that can be transferred per second, in megabits/sec). [184]
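
Throughput figures of this kind translate directly into transfer-time estimates, as in the minimal sketch below. It assumes decimal units (1 megabit = 1,000,000 bits), a constant sustained throughput, and no protocol overhead; `transfer_seconds` is a hypothetical helper name.

```python
def transfer_seconds(size_bytes, throughput_megabits_per_sec):
    """Estimate the time to move size_bytes at a sustained throughput in megabits/sec."""
    if throughput_megabits_per_sec <= 0:
        raise ValueError("throughput must be positive")
    size_bits = size_bytes * 8
    return size_bits / (throughput_megabits_per_sec * 1_000_000)


# A 125 MB dataset over a link sustaining 100 megabits/sec.
estimate = transfer_seconds(125_000_000, 100)
```

Real transfers are usually slower than this estimate because sustained throughput is typically below the nominal bandwidth of the link.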

Accelerator requirements

Specifications and needs for hardware devices designed to improve the overall performance of the computer. Hardware acceleration is a process where applications offload certain computing tasks to specialized hardware components within the system, enabling greater performance and efficiency. [185, 186]

Table 5. Share/Use/Reuse lifecycle stage

Share/Use/Reuse: Topic

Subtopic

Definition

Publishing

Repository

A broad term that refers to a designated location where a collection of digital objects is stored in an organized manner such that the collection is findable, searchable, accessible, and reusable. Types of repositories include domain-specific (e.g., discipline or subject matter); generalist (a variety of data types, format, and content); and institutional (i.e., within an organization). [94, 187, 188]

Public disclosure of research datasets and supporting data objects, e.g., associated metadata and software code, in a manner such that the datasets are findable and reusable for others for future research. Published datasets ideally have a persistent identifier. [190]

Data paper

A publication that contains datasets, without having to be at the stage of presenting further analysis and conclusions as in a traditional research paper. [189]

Software

A set of instructions, data, or programs used to operate computers and execute specific tasks. [157]

Updates to datasets and new software versions

To datasets: the functional process of renewing information already contained in a database or stored elsewhere that results in the creation of a new record and may result in storage of existing data as history. To software: patch, upgrade, or other modification to code that corrects functionality problems in software. [167, 191]

Data linking

The process of collating and cross-referencing data from different sources to create a more valuable and meaningful dataset. [192]

Persistent identifier

A long-lasting and unique reference to a digital object of various types (e.g., document, dataset, and webpage). Persistent identifiers (PIDs) are labels that locate, identify, and share information about digital objects. A PID may be connected to a set of metadata describing an object rather than to the object itself. [148, 149]

Metadata

Data about data, i.e., data that define and describe the characteristics of other data. [91]

Integrity of data

The reliability and trustworthiness of data throughout their lifecycle. The assurance that a digital object is uncorrupted and can only be accessed or modified by those authorized to do so. [74, 193]

Quality measures and assessment vis-à-vis fit for purpose

The degree to which a dataset meets the requirements for its planned usage as determined by an evaluation of quality metrics (e.g., accuracy, completeness, consistency, and timeliness). [194]

Peer review of datasets and metadata

An editorial process prior to publication of a dataset whereby people with a similar degree of expertise and experience as the author review and provide input on the integrity and quality of the dataset.

Reference data/digital objects in journal articles

Journals have different guidelines concerning the publication of digital objects, e.g., raw data and software, that accompany a traditional article. Examples of these guidelines are depositing data in a relevant repository, citing a dataset by its PID, and linking the dataset to the article. [195]

Curation

The ongoing processing and maintenance of data throughout their lifecycle to ensure long-term accessibility, sharing, and preservation. Data curation is composed of research data management and digital preservation and involves processes such as adding metadata to make data more findable and understandable, ingesting data into a repository, validating file checksums and file fixity checks, and other tasks for organizing, cleaning, describing, enhancing, storing, and preserving data. [124]

Publisher agreements and policies

Legal documents that are used to dictate when and how work is published and thereby protect an author’s intellectual property from unauthorized use or reproduction. Open access agreements support individual authors to publish open access data at no cost to themselves. Publisher policies are set by the publisher and include, e.g., copyright and licensing, data privacy, and rights and permissions. [196–198]

Incentives for data publishing

Staff recognition and rewards for widespread dissemination of research data.

Mitigation of disincentives for data publishing

Practices to remove or reduce barriers that limit dissemination of data (e.g., misinterpretation and misuse of data by others, and lack of recognition and effort for sharing).

Modes of Dissemination

Means by which journal articles, datasets, and other data objects are publicly released.

Traditional journal article

A scholarly manuscript submitted to a journal that undergoes a peer review process, an editing and copy-editing process, and finally distribution by publishers able to print and make high-quality scholarly works available to the world. Such manuscripts typically contain analysis and conclusions, but not digital data objects, e.g., raw data and software. [199]

Supplementary material

Peer-reviewed material directly relevant to the conclusions of a manuscript that cannot be included in the printed version for reasons of space or medium (e.g., video clips or sound files). [200]

On request

Making data available in response to queries typically sent by email. The requester may be required to complete a form, e.g., a data release application agreement. [201]

Data landing page

A standalone web page that a person accesses after clicking on a link from an email, ad, or other digital location. For a dataset, such a web page typically includes a narrative description of the dataset and files or links to files pertaining to the dataset, e.g., the dataset itself and the software used to generate the dataset. [202]

Workflow

A depiction of a sequence of connected operations or steps that illustrates how data flows through a research data management infrastructure. A workflow includes tasks, people involved, tools (e.g., hardware and software), input, and output for each step. [173]

Mainstream media

Traditional means of communication, such as newspapers, television, and radio, that influence large numbers of people. [203]

Social media

A catch-all term for a variety of internet applications that allow users to create content and interact with each other, e.g., Twitter, Instagram, Facebook, and LinkedIn. [204]

Attribution

Acknowledgement of the use of an individual's published articles, data, or other data objects.

Citation metrics

Measures based on the number of times a single entity (e.g., article and dataset) published by a researcher is mentioned in the published work of other authors. Indicator of the quality or importance of a published entity. Citation data are available from citation databases, discipline-specific databases, and through an emerging range of alternative metrics. [205]

Citation impact

Quantitative and qualitative tools and methods to measure the impact of an individual's collective work. Quantitative tools include citation analysis (counting the number of times other authors mention a researcher's published works); the impact factors (IFs) of the journals in which a researcher has published their work (IF is the frequency with which the average article in a journal has been cited in a particular year); and the h-Index for a researcher, which is based on the set of the researcher's most cited papers and the number of citations those papers have received in other authors' publications. Qualitative methods to measure impact include anecdotal evidence. [206, 207]

Dataset citation

The practice of referencing data products used in research (e.g., a DOI or key descriptive information about the data, such as the title, source, and responsible parties). Data citation, like the citation of other evidence and sources, is good research practice and is part of the scholarly ecosystem supporting data reuse. (See the Joint Declaration of Data Citation Principles.) [208–210]

Provenance

The historical, attributed, and documented record of a data asset that contains details on its origin—where, when, how, and by whom it was generated/acquired/processed—and on all alterations to the data asset. [21, 41]

Author identity management

Use of a persistent, unique, digital researcher identifier such as ORCID to, e.g., track the scholarly outputs of a researcher, assign appropriate author credit, and eliminate author name ambiguity. [211]

Use of persistent identifiers

The practice of assigning a unique and long-lasting reference that allows for continued access to a data asset. [148, 149]

Versioning

The process of numbering different releases of a data asset (e.g., a software program and database); the use and management of multiple versions of a document. Version control allows for the ability to revert to a previous revision, which is critical for data traceability, tracking edits, and correcting mistakes. [88, 89, 157]

Modes of Sharing

Methods whereby datasets and other digital objects are publicly or privately distributed or are accessible to others upon request.

Standardized formats

The organization of information according to preset specifications that are agreed upon by formal standards bodies or informal community groups.

Interoperability tools

Methods that provide the capability to seamlessly communicate, execute programs, or transfer data among various functional components in a useful and meaningful manner that requires the user to have little or no knowledge of the unique characteristics of those components. [63]

Discovery platforms

Software systems that use metadata to identify and recommend sources of data or other digital objects. [212]

Catalogs

Completely organized services that enable any user, e.g., analysts, data scientists, and developers, to discover, explore, and use data assets. [213]

Registries of repositories

Databases containing information about trusted repositories that are provided by the repository managers and are useful for human and machine users, e.g., the Re3data Repository Registry and the NIST Materials Resource Registry. [214–216]

Access

The ability of a user to view and retrieve data and other digital objects stored within a database or other repository. Users who have data access can store, retrieve, move or manipulate data, which can be retained on a wide range of hard drives and external devices. [217]

Internal access

The ability of individuals in an organization to view and retrieve data and other digital objects that were generated, collected, or processed by an individual or group in the same organization.

External access

The ability of individuals in organizations other than the organization that generated, collected, or processed the data and other digital objects to view and retrieve such digital resources.

Programmatic access

The ability of a user to view and retrieve data made possible by an Application Programming Interface (API), which is a set of protocols, routines, functions and/or commands that programmers use to facilitate interaction between distinct software services. [159]

Virtual and physical enclaves

Secure networks through which confidential data, such as personally identifiable information from census data, can be stored and disseminated. In a virtual data enclave, a researcher can access data from their own computer but cannot download or remove the data from the remote server. Higher security data can be accessed through a physical data enclave wherein a researcher is required to access data from a monitored room where the data are stored on non-networked computers. [218]

Access vs. visiting

Data visiting is an approach whereby sensitive data stays under the control of the owner and consumers (e.g., analysts or machine learning algorithms) are permitted to work with the data on location. With data access, users can store, retrieve, move, or manipulate stored data. [219]

Availability statement

A declaration letting a user know where and how to access data that support the results and analysis of a published study. A declaration may include links to publicly accessible datasets that were analyzed or generated during the study, descriptions of what data are available and/or information on how to access data that are not publicly available. [220]

Mitigation of barriers and economic constraints

Practices that reduce or eliminate programmatic and administrative constraints and transactional costs of accessing data.

Legal and Licenses

Juridical and regulatory issues pertaining to research data.

Ownership

The act of having legal rights and complete control over data assets. Ownership defines and provides information about the rightful owner of data assets and the acquisition, use and distribution policy implemented by the data owner. [221]

Encouragement and support for sharing, use, and reuse

Incentives and human and infrastructural resources that increase the quantity and quality of data assets for access and dissemination.

Indigenous data rights

Indigenous data sovereignty (IDS) refers to the right of Indigenous peoples to govern the collection, ownership, and application of data about Indigenous communities, peoples, lands, and resources. IDS encompasses data, information, and knowledge about Indigenous individuals, collectives, entities, lifeways, cultures, lands, and resources. [34]

Intellectual property rights/restrictions

Intellectual property (IP) is something of value (an asset) that is created from an original idea, invention, or creative expression. IP rights are legally enforceable claims for owners of such items, including data products (e.g., software). An IP agreement must include an assessment of what IP rights subsist in the data, who owns them, what exceptions or limitations apply, and any contractual rights or policies related to IP that should be considered within the data governance framework, including acquired and generated data as well as “background” (i.e., pre-existing) and “foreground” (i.e., from original research) IP. [24–27, 222]

Usage agreements/terms/licenses and required permissions

Usage agreements: legally binding contracts between an originator of a digital object and a user of the object that spell out the rights and responsibilities of all involved parties. User licenses: written contracts that give a user permission to work on another party's digital object under a certain set of conditions and typically require that the user pay a royalty fee. [223, 224]

Data sharing and licensing agreements

Sharing agreements: formal contracts that detail what data are being shared and the appropriate use of the data and include provisions concerning access and dissemination. Licensing agreements: documents that describe what kind of data are being shared with a user and clearly state the purpose and duration of access being provided to the user along with restrictions and security protocols that the user of the data must follow. [24, 25]

Service-level agreements

Contracts between two parties that define and measure the level of service a data provider will deliver to a user. The agreements aim to define expectations of the level of service and quality between data providers and users. [225]

Terms of service

Legal agreements between a data service provider and a user that detail the set of rules and regulations a provider attaches to a software service or web-delivered product. [226]

Standardized, machine-actionable license documents

Written contracts in a common, agreed-upon form that can be read, understood, and implemented by a computer. Such contracts give a user permission to use a creator's digital object under a certain set of conditions.

Citation requirements

References to data and other digital objects that are mandated by a data provider, formal agreement, or publishing entity.

Levels of Protection

Classification scheme based on potential harm resulting from unauthorized access, disclosure, loss of privacy, compromised integrity, or violation of external obligations. [230]

Unclassified but sensitive information

A designation of information (data) in the US federal government that is not classified for national security reasons, but that warrants or requires administrative control and protection from public or other unauthorized disclosure for other reasons. Personally Identifiable Information (PII), e.g., an individual's birthdate, address, and phone number, and Business Identifiable Information (BII), e.g., trade secrets and financial information, fit this designation. The US government uses the term “controlled unclassified information (CUI).” [72, 227–229]

Security classification

A term typically associated with US federal government national security information. NIST has developed a broader document that addresses security controls, defined as the safeguards or countermeasures employed within a system or an organization to protect the confidentiality, integrity, and availability of the system and its information and to manage information security risk. [231, 232]

Protection of limited data/secure platforms/enclaves

Limited data: in healthcare, a set of identifiable healthcare information that the HIPAA Privacy Rule permits covered entities to share with certain entities for research purposes if certain conditions are met. Data security platform: aggregates data protection requirements across data types, storage silos, and ecosystems to create an organization-wide data security solution. Secure data enclave: a system that allows data owners to control data access and ensure data security while facilitating approved uses of data by other parties. [233–235]

Constraints and restrictions on data use and sharing

Technical, administrative, or legal limitations on the use and sharing of data.

Anonymization

A process of preserving private or confidential information by deleting or encoding identifiers that link individuals and stored data. [236]

Architectures for Application, Use, and Reuse

The fundamental structure of an organization's research data management (RDM) system embodied in its components, their relationships to each other and to the environment, and the principles guiding its design and evolution. Such a structure should enable a user to capitalize on an organization's data. [60, 61]

Extensibility across communities, including machine-based interactions

A measure of the ability to expand an RDM architecture to enable interactions with a broad group of stakeholders and types of equipment, achieved by adding new functionality or modifying existing functionality. [237]

Capture of insights from ML and use of these to improve datasets for future AI applications

Recording and retaining information obtained via computer systems that use algorithms and statistical models to enable understanding of complex problems and employing such understanding to develop enhanced datasets for new AI solutions.

Capture of data performance characteristics

Recording and retaining information concerning the quality attributes of a dataset, e.g., validity, accuracy, completeness, relevance, uniformity and consistency. [238]

Location of data

Methods whereby, and systems and devices wherein, data are saved and from which data can be retrieved, e.g., on premises, cloud, temporary cache, and removable media.

Migration strategies concerning data loss

Approaches and practices to eliminate, prevent, or reduce the intentional or unintentional destruction or disappearance of information caused by people, processes, or other means.

Economic impact of reuse

Monetary benefits of using existing data compared to re-generating identical data.
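The Citation impact entry in the table above describes the h-Index only in general terms. As a minimal sketch, assuming the commonly used definition (the largest number h such that h of a researcher's papers have each been cited at least h times), the calculation can be written as follows. The function name and the citation counts are hypothetical and chosen only for illustration.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers each have at least h citations.

    Assumes the commonly used h-index definition; citation_counts is an
    iterable of non-negative integers, one per published paper.
    """
    # Rank papers from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for five papers by one researcher.
example_counts = [10, 8, 5, 4, 3]
example_h = h_index(example_counts)
```

With these illustrative counts, four papers have at least four citations each, but there are not five papers with at least five, so the sketch yields an h-Index of 4.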

Table 6. Preserve/Discard lifecycle stage

Preserve/Discard: Topic

Subtopic

Definition

Criteria for Preservation

Quantitative and qualitative metrics used to assess the need for long-term retention of data. [239]

Use

Instances wherein datasets are utilized for meaningful purposes, e.g., problem-solving and decision-making.

Impact

Demonstrated, positive outcomes attributed to use of a dataset, e.g., a scientific discovery and a new measurement instrument or product.

Value

Merit or worth of data in terms of their usefulness and fitness for purpose, e.g., to make sound, fact-based conclusions and decisions.

Uniqueness

The quality of being unlike any other data in terms of, e.g., type and characteristics. [240]

Cost

Financial resources required to store and preserve data.

Provenance

The historical, attributed, and documented record of a data asset that contains details on its origin—where, when, how, and by whom it was generated/acquired/processed—and on all alterations to the data asset. [21, 41]

Legal and regulatory

Requirements via contract, law, regulation, or other agreement to preserve data.

Sustainability

The capacity to maintain or improve the state and availability of data and an RDM infrastructure over the long term. [242]

Longevity and support

The amount of time a dataset is retained in an organization and the resources to maintain this retention. [241]

Funding models

Approaches to build a reliable funding base that will support an organization's core research data projects and services. [42]

Business models

Approaches to describe how an organization ensures that its research data projects and services provide value. [243]

Storage and Preservation

Storage is a process whereby digital data are saved for later use and access via, e.g., a device or cloud service. Preservation is a series of managed activities required to ensure continued stability and access to data for as long as necessary. [183, 248]

Methods to store and preserve data

Devices and cloud services used to retain data in the short-term and long-term. [244]

File integrity

The process of protecting a file from unauthorized changes or environmental hazards, i.e., validation to determine if a file has been altered after its creation, curation, archiving, or other qualifying event. [245, 246]

Ability to do advanced searches

Capability to narrow a query through, e.g., the use of filters that eliminate irrelevant information and enable the identification of desired content. [247]

Backup and recovery

Backup: the process of making copies of data or data files to use in the event the original data or data files are lost or destroyed. Recovery: the process of restoring data that have been lost, accidentally deleted, corrupted, or made inaccessible for any reason. [249, 250]

Moving Data from One Service to Another Across Organizations

Inter-organizational transit of data.

Roles and responsibilities

The job functions and obligations that enable the movement of data among organizations.

Registry maintenance and curation

The processes of harvesting, organizing, and handling a collection of data-related resources such as repositories, services, and software, to facilitate ease of user searches and retrieval of information. Examples of registries are re3data and the NIST Materials Resource Registry. [215, 216]

Disciplinary archives

A place to store data from a specific field of study or branch of knowledge that is important but that doesn't need to be accessed or modified frequently (if at all). [74, 251]

Retention and Disposition Schedules

A timeline and plan of action based on a policy that addresses which data are important to keep for future use or reference, how those data can be searched and accessed at a later date, and which data are no longer needed and can be destroyed. [253]

Technical decisions

Conclusions regarding retention and disposition of research data that are based on scientific considerations such as merit and future potential usefulness of the data, e.g., data archiving.

Administrative/policy decisions

Conclusions regarding retention and disposition of research data that are based on logistical or operational considerations, e.g., cost of data archiving.

Deaccessioning/end-of-life

The formal, documented removal of a data collection or dataset from its location or custody of an archive service. [252]

Legal documents

Schedules for retention and disposition of data set by formal contracts or other agreements.

End-of-life special considerations

Any actions taken before disposition of data that have reached the end of their useful life or will no longer receive support for archiving. An example consideration is adhering to security protocols for sensitive data.

Recognition of removed data

Creation of a special type of landing page (i.e., tombstone page) describing the data that have been removed and providing a full bibliographic citation, a DOI (if one has been assigned), and a statement on unavailability detailing the circumstances that led to removal of the data. [254]
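The File integrity entry in Table 6 refers to validation that determines whether a file has been altered after a qualifying event. One common way to do this is to record a cryptographic checksum and compare it later. The following is a minimal sketch under that assumption, using SHA-256 from the Python standard library. The function names, the file name, and the file contents are hypothetical, and the sketch detects changes only; it does not by itself prevent unauthorized modification.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unaltered(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with the digest recorded earlier."""
    return sha256_of(path) == recorded_digest


# Illustration with a temporary file standing in for an archived dataset.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "measurements.csv"  # hypothetical file name
    data_file.write_text("sample,value\nA,1.0\n")

    recorded = sha256_of(data_file)  # digest stored at archiving time
    unchanged = is_unaltered(data_file, recorded)

    data_file.write_text("sample,value\nA,9.9\n")  # simulate an alteration
    change_detected = not is_unaltered(data_file, recorded)
```

In practice the recorded digest would be stored separately from the file, for example in the dataset's metadata, so that a later fixity check can be run against it.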

4 Overarching Themes

The RDaF was refined from the preliminary V1.0 using input from the two opening plenary workshops and the 15 stakeholder workshops. During this refinement process, 14 themes that spanned the various lifecycle stages were identified. Rather than repeat these themes in each stage, they are listed here with a brief explanation of their meaning in the context of research data and research data management (RDM). Following the explanatory narrative, the specific lifecycle stages/topics/subtopics in which each theme appears are shown in tabular form.

In most cases, the overarching themes are supported by explicit references in the framework. In other cases, the themes are implicit. For example, the cost implications and sustainability theme touches on every topic or subtopic, although it is not called out in any lifecycle stage: there is a financial implication to every decision and action that will be considered by those working with research data in any capacity. Note that while these 14 themes emerge from the general definitions of the topics and subtopics, other themes may emerge when the scope of RDM is considered from the perspective of a specific individual or organization. Such custom themes can serve as an additional organizing function for job roles, tasks, and other activities represented by the topics and subtopics in the framework.

Separate tables generated for each overarching theme document the topics and subtopics most closely associated with that theme (see Tables 7-20 below). There are also two graphics that provide summary information. Figure 3 is a Sankey diagram that provides a visualization of the relationship between each lifecycle stage and each overarching theme. Figure 4 is a matrix table that gives a high-level overview of the relationships between the overarching themes and the topics for each lifecycle stage. (Some of the overarching theme names in Figs. 3 and 4 have been truncated or abbreviated for visualization purposes.)

Sankey diagram showing the relationships between lifecycle stages and overarching themes. This information is in the tables below each overarching theme section.

Fig. 3 — Sankey diagram of the relationships between lifecycle stages and overarching themes

A matrix showing the overarching themes and each topic, as explained in the text.

Fig. 4 — Matrix diagram of topics and overarching themes
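The relationships summarized in Figs. 3 and 4 can be thought of as rows of the form (theme, lifecycle stage, topic, subtopic), as listed in the theme tables. The following is a rough sketch of how such rows could be aggregated into stage-to-theme link weights (as a Sankey diagram needs) and a topic-by-theme matrix. The five rows are taken from Tables 7 and 8. The variable names are illustrative, and counting subtopics as the link weight is an assumption, since the text does not state how the figures were generated.

```python
from collections import Counter, defaultdict

# A few (theme, lifecycle stage, topic, subtopic) rows from Tables 7 and 8.
rows = [
    ("Community engagement", "Envision", "Community Engagement", "Stakeholder communities"),
    ("Community engagement", "Envision", "Community Engagement", "Partners/partnerships"),
    ("Community engagement", "Preserve/Discard", "Sustainability", "Funding models"),
    ("Cost implications and sustainability", "Envision", "Community Engagement", "Partners/partnerships"),
    ("Cost implications and sustainability", "Preserve/Discard", "Criteria for Preservation", "Cost"),
]

# Sankey-style links: number of subtopics connecting each stage to each theme.
# (Assumed weighting; the source does not say how Fig. 3 weights its links.)
link_weights = Counter((stage, theme) for theme, stage, _topic, _subtopic in rows)

# Matrix-style view: which themes touch each (stage, topic) pair.
themes_by_topic = defaultdict(set)
for theme, stage, topic, _subtopic in rows:
    themes_by_topic[(stage, topic)].add(theme)

for (stage, theme), weight in sorted(link_weights.items()):
    print(f"{stage} -> {theme}: {weight}")
```

An organization defining its own custom themes could extend the same row structure with additional theme labels and regenerate both views.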

4.1 Community Engagement

Community engagement, typically broader for RDM practices and more focused for research data projects, is an intentional set of approaches for both listening to and communicating with stakeholders. Successful research, data management, and data curation come from strong engagement with the community of practice or discipline and the organization in which the research is conducted. Community engagement is present in all the RDaF lifecycle stages, although there is an emphasis on it within the Envision and Plan stages. Engagement with stakeholders early in the research process may result in stronger outcomes and uptake of new research. In the other four lifecycle stages, stakeholder engagement is essential for accomplishing the goals established at the beginning of a research project.

Table 7 lists the topics and subtopics that are most relevant to the overarching theme of community engagement.

Table 7. Community engagement (overarching theme)

Lifecycle Stage

Topic

Subtopic

Envision

Data Governance – Strategic/Qualitative

Identification of goals and roles

Vision and/or policy

Data management organization

Organizational values, including DEIA

Data management value proposition

Data needs assessment

Organization intent regarding FAIR data

End-use support

Stewardship

Data Governance—Legal and Regulatory Compliance

Privacy

Ethics

Data Culture and Reward Structure

Roles and responsibilities

Recognition of data management

Value of data workers

Promotion and tenure

Integrity of research and data

FAIR data principles

Incentives and impact for sharing and reuse

Disincentives for sharing and reuse

CARE and ethics

Education and Workforce Development

Workforce skills inventory

Workforce preparedness in new and advanced technologies

Data management training

HR’s supporting role in workforce development and training

Promotional paths and career development

Resources—Allocation and Sustainability

Staffing

Community Engagement

Stakeholder communities

Partners/partnerships

Engagement across knowledge domains and sectors

Inclusivity in interactions

Data services and the beneficiaries

Plan

Financial Aspects of Planning

Staffing and training

Data Management Planning

Purpose/intent of research study and context of anticipated data use

Specification of data entities and actions throughout the lifecycle

Data organization to facilitate future access

Data management expertise and training

FAIR

Organizational support for making data more FAIR

Hardware and Software Infrastructure

Interoperability

Security and privacy considerations

Research Data Standards

Sources of standards/guidelines for data/metadata

Community-based standards/conventions

Communication and Outreach

Methods to share and reuse data/metadata

Allocation of credit to project team members

Promotion of data to communities of interest

Cross-institution cooperation

Requests for additional data from the research community

Generate/Acquire

FAIR Principles

Guidelines/methodologies for each aspect: F, A, I, R

Community-Based Standards

General vs. domain-specific

Standards development organizations vs. community consensus

Vocabulary and ontology

Process/Analyze

Metadata

Responsible parties

Provenance

CRediT taxonomy

Workflow and Middleware

Collaboration tools

Share/Use/Reuse

Publishing

Repository

Peer review of datasets and metadata

Curation

Publisher agreements and policies

Incentives for data publishing

Mitigation of disincentives for data publishing

Modes of Dissemination

Data landing page

Legal and Licenses

Indigenous data rights

Usage agreements/terms/licenses and required permissions

Preserve/Discard

Criteria for Preservation

Use

Impact

Value

Uniqueness

Sustainability

Longevity and support

Funding models

Moving Data from One Service to Another Across Organizations

Roles and responsibilities

Registry maintenance and curation

Disciplinary archives

Retention and Disposition Schedules

End-of-life special considerations

4.2 Cost Implications and Sustainability

Cost implications and sustainability is a theme that touches every lifecycle stage and most stakeholders in the research ecosystem. From Chief Data Officers and provosts to researchers and grant administrators, cost is a constant focus of all individuals’ work in public and private organizations. Administrators and C-suite officers would typically focus their efforts on the stages of Envision and Plan, while researchers, particularly those with curation duties and service provision, have more impact on the cost implications in the Generate/Acquire, Process/Analyze, Share/Use/Reuse, and Preserve/Discard stages.

Sustainability in research and RDM means sustainable funding, staffing, and preservation models as applied to research data. It is imperative that sustainable plans affecting these three areas are assessed as the areas are developed and maintained to prevent institutions and users from losing access to valuable datasets.

Table 8 lists the topics and subtopics that are most relevant to the overarching theme of cost implications and sustainability.

Table 8. Cost implications and sustainability (overarching theme)

Lifecycle Stage

Topic

Subtopic

Envision

Data Governance – Strategic/Qualitative

Data management organization

Data needs assessment

Organization intent regarding FAIR data

End-use support

Stewardship

Data Governance—Legal and Regulatory Compliance

Risk assessment

Risk mitigation and management

Data Culture and Reward Structure

Value of data workers

Promotion and tenure

FAIR data principles

Maintenance of FAIR data

Incentives and impact for sharing and reuse

Disincentives for sharing and reuse

Education and Workforce Development

Workforce preparedness in new and advanced technologies

Data management training

Promotional paths and career development

Resources—Allocation and Sustainability

Sources of funding

Long-term funding

Staffing

Community Engagement

Partners/partnerships

Data services and the beneficiaries

Plan

Financial Aspects of Planning

Funding models for provisioning resources

Funding sources

Decision-making tools to assess costs

Cost-benefit analysis

Cost breakdown by lifecycle stage

Downstream lifecycle costs

Staffing and training

Data Management Planning

Purpose/intent of research study and context of anticipated data use

Data organization to facilitate future access

Data management expertise and training

Data/Metadata Considerations

Criteria for selection of data/metadata

Data Architecture

Design

Hosting and storage, cloud storage

Security

Hardware and Software Infrastructure

Organizational research needs

Sustainability of data vis-à-vis obsolete infrastructure

Security and privacy considerations

Staff expertise and support staff

Access Control Associated with Data Sensitivity

Regulatory compliance

Sensitive data/PII

Limited disclosure, IP

Licensing for reuse

Generate/Acquire

Generated Computational Data

Hardware

Parameters and conditions for computation

Acquired Data

From collaborators

From repositories

From the literature

Aggregated datasets from multiple sources

Restrictions, fees, and usage agreements

Acquisition Software

Open source vs. proprietary

LIMS

Process/Analyze

Software

Commercial vs. custom

Open source vs. proprietary

Workflow and Middleware

LIMS

Collaboration tools

Hardware

Compute requirements

Storage requirements

Network requirements

Accelerator requirements

Share/Use/Reuse

Publishing

Repository

Publisher agreements and policies

Legal and Licenses

Ownership

Data sharing and licensing agreements

Service-level agreements

Architectures for Application, Use, and Reuse

Economic impact of reuse

Preserve/Discard

Criteria for Preservation

Cost

Sustainability

Longevity and support

Funding models

Business models

Storage and Preservation

Methods to store and preserve data

4.3 Culture

Culture is the basis for the entirety of a given organization’s success in managing research data and in nearly every other aspect of running a collective enterprise; culture is what gives an institution or organization its character and consistency over time. Cultures are firmly embedded and stem from both informal practices and formal written policies, which can make them difficult to change. Culture shapes norms within an organization and creates glide paths towards ingrained values and behaviors as well as resistance to others. Specifically, culture dictates how research data are valued or supported in an institution.

Table 9 lists the topics and subtopics that are most relevant to the overarching theme of culture.

Table 9. Culture (overarching theme)

Lifecycle Stage

Topic

Subtopic

Envision

Data Governance – Strategic/Qualitative

Identification of goals and roles

Vision and/or policy

Data management organization

Organizational values, including DEIA

Data management value proposition

Purpose and value of data

Organization intent regarding FAIR data

Stewardship

Data Governance—Legal and Regulatory Compliance

Ethics

Safety and security assurance

Risk mitigation and management

Sharing/licensing

Data Culture and Reward Structure

Roles and responsibilities

Recognition of data management

Value of data workers

Promotion and tenure

Integrity of research and data

FAIR data principles

Maintenance of FAIR data

Incentives and impact for sharing and reuse

Disincentives for sharing and reuse

CARE and ethics

Education and Workforce Development

Workforce preparedness in new and advanced technologies

Data management training

HR’s supporting role in workforce development and training

Promotional paths and career development

Community Engagement

Stakeholder communities

Partners/partnerships

Engagement across knowledge domains and sectors

Inclusivity in interactions

Data services and the beneficiaries

Plan

Chain of Custody

Roles and responsibilities

Financial Aspects of Planning

Funding models for provisioning resources

FAIR

Organizational support for making data more FAIR

Hardware and Software Infrastructure

Organizational research needs

Interoperability

Security and privacy considerations

Staff expertise and support staff

Research Data Standards

Requirements and needs

Quality standards

Community-based standards/conventions

Communication and Outreach

Methods to share and reuse data/metadata

Allocation of credit to project team members

Promotion of data to communities of interest

Cross-institution cooperation

Generate/Acquire

FAIR Principles

Data born FAIR

Data made FAIR

FAIR digital objects

FAIR on a continuous scale

Guidelines/methodologies for each aspect: F, A, I, R

Tools to capture FAIR provenance

FAIR instruments and tools

Not FAIR data

Community-Based Standards

General vs. domain-specific

Standards development organizations vs. community consensus

Metadata format and file structure

Interoperability

Process/Analyze

Preparation and Pre-Processing Methods

De-identification, anonymization

Curation

Software

Commercial vs. custom

Open source vs. proprietary

Share/Use/Reuse

Publishing

Repository

Data paper

Software

Updates to datasets and new software versions

Data linking

Persistent identifier

Metadata

Integrity of data

Peer review of datasets and metadata

Reference data/digital objects in journal articles

Curation

Incentives for data publishing

Mitigation of disincentives for data publishing

Modes of Dissemination

Traditional journal article

Supplementary material

On request

Data landing page

Workflow

Mainstream media

Social media

Attribution

Dataset citation

Modes of Sharing

Standardized formats

Access

Availability statement

Mitigation of barriers and economic constraints

Legal and Licenses

Ownership

Encouragement and support for sharing, use, and reuse

Indigenous data rights

Data sharing and licensing agreements

Preserve/Discard

Criteria for Preservation

Use

Impact

Value

Uniqueness

Sustainability

Longevity and support

Funding models

Moving Data from One Service to Another Across Organizations

Roles and responsibilities

Registry maintenance and curation

Disciplinary archives

Retention and Disposition Schedules

End-of-life special considerations

4.4 Curation and Stewardship

The processes and procedures to make research data shareable and reusable are typically referred to as curation and stewardship. Both curation and stewardship, and the job roles that are responsible for them, aim to collect, manage, preserve, and promote research data over their lifecycles. Curation is often performed by librarians and others outside of a laboratory or research group, while data stewards tend to work with a specific research group, lab, or department (i.e., a specific discipline) to ensure that they are embedded in research projects from the onset of the Plan lifecycle stage. Because curators tend to work outside of labs, they are typically engaged in research projects much later, during the Share/Use/Reuse stage, which may introduce complications. The curation and stewardship theme implicitly touches each lifecycle stage.

Table 10 lists the topics and subtopics that are most relevant to the overarching theme of curation and stewardship.

Table 10. Curation and stewardship (overarching theme)

Lifecycle Stage

Topic

Subtopic

Envision

Data Governance – Strategic/Qualitative

Data management organization

Organization intent regarding FAIR data

Stewardship

Data Culture and Reward Structure

Roles and responsibilities

Recognition of data management

Value of data workers

Promotion and tenure

Integrity of research and data

FAIR data principles

Incentives and impact for sharing and reuse

Disincentives for sharing and reuse

CARE and ethics

Education and Workforce Development

Workforce skills inventory

Data management training

Promotional paths and career development

Resources—Allocation and Sustainability

Staffing

Community Engagement

Stakeholder communities

Partners/partnerships

Engagement across knowledge domains and sectors

Inclusivity in interactions

Data services and the beneficiaries

Plan

Chain of Custody

Roles and responsibilities

Financial Aspects of Planning

Staffing and training

Data Management Planning

Written data management plans (DMPs)

Specification of data entities and actions throughout the lifecycle

Machine-readable DMPs

Data organization to facilitate future access

Data management expertise and training

FAIR

Organizational support for making data more FAIR

Identification of methods/guidelines vis-à-vis FAIR principles

Research Data Standards

Requirements and needs

Sources of standards/guidelines for data/metadata

Quality standards

Community-based standards/conventions

Assessment

Metrics for tracking use and impact measures, including reuse

Communication and Outreach

Methods to share and reuse data/metadata

Allocation of credit to project team members

Promotion of data to communities of interest

Cross-institution cooperation

Requests for additional data from the research community

Access Control Associated with Data Sensitivity

Identification of responsible parties for access management

Regulatory compliance

Sensitive data/PII

Limited disclosure, IP

Licensing for reuse

Generate/Acquire

FAIR Principles

Data made FAIR

Guidelines/methodologies for each aspect: F, A, I, R

Not FAIR data

Community-Based Standards

General vs. domain-specific

Standards development organizations vs. community consensus

Data format and file structure

Metadata format and file structure

Vocabulary and ontology

Interoperability

Process/Analyze

Preparation and Pre-Processing Methods

Curation

Normalization of metadata

Metadata

Types of metadata

Responsible parties

Specification of metadata standards

Linked data structure

Persistent identifiers

Provenance

Original authoritative copy

Version identification

Derivative product

Aggregation

Subset

Timestamp

CRediT taxonomy

Share/Use/Reuse

Publishing

Repository

Data paper

Software

Updates to datasets and new software versions

Data linking

Persistent identifier

Metadata

Integrity of data

Quality measures and assessment vis-à-vis fit for purpose

Peer review of datasets and metadata

Reference data/digital objects in journal articles

Curation

Publisher agreements and policies

Incentives for data publishing

Mitigation of disincentives for data publishing

Attribution

Citation metrics

Citation impact

Dataset citation

Provenance

Author identity management

Use of persistent identifiers

Versioning

Modes of Sharing

Standardized formats

Interoperability tools

Discovery platforms

Catalogs

Registries of repositories

Access

Internal access

External access

Programmatic access

Virtual and physical enclaves

Access vs. visiting

Availability statement

Mitigation of barriers and economic constraints

Legal and Licenses

Ownership

Encouragement and support for sharing, use, and reuse

Indigenous data rights

Intellectual property rights/restrictions

Usage agreements/terms/licenses and required permissions

Standardized, machine-actionable license documents

Citation requirements

Levels of Protection

Constraints and restrictions on data use and sharing

Preserve/Discard

Criteria for Preservation

Use

Impact

Moving Data from One Service to Another Across Organizations

Roles and responsibilities

Registry maintenance and curation

Disciplinary archives

Retention and Disposition Schedules

Technical decisions

Administrative/policy decisions

Deaccessioning/end-of-life

End-of-life special considerations

Recognition of removed data

4.5 Data Quality

Data quality directly impacts a dataset’s fitness for purpose, usability, and reusability. All parties involved in every stage of a dataset’s lifecycle should be cognizant of data quality. The CODATA Research Data Management Terminology [5] definition of data quality includes the following attributes: accuracy, completeness, update status, relevance, consistency across data sources, reliability, appropriate presentation, and accessibility. Assessment of data quality is not a single process, but rather a series of actions that, over the lifetime of a dataset, collectively assure the greatest degree of quality.
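
As an illustration of how two of these attributes might be checked in practice, the following minimal Python sketch scores completeness and flags inconsistencies between two data sources. The function names, field names, and sample records are hypothetical and are not part of the RDaF or the CODATA terminology; a real assessment would use criteria appropriate to the discipline and dataset.

```python
from typing import Any


def completeness(records: list[dict[str, Any]], required: list[str]) -> float:
    """Return the fraction of required fields that are present and non-empty."""
    if not records or not required:
        return 0.0
    filled = sum(
        1
        for rec in records
        for field in required
        if rec.get(field) not in (None, "")
    )
    return filled / (len(records) * len(required))


def inconsistent_keys(source_a: dict[str, Any], source_b: dict[str, Any]) -> list[str]:
    """Return the keys found in both sources whose values disagree."""
    shared = source_a.keys() & source_b.keys()
    return sorted(k for k in shared if source_a[k] != source_b[k])


# Hypothetical records, used only to show the shape of the checks.
records = [
    {"sample_id": "S1", "temperature_K": 293.15, "operator": "A"},
    {"sample_id": "S2", "temperature_K": None, "operator": "B"},
]
score = completeness(records, ["sample_id", "temperature_K", "operator"])
conflicts = inconsistent_keys(
    {"sample_id": "S1", "units": "K"},
    {"sample_id": "S1", "units": "degC"},
)
```

Checks of this kind cover only a small part of data quality; attributes such as relevance and appropriate presentation generally require human judgment at several points in the lifecycle.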

Table 11 lists the topics and subtopics that are most relevant to the overarching theme of data quality.

Table 11. Data quality (overarching theme)

Lifecycle Stage

Topic

Subtopic

Envision

Data Governance – Strategic/Qualitative

Purpose and value of data

Stewardship

Data Culture and Reward Structure

Roles and responsibilities

Education and Workforce Development

Data management training

Plan

Research Data Standards

Quality standards

Generate/Acquire

Generated Computational Data

Verification/validation of output data

Critically Evaluated (CE) Data

Infrastructure to assure the greatest data integrity

Process/Analyze

Preparation and Pre-Processing Methods

Data cleaning

De-identification, anonymization

Amputation and imputation

Aggregation

Validation and verification

Normalization of metadata

Software

Testing and validation tools

Documentation

Share/Use/Reuse

Publishing

Integrity of data

Quality measures and assessment vis-à-vis fit for purpose

Modes of Sharing

Standardized formats

Preserve/Discard

Criteria for Preservation

Use

Impact

Value

Uniqueness

4.6 Data Standards

Data standards, both discipline-specific (e.g., Darwin Core [255] or NeXus [256]) and general (e.g., PREMIS [257] or schema.org [258]), are implemented by researchers to make their datasets both more FAIR and of higher quality. Researchers may use formal (e.g., ISO [259] or ANSI [260] standards) or de facto (e.g., DataCite [209]) standards for their research community. Use of data standards ensures consistency within a discipline and can reduce cost by decreasing the likelihood that data will have to be created again. Data standards are called out in every lifecycle stage except Envision.
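
To make the idea of a general metadata standard concrete, the following minimal Python sketch builds a dataset description using the schema.org Dataset type and serializes it as JSON-LD. The helper function, the sample values, and the placeholder identifier are assumptions made for illustration only; the choice of properties is one possibility and is not a profile recommended by the RDaF.

```python
import json


def dataset_jsonld(
    name: str,
    description: str,
    identifier: str,
    license_url: str,
    keywords: list[str],
) -> dict:
    """Build a minimal schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": identifier,
        "license": license_url,
        "keywords": keywords,
    }


# Hypothetical values; the identifier is a placeholder, not a registered DOI.
record = dataset_jsonld(
    name="Example thermal conductivity measurements",
    description="Hypothetical dataset used only to illustrate the record shape.",
    identifier="https://doi.org/10.0000/example",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["thermal conductivity", "example"],
)
serialized = json.dumps(record, indent=2)
```

Describing datasets with a shared vocabulary of this kind is one way a research group might support consistency and machine-based discovery across repositories.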

Table 12 lists the topics and subtopics that are most relevant to the overarching theme of data standards.

Table 12. Data standards (overarching theme)

Lifecycle Stage

Topic

Subtopic

Envision

Data Governance – Strategic/Qualitative

Stewardship

Data Culture and Reward Structure

Recognition of data management

Integrity of research and data

FAIR data principles

Maintenance of FAIR data

Education and Workforce Development

Workforce skills inventory

Data management training

Community Engagement

Engagement across knowledge domains and sectors

Plan

Data Management Planning

Written data management plans (DMPs)

Specification of data entities and actions throughout the lifecycle

Machine-readable DMPs

Data organization to facilitate future access

Data management expertise and training

Data Object

Measurement

Observation

Survey

Software

Specimen (physical sample)

FAIR

Identification of methods/guidelines vis-à-vis FAIR principles

Data/Metadata Considerations

Criteria for selection of data/metadata

Nature of data/metadata required

Methods to capture and store data/metadata

Metadata schema

Data Architecture

Model

LIMS

Interoperability among different architectures

Existing standards

Hardware and Software Infrastructure

Interoperability

Persistent instrument identifiers

Research Data Standards

Requirements and needs

Sources of standards/guidelines for data/metadata

Quality standards

Community-based standards/conventions

Generate/Acquire

Data Types

Measurement

Text file

Computation, simulation

Source code

Observation

Survey

Transaction

Social media

Acquired Data

Provenance

Critically Evaluated (CE) Data

Infrastructure to assure the greatest data integrity

FAIR Principles

Data born FAIR

Data made FAIR

FAIR digital objects

Guidelines/methodologies for each aspect: F, A, I, R

Tools to capture FAIR provenance

FAIR instruments and tools

Community-Based Standards

General vs. domain-specific

Standards development organizations vs. community consensus

Data format and file structure

Metadata format and file structure

Interoperability

Process/Analyze

Metadata

Types of metadata

Specification of metadata standards

Linked data structure

Persistent identifiers

Provenance

Original authoritative copy

Version identification

CRediT taxonomy

Software

Standards, protocols, and interfaces

Share/Use/Reuse

Publishing

Persistent identifier

Metadata

Integrity of data

Curation

Attribution

Citation metrics

Dataset citation

Provenance

Author identity management

Use of persistent identifiers

Versioning

Modes of Sharing

Standardized formats

Legal and Licenses

Standardized, machine-actionable license documents

Preserve/Discard

Criteria for Preservation

Provenance

Storage and Preservation

Methods to store and preserve data

File integrity

Moving Data from One Service to Another Across Organizations

Registry maintenance and curation

Retention and Disposition Schedules

End-of-life special considerations

4.7 Diversity, Equity, Inclusion, and Accessibility

Diversity, equity, inclusion, and accessibility (DEIA) is a broad theme covering important social and cultural aspects of a research enterprise. Efforts in DEIA center on growing the sense of belonging for everyone in every laboratory, research group, department, or institution. Research data practices are not immune to biases, and historical disadvantages must often be addressed through intentional action. DEIA is important not just for members of underrepresented and marginalized groups, but for the integrity of the research process as a whole. More inclusive research tends to be more rigorous, as it introduces different perspectives that enable more complete and broader interpretations of research data. Given the typical challenges associated with cultural changes within an institution, DEIA efforts must be embedded throughout the research data management lifecycle to maximize their effectiveness.

Table 13 lists the topics and subtopics that are most relevant to the overarching theme of diversity, equity, inclusion, and accessibility.

Table 13. Diversity, equity, inclusion, and accessibility (overarching theme)

Lifecycle Stage

Topic

Subtopic

Envision

Data Governance – Strategic/Qualitative

Vision and/or policy

Organizational values, including DEIA

Data Governance—Legal and Regulatory Compliance

Ethics

Social license for use and reuse

Data Culture and Reward Structure

Roles and responsibilities

Recognition of data management

Value of data workers

CARE and ethics

Education and Workforce Development

Promotional paths and career development

Community Engagement

Stakeholder communities

Partners/partnerships

Engagement across knowledge domains and sectors

Inclusivity in interactions

Data services and the beneficiaries

Plan

Financial Aspects of Planning

Staffing and training

Data Management Planning

Purpose/intent of research study and context of anticipated data use

Data/Metadata Considerations

Nature of data/metadata required

Methods to capture and store data/metadata

Hardware and Software Infrastructure

Staff expertise and support staff

Research Data Standards

Community-based standards/conventions

Assessment

Goals/definition of success

Metrics for tracking use and impact measures, including reuse

Communication and Outreach

Methods to share and reuse data/metadata

Allocation of credit to project team members

Promotion of data to communities of interest

Cross-institution cooperation

Requests for additional data from the research community

Access Control Associated with Data Sensitivity

Identification of responsible parties for access management

Sensitive data/PII

Generate/Acquire

Data Sources

In-house generation by researchers

Remote generation by researchers

In-field generation by researchers

User facility generation by/for researchers

Historical

Human-annotated

Qualitative Data

Methods and protocols

Data/metadata/paradata capture methods

Acquired Data

From collaborators

From the literature

Community-Based Standards

General vs. domain-specific

Standards development organizations vs. community consensus

Process/Analyze

Preparation and Pre-Processing Methods

De-identification, anonymization

Modeling

ML, AI

Metadata

Responsible parties

Provenance

CRediT taxonomy

Share/Use/Reuse

Publishing

Curation

Incentives for data publishing

Mitigation of disincentives for data publishing

Attribution

Author identity management

Access

External access

Mitigation of barriers and economic constraints

Legal and Licenses

Ownership

Encouragement and support for sharing, use, and reuse

Indigenous data rights

Levels of Protection

Unclassified but sensitive information

Protection of limited data/secure platforms/enclaves

Constraints and restrictions on data use and sharing

Architectures for Application, Use, and Reuse

Extensibility across communities, including machine-based interactions

Preserve/Discard

Criteria for Preservation

Use

Impact

Value

Uniqueness

Retention and Disposition Schedules

Deaccessioning/end-of-life

End-of-life special considerations

4.8 Ethics, Trust, and the CARE Principles

Ethics, trust, and the CARE principles encompass the ethical generation, analysis, use, reuse, sharing, disposal, and preservation of data and are pillars of responsible research that are called out throughout the framework. The phrase “as open as possible, as closed as necessary” [261] comes to mind when working through the ethical implications of sharing data. While ethical choices are often made at the Share/Use/Reuse lifecycle stage, questions and concerns regarding the generation or collection of data are likely to be examined by an institutional or ethics review board and must be considered in the Plan stage. In the Preserve/Discard stage, it is essential to comply with preservation and disposition standards. While the subtopics in the framework are a starting point for understanding how ethics touches every aspect of the research data lifecycle, it is also important that a project be securely grounded in the practices of a given discipline; for example, the standards for historical research will differ from those for economic or healthcare research.

Trust is a factor across the framework and is the basis for relationships between data producers and users, the funding agencies that support projects, and the institutions that host research. Specific populations will also have various ethical considerations; for example, the CARE Principles for Indigenous Data Governance are quickly becoming the standard for working with Indigenous data worldwide [262].

Table 14 lists the topics and subtopics that are most relevant to the overarching theme of ethics, trust, and the CARE principles.

Table 14. Ethics, trust, and the CARE principles (overarching theme)

Lifecycle Stage

Topic

Subtopic

Envision

Data Governance – Strategic/Qualitative

Data management value proposition

Stewardship

Data Governance—Legal and Regulatory Compliance

Ethics

Sharing/licensing

Data Culture and Reward Structure

Roles and responsibilities

Recognition of data management

Value of data workers

Promotion and tenure

Integrity of research and data

Incentives and impact for sharing and reuse

Disincentives for sharing and reuse

CARE and ethics

Resources—Allocation and Sustainability

Sources of funding

Long-term funding

Staffing

Community Engagement

Stakeholder communities

Partners/partnerships

Engagement across knowledge domains and sectors

Inclusivity in interactions

Plan

Chain of Custody

Roles and responsibilities

Implementation authority

Data Management Planning

Written data management plans (DMPs)

Purpose/intent of research study and context of anticipated data use

Specification of data entities and actions throughout the lifecycle

Data organization to facilitate future access

Data management expertise and training

Data Object

Quantitative and qualitative

Data/Metadata Considerations

Methods to capture and store data/metadata

Data Architecture

Design

Workflow

Model

Security

Hardware and Software Infrastructure

Security and privacy considerations

Research Data Standards

Requirements and needs

Quality standards

Community-based standards/conventions

Communication and Outreach

Allocation of credit to project team members

Promotion of data to communities of interest

Cross-institution cooperation