Scientific Stewardship in the Open Data and Big Data Era

Scientific Stewardship in the Open Data and Big Data Era — Roles and Responsibilities of Stewards and Other Major Product Stakeholders

http://dlib.org/dlib/may16/peng/05peng.html

Abstract

Ensuring and improving quality and usability is an important part of scientific stewardship of digital environmental data products, but the roles of the responsible parties — those who manage quality and usability — have been evolving over time and have not always been clearly defined. Recognizing that in the Open Data and Big Data era, effective long-term scientific stewardship of data products requires an integrated and coordinated team effort of experts in multiple knowledge domains — data management, science, and technology — we introduce the following stewardship roles for each of these domains: data steward, scientific steward, and technology steward. This article defines their roles and high-level responsibilities as well as the responsibilities of other major product stakeholders, including data originators and distributors. Defining roles and formalizing responsibilities will facilitate the process of curating and communicating quality information to users. Clearly defined roles will allow effective cross-disciplinary communication and better resource allocation for data stewardship, supporting organizations in meeting the challenges of stewarding digital environmental data products in the Open Data and Big Data era.

Keywords: Scientific Data Stewardship, Information Quality, Data Steward, Scientific Steward, Technology Steward, Open Data, Big Data

2 Terms and Definitions

A number of terms are used throughout this article. Their definitions and usage within the context of this article are described below in alphabetical order for clarity and reference:

  • Archive is an organization that intends to preserve information for access and use by a designated community (CCSDS, 2012; ISO 14721, 2012).
  • Data distributors are people or entities that provide access to data and/or information to consumers. This role may include data providers and data publishers. They may be a liaison between archives and data users. They may or may not be affiliated with archives or repositories. In some cases, users may obtain data directly from the data originators.
  • Data managers are people who oversee the processing of data in an operational environment (Chisholm, 2014; NOAA, 2011). Data managers are oriented to working with data and ensuring the data are available, but are not concerned with promoting good data governance within their domain (Chisholm, 2014).
  • Data originators are people or entities that generate data products. They could be a data producer or a data provider. In a research environment, it is usually the principal investigator associated with a certain project or program. Data providers may be a liaison between data producers and archives. Sometimes, data originators may also serve as data distributors.
  • Data products can be both original measurements and derived scientific products (adapted from the NCEI Glossary of Terms; see also Asrar and Ramapriyan, 1995; Committee on Earth Observation Satellites, 1999). However, data products are used in this article to denote post-processed and formatted science products created from either original measurements or derived data. Therefore, they refer to NOAA and NASA Level 2 to Level 4 data products defined in the Federal Geographic Data Committee (FGDC) Content Standard (FGDC, 2002). (For Level 0 and Level 1 data products, additional expertise in instruments and sensors used for measurements is crucial.)
  • Dataset is defined as "identifiable collection of data" (ISO 19115, 2003) that may contain one or more data files in identical format, having the same geophysical variable(s) and product specification(s) such as the geospatial location or spatial grid. A dataset may contain original measurements or a derived product of a fixed version. Model output such as forecasts, projections, analyses, or re-analyses can be treated as a special case of derived products.
  • Designated data center refers to an institution that intends to preserve information long-term for access and use by a designated community (CCSDS, 2012). A designated national data center is an archive that, in the United States, is required to be compliant with the standards and best practices of the National Archives and Records Administration (NARA) (NOAA, 2008). For example, NCEI is such an archive for Earth Science and geospatial data and information. On the other hand, a non-designated data center refers to a facility where extensive collections of environmental parameters are maintained because of individual research, institutional research, or operational requirements (e.g., the National Ice Center). A non-designated data center must still adhere to basic good stewardship practices, such as off-site backup and maintenance of adequate environmental control and security of its holdings, but may not be fully compliant with all of the NARA-accepted archival standards (NOAA, 2008).
  • Environmental data are the recorded and/or derived geospatial observations and measurements of the physical, chemical, biological, geological, or geophysical properties or conditions of the oceans, atmosphere, space environment, sun, and solid earth, as well as correlative data and related documentation or metadata as defined by NOAA (2008).
  • Geospatial data are observations or data products that describe the state and impact of environmental systems and include information on the geographic location and characteristics of constructed features and boundaries of the earth (EPA, 2005; NOAA, 2008) at a particular time or over a period of time. Therefore, information about spatial and temporal characteristics of data products and support for spatial and temporal subsetting will make it easier for end-users to get and efficiently use the data products.
  • Object is defined as a digital data file, a paper record, an image, an article, or a collection of any or a mix of those items. An object and its accompanying metadata and documents may remain the same or, more likely, be modified somewhere between its submission and use. Additional metadata may be created and information about the object may be captured and made available to users during archival, stewardship, and service processes. This additional information may refer to, but is not limited to, descriptive and representative documents about the object, including retrieval algorithms, input data sources, and processing steps, for enhanced transparency and understandability; about the software and hardware systems used to generate the object for enhanced transparency and reproducibility; about the product quality procedures used to ensure product quality for enhanced data trustworthiness; and about how to get the data files and to use the product for enhanced data discoverability and usability. Capturing and conveying information about data, either through metadata or documentation, in a consistent manner will not only improve machine readability, namely, interoperability, but also make it easier for users to compare various products to determine the suitability for their applications.
  • Non-functional requirements, in systems and software engineering, specify criteria that can be used to judge the operation of a system (ISO 25010, 2011; Chung, 1993). They are used in this paper to refer to constraints imposed on the preservation and stewardship of environmental data by federal laws, mandates, guidelines, and regulations (Peng et al., 2015).
  • Repository refers to a place where a large amount of something is stored (Merriam-Webster). The World Data System (WDS) of the International Council for Science (ICSU) has specified the repository types as: Domain or subject-based repository; Institutional repository; National repository system; Publication repository; Library/Museum/Archives; Research project repository (WDS-ICSU, 2015). In this article, repository refers to a facility that follows basic good stewardship practices including maintenance of adequate environmental control for its storage. A non-designated data center may sometimes be referred to as a repository. For the sake of simplicity, unless mentioned otherwise to focus on the difference between designated/non-designated data centers and repositories, the term archive will be used in this article to denote an archive, a data center, or a repository.
  • Scientific quality refers to the accuracy, precision, validity, and suitability of a product for its intended applications (Ramapriyan et al., 2015).
  • Stakeholder, in terms of project management, is defined as "An individual, group, or organization which may affect, be affected by, or perceive itself to be affected by a decision, activity, or outcome of project" (PMI, 2013). Adapting this definition to product management, product stakeholder in this article refers to an individual, group, or organization that is involved in, has an interest in, or is potentially impacted by development, creation, preservation, stewardship, distribution, service, or application processes of the data product. Product key player refers to an individual, group, or organization that develops, produces, curates, stewards, publishes, or serves the data product to users. Other product stakeholders include but are not limited to product sponsors, project or program managers, and users.
  • Usability is defined in a broad sense as "the ease of use and learnability of a human-made object" (Wikipedia). The International Standard (ISO 9241-11, 1998) defines usability as "The extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use." In this article, usability refers to product usability in terms of providing additional information about data products, including characteristics such as their statistical mean states and variability, uncertainty estimates, etc., to make it easier for users to understand and use the data product.
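
As an illustration of the spatial and temporal subsetting mentioned under Geospatial data and the summary characteristics (mean state and variability) mentioned under Usability, the following is a minimal Python sketch. The `Observation` record, its field names, and the `subset` and `summarize` helpers are hypothetical; they are not drawn from any standard or from the systems described in this article.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, stdev


@dataclass(frozen=True)
class Observation:
    """One geolocated, time-stamped value (hypothetical field names)."""
    lat: float
    lon: float
    day: date
    value: float


def subset(observations, lat_range, lon_range, day_range):
    """Return the observations inside the inclusive spatial and temporal bounds."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    start, end = day_range
    return [
        ob for ob in observations
        if lat_min <= ob.lat <= lat_max
        and lon_min <= ob.lon <= lon_max
        and start <= ob.day <= end
    ]


def summarize(observations):
    """Report the count, mean state, and variability (sample standard deviation)."""
    values = [ob.value for ob in observations]
    return {
        "count": len(values),
        "mean": mean(values) if values else None,
        # A standard deviation needs at least two values.
        "stdev": stdev(values) if len(values) > 1 else None,
    }
```

A user could call `subset` with a bounding box and a date range and pass the result to `summarize`, obtaining the kind of product characteristics that make it easier to judge suitability before downloading a full dataset.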

3.3 Roles of stewards

Roles within the long-term data management, preservation, and stewardship processes are separated into data, scientific, and technology stewards. Stewards in this article are roles assigned to domain subject matter experts (SMEs). SMEs are people with extensive knowledge and experience in their fields. The status of SME is gained, not assigned (Chisholm, 2014). Stewards need to have a mindset of caring for other people's data and need to be concerned with how users are doing with the data in a broader domain (Chisholm, 2014; Information Management, 2014; Peng, 2015). Therefore, not every SME is suited to become a steward.

Stewards are considered to be at the highest rank in their own domain knowledge and expertise hierarchy, while the other roles in the same domain hierarchy may be simply defined as a point of contact (POC), a specialist, or a subject matter expert. Overall, stewards need to be aware of federal policies and mandates and governmental guidelines, help define the functional requirements needed to meet the non-functional requirements those impose, define procedures, and provide guidance on domain best practices to others. In this way, stewards serve as a centralized domain knowledge and communication hub.

3.3.1 Role of data stewards

The role of data stewards has been previously defined as leading governance practices and providing guidelines on governance (Khatibloo et al., 2014; Information Management, 2014; Chatfield and Selbach, 2011). From the scientific data stewardship perspective, data stewards are responsible for ensuring compliance with data management standards, including community standards on data quality metadata and policies such as the U.S. Information Quality Act (U.S. Public Law 106-554, 2001) and Open Data Policy (OMB, 2013). They also need to provide data management guidance and help define data management requirements for other stewards, documentation and metadata team members, and other key stakeholders.

Someone currently fulfilling the role of data manager with extensive knowledge in data management and preservation could be assigned the role of a data steward. It is, however, important for the person to expand his or her general knowledge in technology and scientific domains and to have the mindset of promoting good data management practices beyond the normal community for which the person is generally responsible.

3.3.2 Role of scientific stewards

For environmental and geospatial data, precision and accuracy of the data themselves are vital, but having complete, correct metadata and other relevant information about the data (e.g., spatial, temporal, and spectral characterizations, uncertainty sources and estimates) is equally important for effective long-term preservation and use of the data. Expert bodies (NRC, 2005, 2007) have established the need for and emphasized the importance of scientific oversight for environmental data products. The responsibility of ensuring data quality and improving data usability traditionally fell on the shoulders of data producers but is shifting to data managers, in part as a result of the requirements for making data accessible in an open and timely fashion, driven by user needs. However, effectively and accurately capturing, describing, and conveying data quality information in a timely manner can be beyond the scope or capability of many data producers and data managers or even data stewards, when the tasks are performed alone. To fill this capacity gap, Peng et al. (2015) introduced the concept of scientific steward.

The role of scientific stewards is to provide expert knowledge about the subject that the dataset is associated with, such as temperature or precipitation; to provide scientific oversight to ensure the accurate scientific representation of data and metadata values, namely, scientific integrity; to provide information or guidance on data quality and characterization (Peng et al., 2015); and to help define data quality and usability requirements for other stewards, data producers, and other key stakeholders.

While it is important to have scientific stewards participate and oversee the basic stewardship services, such as the first two levels of tiered data stewardship service defined by NCEI (2014), shown in Figure 4, the role of scientific steward becomes essential for achieving or ensuring higher levels of stewardship maturity and service (see Peng, 2015 for definitions of stewardship maturity levels for individual datasets). This is particularly true for functional areas associated with evaluating and monitoring product quality and with improving product usability by providing or promoting the availability of data characteristics, such as spatial and temporal means and their variability, data error sources, and uncertainty estimates (Figures 4 and 5).

Figure 4: Tiers showing levels of stewardship services for NOAA's environmental data products. (Source: NCEI (2014). Courtesy of Kenneth Casey, NCEI.)

Figure 5: Diagram of functional areas (cyan-filled boxes) for scientific stewardship of digital environmental data products.

It is possible that a data originator, such as a principal investigator, can act as a scientific steward. However, it is essential for that person to gain general knowledge of data management, to be familiar with tools used for archive and access, to have a basic understanding of user requirements, and to be willing to work closely with data and technology stewards.

3.3.3 Role of technology stewards

In the Big Data era, increased data volumes and variety, complex data structures, and low data latency requirements have made it difficult to manually assess and monitor the data quality of all data holdings. To ensure data quality, developing and maintaining tools for monitoring product quality becomes an important part of scientific stewardship (Figure 5). Currently, a gap often exists between managing data quality and defining requirements for software and system development. To fill the gap, either data managers or technical professionals must gain the application or scientific knowledge required to define the appropriate requirements for the tools.

As data are increasingly treated as valuable assets for decision-makers, decision support based on fast data analysis has made ensuring data quality a critical but challenging task. Therefore, having tools available is not just helpful but a necessity for effectively stewarding and serving digital scientific data. Those tools allow data and scientific stewards to effectively capture, describe, and convey data quality information. Tools help monitor data quality, in addition to supporting data preservation and access processes. To develop tools that are useful and usable to data and scientific stewards, software developers must be able to understand and capture the data use and stewardship requirements and define their implementation requirements. Tools also benefit end-users, for example by letting them view data products before requesting aggregated or subsetted data for their unique applications.
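
To make the idea of a quality-monitoring tool concrete, the following is a minimal Python sketch of an automated check that counts missing and out-of-range values for one variable. The function name `quality_report`, its parameters, and the 10% default threshold are illustrative assumptions; real valid ranges and acceptance criteria would come from the scientific steward for the product in question.

```python
import math


def quality_report(values, valid_min, valid_max, max_missing_fraction=0.1):
    """Summarize missing and out-of-range values for one variable.

    None and NaN both count as missing. The valid range and the missing-value
    threshold are illustrative placeholders, not values from any standard.
    """
    total = len(values)
    missing = sum(1 for v in values if v is None or math.isnan(v))
    out_of_range = sum(
        1 for v in values
        if v is not None
        and not math.isnan(v)
        and not (valid_min <= v <= valid_max)
    )
    # An empty input is treated as entirely missing.
    missing_fraction = missing / total if total else 1.0
    return {
        "total": total,
        "missing": missing,
        "out_of_range": out_of_range,
        "missing_fraction": missing_fraction,
        "passed": (
            total > 0
            and out_of_range == 0
            and missing_fraction <= max_missing_fraction
        ),
    }
```

A report like this could be generated for each incoming file and stored alongside the product metadata, so that quality information is captured in a consistent, machine-readable form instead of being assessed by hand.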

The role of technology stewards is defined in this article to fulfill such a need. A technology steward has domain knowledge including, but not limited to, software development, database management, web service application development, and system integration. Technology stewards need to have general knowledge of data and metadata management and of the general requirements of users of digital environmental data and information.

The role of a technology steward in ensuring and improving data quality and usability consists of providing software and system guidance, ensuring compliance with community interoperability standards, ensuring data integrity during system and technology upgrades, and defining system requirements for other stewards, development team members, and other key stakeholders.
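
One common way to check data integrity across a system or technology upgrade is to compare checksums recorded before and after the change. The sketch below uses SHA-256 from the Python standard library; the helper names (`sha256_of`, `build_manifest`, `compare_manifests`) and the manifest layout are hypothetical and are not taken from any archive system described in this article.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(before, after):
    """List files that went missing, were added, or changed content."""
    shared = before.keys() & after.keys()
    return {
        "missing": sorted(before.keys() - after.keys()),
        "added": sorted(after.keys() - before.keys()),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }
```

Building a manifest before a migration and another one afterwards, then comparing the two, gives a simple report of any file that was lost or altered during the upgrade.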

The role of technology steward is likely to be assigned to a software or system developer or engineer. Again, it is crucial for the technology steward to gain general knowledge of data management and science domains and to have a mindset for promoting good data interoperability and usability practices to a broader domain.

In short, now and into the future, successful and effective long-term management, preservation, and scientific stewardship of digital environmental data products require an integrated and coordinated effort of a team of stewards who are subject matter experts in three different domains: data stewards, scientific stewards, and technology stewards. It is recommended that all three types of stewards learn the basics of the others' domains to be most effective in communicating with each other and with other product stakeholders.
