ScienceBase Updates - Fall 2022
Fall 2022 topics include news on the ScienceBase integration with Globus to support release of large USGS datasets, making your data release more accessible, a tip on connecting directly to a .csv or .txt file in ScienceBase, and a featured data release on monitoring trends in burn severity.
ScienceBase Integration with Globus to Support Release of Large USGS Datasets
As the size of USGS research outputs continues to increase, the ability to store and publicly host these ever-growing datasets needs to keep pace. In 2017, the Science Analytics and Synthesis (SAS) Science Data Management team completed the certification process to establish ScienceBase as a USGS Trusted Digital Repository. While ScienceBase saw a large uptick in use for public data release, large files continued to pose challenges for researchers; at that time, ScienceBase could only handle file uploads of approximately 2 GB. Since then, the ScienceBase team has incrementally increased the file sizes the system supports. First, the supported file size rose to 10 GB when the ScienceBase large file uploader was introduced in 2016. Later, in 2019, it rose to 30 GB with the ability to upload files directly to ScienceBase cloud storage. More recently, the ScienceBase team has been contacted by researchers needing to release much larger data products, with respect to both the size of individual files (e.g., 400+ GB) and the number of files (e.g., 100,000+ files). To meet this growing need, the ScienceBase data release team has developed two processes that can now use Globus to facilitate data transfer and access.
First, what is Globus? Globus is a service that allows users to efficiently, reliably, and securely move data between systems through a single web interface. Essentially, Globus monitors a file transfer and can restart it where it left off in the event of a network interruption, dramatically improving resiliency for large data transfers. Globus is in widespread use in the research landscape, with Globus endpoints available at hundreds of universities, laboratories, and computing facilities around the world. The USGS now has a subscription to Globus and multiple Globus endpoints.
USGS users can log into Globus with their Active Directory credentials.
Case 1: Globus to ScienceBase Transfer
ScienceBase can now ingest files from Amazon Web Services (AWS) S3 buckets with the proper Identity and Access Management (IAM) configuration. This supports the ability to pull files from other USGS Cloud Hosting Solutions (CHS) locations, or research partners, into ScienceBase CHS storage. However, many researchers in USGS still do not work directly with S3 buckets (via console or command line interface), and those who do may find the IAM configuration process challenging. To solve this problem, the Science Analytics and Synthesis group within Core Science Systems has established an AWS S3 bucket with the proper IAM configuration to support ingest into ScienceBase. This eliminates the complexity of working through IAM configurations on a case-by-case basis for buckets. The ScienceBase data release team has developed a process using Globus to help users get their data into this staging location, after which the files can be attached to ScienceBase items and moved into ScienceBase cloud storage via the application’s user interface (or via code).
Who should use this file upload method?
Who should NOT use this file upload method?
What is the workflow for using Globus to transfer data to ScienceBase?
Case 2: Globus Deep Storage Data Release
The ScienceBase Data Release team has also recently developed a process for releasing what the team is calling a “Deep Storage” data release. For these data releases, the data remain in a Globus Collection, and public users will need a free Globus account to access the data. The ScienceBase data release landing page and the attached XML metadata record support the discovery and presentation of the data release, but users navigate the data release collection and obtain the files through Globus. Unlike the temporary Globus Collections used to support the S3 data transfer to ScienceBase (described above), these deep storage collections will persist on USGS on-premises or cloud storage configured as a long-term cataloged collection.
Who should use this file upload method?
Who should NOT use this file upload method?
What is the workflow for setting up a Globus Deep Storage Data Release?
Featured Data Release
U.S. Geological Survey, USDA Forest Service, Nelson, K., 2021, Monitoring Trends in Burn Severity Thematic Burn Severity Mosaic from 1984 to present (ver. 2.0, June 2022): U.S. Geological Survey data release, https://doi.org/10.5066/P9NETC0T.
USGS Data Owner: Earth Resources Observation and Science (EROS) Center
The Monitoring Trends in Burn Severity (MTBS) program maps wildfires that occur throughout the contiguous United States. Collected data such as the frequency, size, and severity of wildfires allow for analysis of the effects these events can have over time and space. This release contains a burn severity mosaic for the years from 1984 to 2021.
The related publication, which investigates changes to the mapping procedures and data products that have occurred in this timeframe, has been cited by 30 other publications. While many of these uses of the data are to classify frequency and perimeter trends, others have used the measures of severity to investigate vegetation regrowth (Morresi and others, 2022; Li and others, 2022) or how wildfire impacts snowpack (Giovando and Niemann, 2022).
References:
Giovando, J., and Niemann, J.D., 2022, Wildfire Impacts on Snowpack Phenology in a Changing Climate Within the Western U.S.: Water Resources Research, v. 58, no. 8, https://doi.org/10.1029/2021WR031569.
Morresi, D., Marzano, R., Lingua, E., Motta, R., and Garbarino, M., 2022, Mapping burn severity in the western Italian Alps through phenologically coherent reflectance composites derived from Sentinel-2 imagery: Remote Sensing of Environment, v. 269, p. 112800, https://doi.org/10.1016/j.rse.2021.112800.
Li, Z., Angerer, J.P., and Wu, X.B., 2022, The impacts of wildfires of different burn severities on vegetation structure across the western United States rangelands: Science of The Total Environment, v. 845, p. 157214, https://doi.org/10.1016/j.scitotenv.2022.157214.
How to Make Your Data Release More FAIR: Accessible
The FAIR (findable, accessible, interoperable, and reusable) guiding principles for data, first outlined in Wilkinson and others (2016), have quickly become a popular way to assess and improve the usability and utility of scientific datasets. However, it can be difficult to glean practical and straightforward ways to implement the principles in your own data releases. We will explore a few small ways to make your data more FAIR in the next few Updates, continuing with Accessible (see the Summer 2022 Updates for the piece on Findable).
Using the ScienceBase data release process ensures that a few of the principles under Accessible are already fulfilled for you. For example, through the revision process, we ensure that metadata records are available even when data are no longer available, and we maintain ScienceBase as a repository that is free and open to the public. Here are a few other simple ways to make your data more accessible on ScienceBase.
Web Services or Direct Download?
When creating your data release, it’s important to consider how your users will primarily access the data: through web services or by direct download.
If you anticipate workflows in which the data are read directly from the ScienceBase item via web services:
- Certain geospatial file formats are recognized by ScienceBase and can be displayed in preview maps and used to generate web services. These are shapefiles (.shp), GeoTIFFs (.tif), and ESRI Service Definition files (.sd).
- Uploaded spatial zip files must be unzipped for ScienceBase to recognize the format. When one of these geospatial file formats is uploaded, ScienceBase will recognize the format and bring up a popup window, asking if an extension should be created. Selecting "Create Extensions" will allow ScienceBase to display the file in the preview map and generate web services for the data.
- Web services can make your data release more accessible if the files are intended to be primarily accessed programmatically. With ScienceBase’s Web Map Services (WMS) and Web Feature Services (WFS), spatial data can be viewed in client-side GIS software or online visualization tools like ArcGIS Online, The National Map (TNM) Viewer, and other applications.
- ScienceBase also now supports programmatic access to cloud optimized file types such as cloud optimized GeoTIFFs (COGs). Learn more about providing access to cloud optimized files here or in the Fall 2020 Updates.
- Spatial services can help meet certain data needs, but they are not required. In some cases they raise considerations that add more complexity than is necessary (e.g., optimizing display and performance in mapping applications versus preserving data fidelity).
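As a rough illustration of programmatic access to a WMS, the sketch below builds a GetCapabilities request URL with the Python standard library. The `SERVICE`, `REQUEST`, and `VERSION` parameters come from the OGC WMS standard; the endpoint shown is a hypothetical placeholder, not a real ScienceBase address, so substitute the WMS URL listed for your own item.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def wms_capabilities_url(base_url: str, version: str = "1.3.0") -> str:
    """Return a WMS GetCapabilities URL for the given service endpoint."""
    parts = urlsplit(base_url)
    # Keep any query parameters already on the endpoint, then add the
    # standard WMS parameters.
    query = dict(parse_qsl(parts.query))
    query.update(
        {"SERVICE": "WMS", "REQUEST": "GetCapabilities", "VERSION": version}
    )
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
    )


# Hypothetical endpoint, for illustration only.
EXAMPLE_ENDPOINT = "https://example.org/wms/my-item"
print(wms_capabilities_url(EXAMPLE_ENDPOINT))
```

The resulting URL can be pasted into a browser or into the "add WMS layer" dialog of most GIS clients to list the layers a service offers.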
If you anticipate users primarily downloading data directly:
- Spatial files uploaded in zipped format will not display in preview maps or generate web services; however, they will remain available for download.
- If the data are not intended to be primarily accessed via web services, it can be best to keep them zipped. Having the data package zipped together makes the data more accessible to users downloading the data directly.
- If you’d like to display a map of the study area of the data but don’t need web services, you can upload the study area map as an image, and it will display in the top right corner of the landing page.
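The choice between the two approaches can be checked before upload. The following is a minimal sketch, based only on the formats named in this section, that sorts file names by how ScienceBase is described as treating them. The function name and the three labels are illustrative assumptions, not part of any ScienceBase tool.

```python
from pathlib import PurePosixPath

# Extensions this section lists as recognized geospatial formats.
RECOGNIZED_EXTENSIONS = {".shp", ".tif", ".sd"}


def classify_upload(filename: str) -> str:
    """Label a file by how it is expected to behave after upload."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix == ".zip":
        # Zipped spatial files stay downloadable but are not recognized.
        return "zipped: download only"
    if suffix in RECOGNIZED_EXTENSIONS:
        return "recognized: preview map and web services possible"
    return "other: download only"


for name in ["roads.shp", "burn_severity.tif", "package.zip", "readme.txt"]:
    print(f"{name}: {classify_upload(name)}")
```

A check like this is only a planning aid; the "Create Extensions" prompt in ScienceBase remains the step that actually enables the preview map and services.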
Updating the DOI
The digital object identifier (DOI) associated with your data release is key to keeping your data accessible by providing a persistent link to your data. However, DOI links can break if the DOI’s record is not kept up to date. ScienceBase uses the DOI Tool to reserve and publish DOIs for USGS data releases. When your data release is published, the DOI record is updated with the URL of the ScienceBase landing page to which the DOI should resolve. If the data are moved from ScienceBase for any reason, or the landing page is removed, the DOI record must be updated to keep the data release accessible.
During data release revisions, data authors should work with the ScienceBase team to ensure that the DOI is pointing to the correct URL. If data need to be moved from ScienceBase for any reason, contact the ScienceBase Data Release team (sciencebase_datarelease@usgs.gov) to ensure that the original DOI link is properly redirected.
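One way to spot a stale DOI record is to follow the DOI link and compare where it lands with the landing page you expect. The sketch below assumes network access and uses only the Python standard library; the function names are illustrative, and some landing pages may reject automated requests, so treat a failure as a prompt to check manually rather than as proof of a broken link.

```python
import urllib.request

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Normalize a bare DOI or a DOI link to a https://doi.org/ URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return DOI_RESOLVER + doi


def resolve_doi(doi: str, timeout: float = 10.0) -> str:
    """Follow the DOI's redirects and return the final URL.

    Requires network access; urllib raises URLError or HTTPError
    if the link cannot be resolved.
    """
    request = urllib.request.Request(
        doi_to_url(doi), headers={"User-Agent": "doi-check-sketch"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


# DOI of the featured data release cited in this issue.
print(doi_to_url("10.5066/P9NETC0T"))
```

Calling `resolve_doi("10.5066/P9NETC0T")` and comparing the result with the expected landing page URL gives a quick check after a revision or a move.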
Utilize Tagging
Using the tagging feature in ScienceBase can make your data easier to query and retrieve. When data are consistently tagged, users can pull together and traverse relevant results more easily. For example, users looking for water quality data releases in ScienceBase could use the query string:
to see all results with tag type “USGS Scientific Topic Keyword” and tag name “Water Quality”. From there, you can download a CSV that includes the item IDs of all data releases returned in the search, making it easier to programmatically access files on these pages or otherwise interact with the data.
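Once the CSV of search results is downloaded, the item IDs can be pulled out with a few lines of Python. This is a minimal sketch: the column name `id` is an assumption about the export's header, and the sample rows are placeholders rather than real ScienceBase items, so adjust `id_column` to match your file.

```python
import csv
import io
from typing import Iterable, List


def item_ids_from_csv(lines: Iterable[str], id_column: str = "id") -> List[str]:
    """Return the non-empty values of the ID column from CSV text lines."""
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or id_column not in reader.fieldnames:
        raise KeyError(f"column {id_column!r} not found in CSV header")
    ids = []
    for row in reader:
        value = row[id_column]
        # Skip blank or missing cells.
        if value and value.strip():
            ids.append(value.strip())
    return ids


# Placeholder data standing in for a downloaded search-results CSV.
SAMPLE = "id,title\nplaceholder-id-1,First example\nplaceholder-id-2,Second example\n"
print(item_ids_from_csv(io.StringIO(SAMPLE)))
```

For a real export, pass an open file instead of the sample, for example `open("results.csv", newline="", encoding="utf-8")`, where the file name is whatever you saved the download as.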
The Denver photo library is an example of consistent and thorough tagging, with the tags also being utilized in the accompanying photographic library explorer. To add tags to your data release landing page, you can manually add tag types and tag names in the “Tags” tab in the edit form, or parse your metadata’s keywords onto the page by uploading your metadata file and selecting “yes” when asked if you’d like to propagate the metadata to the page.
Subscribe to the ScienceBase Mailing List for Quarterly Updates.