Histogram-based gradient boosted regression tree model of mean ages of shallow well samples in the Great Lakes Basin, USA

April 18, 2024

Green and others (2021) developed a gradient boosted regression tree model to predict the mean groundwater age, or travel time, for shallow wells across a portion of the Great Lakes Basin in the United States. Their study applied machine learning methods to predict groundwater ages in wells from well construction, well chemistry, and landscape characteristics. For a dataset of age tracers in 961 water samples, the mean travel time from the land surface to the sample location (center of the saturated open interval) was estimated for each sample using parametric functions. The mean travel times were then modeled using a gradient boosting machine algorithm, with model hyperparameters tuned by cross validation.
The model contained in this data release was converted from the original R-language model developed by Green and others (2021) to a Python-based histogram-based gradient boosting regression tree (HGBRT) model (Pedregosa and others, 2011). Conversion to Python facilitates the model's use as a support model for a groundwater nitrate decision support tool by Juckem and others (2024). The hyperparameters of the HGBRT model were adjusted using a Bayesian optimization algorithm (Head and others, 2021), with the goal of producing results similar to those of the original model by Green and others (2021). A total of 72 predictor variables were used for model development, including basic well characteristics, soil properties, aquifer properties, hydrologic position on the landscape, recharge and evapotranspiration rates, water quality constituents, and land use. Model results indicate that the mean of the natural logarithm of mean groundwater age for the wells used to train and test the model is 3.39 ln(years), with a root mean square error (RMSE) of 0.76 ln(years) for the HGBRT model on the holdout data. This RMSE (0.76) is similar to the holdout RMSE of the original model (0.84) reported by Green and others (2021).
When the simulated values from the HGBRT model are back-transformed from log space, the mean groundwater age is 55.9 years with an RMSE of 35.4 years for the testing data. Green and others (2021) do not report comparable back-transformed results; instead, they report simulated ages for 14,335 non-sampled wells. The larger relative RMSE for back-transformed ages reflects the fact that, in untransformed units, errors grow as ages increase.
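The growth of error with age after back-transformation can be seen with a small calculation. The sketch below assumes, for illustration only, that a prediction is too high by exactly 0.76 ln(years), the holdout RMSE quoted in this release, and converts that fixed log-space error into years for several hypothetical true ages. The function name and the example ages are not from the data release.

```python
import math

RMSE_LN = 0.76  # holdout RMSE in ln(years), as quoted in this release


def abs_error_years(true_age_years, ln_error=RMSE_LN):
    """Overprediction in years when the ln(age) prediction is ln_error too high."""
    predicted_years = math.exp(math.log(true_age_years) + ln_error)
    return predicted_years - true_age_years


# Hypothetical true ages: the same log-space error yields a larger error in years
# for older water, because the error is multiplicative after back-transformation.
for age in (5.0, 50.0, 200.0):
    print(f"true age {age:6.1f} yr -> overprediction {abs_error_years(age):6.1f} yr")
```

Because a constant error in log space is a constant multiplicative factor in years, the absolute error scales in proportion to the age itself.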
In addition to a Python script containing the overall HGBRT methods, this data release includes a self-contained model directory for recreating the published HGBRT model. Three directories are provided, containing (1) Python attributes and input predictor variables, (2) model input, and (3) the output model. The output directory also includes a model object (age_ml_model.joblib) for the HGBRT model used to predict the natural logarithm of the mean groundwater age. This model object is used directly by the groundwater nitrate decision support tool by Juckem and others (2024).
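A model object saved with joblib can typically be reloaded with joblib.load and used through its predict method. The sketch below shows one way this might look. The directory path is a guess, the helper predict_age_years is hypothetical, and the required order and names of the 72 predictor columns must be taken from the files in the release, which this example does not reproduce. Joblib files should be loaded only from trusted sources, because loading can execute arbitrary code.

```python
import math
from pathlib import Path


def predict_age_years(model, predictor_rows):
    """Predict ln(mean age) with a fitted model and back-transform to years.

    `model` is any object with a scikit-learn style predict() method;
    `predictor_rows` must match the predictor layout the model was trained on.
    """
    ln_ages = model.predict(predictor_rows)
    return [math.exp(float(value)) for value in ln_ages]


# Hypothetical location of the model object after downloading the release.
model_path = Path("output") / "age_ml_model.joblib"

if model_path.exists():
    import joblib  # third-party package, needed only to read the model file

    model = joblib.load(model_path)
    # Fitted scikit-learn estimators normally expose the expected predictor count.
    print("predictors expected:", getattr(model, "n_features_in_", "unknown"))
else:
    print(f"{model_path} not found; download the data release first")
```

Given a table of predictor values arranged as the release specifies, predict_age_years(model, rows) would return one estimated mean age in years per row.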

Publication Year 2024
Title Histogram-based gradient boosted regression tree model of mean ages of shallow well samples in the Great Lakes Basin, USA
DOI 10.5066/P9LFX0XP
Authors Leon J Kauffman, Christopher T Green, Katherine M Ransom, Wonsook S Ha
Product Type Data Release
Record Source USGS Digital Object Identifier Catalog