Eyes on Earth Episode 58 - Satellites and Cloud Computing
Detailed Description
Satellite imagery is everywhere. We see it on TV news and weather coverage, in our Twitter and Facebook feeds, and on our phones’ mapping apps. The data behind that imagery is nothing like a screenshot, though. It’s composed of tiny packets of data, broken down from huge files and digitally manipulated to resemble the surface of the Earth, a swirling storm system or a map of urban growth. Cloud computing resources can make it easier to work with huge datasets that cover long periods of time, which is why many remote sensing scientists are turning to the cloud for their analyses. On this episode of Eyes on Earth, we hear from a scientist who used the cloud for a 150-year water use modeling project, and from a data scientist working to help train others to use cloud resources.
Sources/Usage: Public Domain.
Transcript
JOHN HULT:
Could you even get a machine that's as powerful as a cloud?
AARON FRIESZ:
You can, but it's very, very expensive, and so there's a cost efficiency here. You're paying for only what you use. You don't have to say, up front, "I need this X-amount-of-dollars server so I can run my analyses." You pay as you go with the cloud.
HULT:
Hello everyone, and welcome to another episode of Eyes on Earth. We're a podcast that focuses on our ever-changing planet and on the people here at EROS and across the globe who use remote sensing to monitor and study the health of Earth.
I'm your host for this episode, John Hult.
Satellite imagery is everywhere. We see it on TV news and weather coverage, in our Twitter and Facebook feeds, and on our phones' mapping apps.
The data behind that imagery is nothing like a screenshot, though. It's composed of tiny packets of data, digitally manipulated to resemble the surface of the Earth, a swirling storm system or a map of urban growth.
These are huge files. A single Landsat scene, covering an area roughly 115 miles on a side, is well over two gigabytes, larger than some full-length films.
It takes a lot of computing horsepower to download and analyze large satellite datasets, particularly when the goal is to answer questions about change over time. That's one of the reasons why remote sensing scientists are increasingly turning to cloud computing, which allows users with cloud accounts to analyze, visualize and interpret satellite data without downloading it and storing it locally.
It's also why the USGS created a cloud-optimized format for Landsat data with Collection 2, and why NASA's Land Processes Distributed Active Archive Center, or LP DAAC, is developing tools to help scientists learn how to use the cloud.
Joining us today to talk about cloud computing are Stefanie Kagone and Aaron Friesz. Kagone is a remote sensing scientist and contractor at EROS who's used cloud computing to analyze decades of satellite data to study water use across the United States. Friesz is the science coordination lead for the LP DAAC, which is located at EROS, where he works on tools that help get users started with the cloud.
Stefanie, Aaron, thanks for joining us.
KAGONE:
Thank you for having us.
FRIESZ:
Thank you, John.
HULT:
Let's first talk a little bit about the cloud. What is it, and how is it useful for working with satellite data? Aaron, would you like to take that one?
FRIESZ:
The cloud is essentially servers and compute resources that you or anyone can access over the Internet. Cloud computing delivers scalable, on-demand, pay-as-you-go access to pools of compute resources. In addition to servers that store massive amounts of data, you have resources like RAM and GPUs and CPUs that allow you to perform your analysis. Storing satellite data in the cloud is a perfect marriage. You don't have to rely on your local machine and the resources that come with it to perform your analysis, nor do you have to download massive amounts of data and have the infrastructure to store them. Now you can log in to a cloud account, access those data where they are saved, and perform your analysis next to the data. Having on-demand resources gives you the ability to perform at-scale analysis.
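To make that "analysis next to the data" idea concrete, here is a minimal sketch of one common pattern: reading a small window out of a cloud-optimized GeoTIFF (the format the USGS adopted for Landsat Collection 2) over the network, so only the bytes you actually need ever move. The URL is hypothetical, and real Landsat cloud holdings may require authentication or a requester-pays bucket.

```python
# A minimal sketch, assuming a hypothetical URL to a cloud-optimized GeoTIFF
# (COG). GDAL/rasterio issue HTTP range requests under the hood, so only the
# bytes covering the requested window are transferred, not the full
# multi-gigabyte scene.
import rasterio
from rasterio.windows import Window

url = "https://example.com/landsat/LC08_L2SP_B4.TIF"  # hypothetical COG

with rasterio.open(url) as src:
    window = Window(col_off=0, row_off=0, width=512, height=512)
    band = src.read(1, window=window)  # read one 512x512 chunk of band 1

print(band.mean())  # e.g., the average pixel value over that chunk
```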
HULT:
So when you say "at-scale analysis," can you give us a real quick definition of what you mean there when you're saying that?
FRIESZ:
Oftentimes, you have to narrow your research area of interest down. And so, rather than focusing on South Dakota, I can scale my research out to the United States, or even global scale. So I'm not restricted by my resources to perform these analyses.
HULT:
In other words, the cloud behaves like a great, big supercomputer on the user side, and instead of needing to have a giant, powerful machine, you can rent time and do the work without having to do the downloading or pay for that hardware. Correct me if I'm wrong here, but anybody can access this quote-unquote supercomputer, anywhere in the world. It doesn't matter where you are, whether you're at a fancy research facility, a fancy university, or somewhere much smaller, like, you know, your own house, you can use the cloud, right?
FRIESZ:
Totally, yeah. Anywhere you have Internet access. You're not restricted by your location, nor by the device that you're using. You don't need a massive desktop. I can run a workflow from a cell phone.
HULT:
You don't have to have a fancy computer to get in there, and you don't have to have a fancy computer to work with the data and work at scale. So Stefanie, let's turn to you now. Tell us a little bit about your research focus, which is evapotranspiration. First, tell us: what is evapotranspiration? And how have you used satellite data and the cloud to study it?
KAGONE:
Evapotranspiration, or "ET" for short, as we call it since it's quite the word, is a combination of evaporation and transpiration. Evaporation usually comes from soil, and transpiration comes from plants. It's an important part of the water cycle. About 60 to 70 percent of the water received as precipitation in that cycle is recycled back into the atmosphere that way.
HULT:
And it's invisible, right? So this 60 to 70 percent, you can't see it. You can see a rain gauge, but you can't see this, right?
KAGONE:
Exactly. So that's what makes it a little bit difficult to grasp or to model, since you're dealing with water vapor that is usually invisible to you.
HULT:
Okay, okay, so how do you use satellite data to track ET?
KAGONE:
We know that a healthy plant produces a lot of ET if it has enough water available from the soil in the form of soil moisture, and if there's enough energy through sunshine to convert that water back into water vapor. ET, therefore, is also a measure of how healthy vegetation is. You can see whether vegetation is under stress from drought conditions or even a fire. ET can tell how much irrigation water is needed to keep, let's say, a field of crops healthy throughout the growing season. So we use remote sensing data, for example from LP DAAC, from Aaron's group, along with other sources, like weather datasets with wind speed and so on, to estimate ET values at various scales. That ranges from global extent, for drought monitoring purposes, down to field scale, using Landsat data, for example, to determine the water use or irrigation water needs of a field. So ET is quite scalable, from bigger, basin-wide applications down to field-scale applications that really look at the plants themselves and how healthy they are.
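One common remote sensing ingredient in this kind of vegetation-health work is a spectral index such as NDVI, computed from a satellite's red and near-infrared bands. Below is a toy sketch with made-up reflectance values; operational ET models like the ones Kagone describes involve many more inputs, such as land surface temperature and weather data, plus careful quality screening.

```python
# A toy sketch: NDVI (normalized difference vegetation index) from red and
# near-infrared (NIR) reflectance, one common remote sensing input to
# vegetation-health and ET modeling. Values here are made up for illustration.
import numpy as np

red = np.array([[0.08, 0.10], [0.20, 0.05]])  # toy red-band reflectance
nir = np.array([[0.45, 0.35], [0.22, 0.50]])  # toy NIR-band reflectance

ndvi = (nir - red) / (nir + red)
print(ndvi)  # values approaching 1 suggest dense, healthy vegetation
```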
HULT:
Tell us how cloud computing has been valuable in that work. Because you talked about basin-wide, and even global scale. You were even able to look at 150 years of data using cloud computing?
KAGONE:
Yes, John, we did. We received a task to produce 150 years of ET data for the Delaware River Basin on the East Coast, from 1950 to 2100, and the deadline was about nine months. I knew from previous model runs and experience that it would not be possible with the tools I had. Just the Delaware River Basin would probably have taken about five months for one run to finish. Therefore, we turned to the cloud. It was a new concept to the team, but we said, "Okay, let's just try it, jump into it and see what we can do with it." And then, of course, cloud computing and bucket storage provided us the resources we needed to complete the task. After we got the code and everything ready on the cloud, we were able to process 150 years in about five days. It enabled us for the first time to run the model over and over and over to actually debug every little bug that we had. We also could evaluate the estimates, change the input variables to the model, and run comparisons. And we could do it over and over and over, because it would take one work week instead of five months. In the end, we could deliver a high-quality product to our customers and stakeholders.
HULT:
If I'm going to summarize here, it sounds like you could have potentially run this 150-year model using the compute resources that you had, but it would have taken you five months, and that wouldn't have given you any time to deal with problems or, like you say, fix the code. So you turned to the cloud and cut it from five months to five days?
KAGONE:
I couldn't believe it myself, and I wish I had used it even earlier. But it's sometimes a little bit of a hurdle when there's new technology involved. I really think it's worth researching it, investing in it, getting to know it a little bit, as a scientist, not just a computer person.
HULT:
I want to stay on this point for just a second. How much of a learning curve are we talking about? Did you have to spend two or three months at a coding bootcamp or something to figure this out? How much of a time commitment was involved? Because I know that's kind of a hurdle for people.
KAGONE:
The main thing we really needed to do was adjust the code to be able to run on the cloud, and that took us one or two months. Then I'd say it was another month or two to learn your way around the cloud and the new terminology, and just get over the fear that new technology brings. It is really not so different from other software. It is just on a bigger scale.
HULT:
Aaron, let's turn back to you. When NASA polls its satellite needs working group about research practices, it still finds that a majority of users download data rather than using the cloud. It's my understanding that the LP DAAC is working both to explain the benefits of cloud computing and to help scientists learn how to use it. There's something called the EarthData Cloud Cookbook, which is pretty entertaining, and there are a lot of other tutorials and resources out there. Tell us a little bit about those efforts.
FRIESZ:
That download model has been the method of choice for a number of years because it was really the only thing we had. But with the cloud, there are efficiencies to be gained. No longer do you need to fill up hard drives with these kind of dark repositories of data. The LP DAAC has many efforts going. For a number of years now, we've developed Jupyter notebooks that give you a feel for working with data. We've started incorporating our cloud data, showing that it's not that big of a jump to take this Python Jupyter notebook and execute it against a dataset that is in the cloud. The EarthData Cookbook is part of a collaboration that NASA has with Openscapes. A number of NASA DAACs are involved in this effort. The LP DAAC covers land processes; other DAACs serve communities of ocean, atmosphere, or cryosphere users. So we're bringing all these groups together, starting to identify commonalities in our resources, and seeing where we can collaborate and build on the tutorials we have, so that we can offer a more interactive, cross-DAAC experience with our data.
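As one concrete flavor of what those notebooks walk through, here is a minimal sketch assuming the open-source earthaccess Python package, one of the tools featured in the NASA/Openscapes materials. The product short name, bounding box, dates, and granule count are illustrative, and a free Earthdata Login account is required.

```python
# A minimal sketch, assuming the earthaccess package and a free NASA
# Earthdata Login account. Product, area, and dates are illustrative.
import earthaccess

earthaccess.login()  # prompts for, or reuses, Earthdata credentials

# Find granules of an LP DAAC product over an area and time range.
results = earthaccess.search_data(
    short_name="HLSL30",                           # Harmonized Landsat Sentinel-2
    bounding_box=(-104.06, 42.48, -96.44, 45.95),  # roughly South Dakota
    temporal=("2022-06-01", "2022-06-30"),
    count=5,
)

# Open the granules as file-like objects streamed from the cloud,
# rather than downloading whole files to local disk.
files = earthaccess.open(results)
print(files[0])
```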
HULT:
Let's talk about the benefits and drawbacks. What might those be? Aaron, why don't we stay with you on that one.
FRIESZ:
The first thing now is what Stefanie just talked about: being able to do big data analysis in a much shorter amount of time, and being able to iterate, too. There's a lot of tinkering that goes on to come to a model that is representative of a system.
HULT:
When you say iterate, you're talking about the fine-tuning that needs to happen?
FRIESZ:
Absolutely. Stefanie said that they had a 9-month deadline. There's really no room for mistakes if you have a model that's taking five months to run. But if you have something that you can run in a week, then you can fine-tune, re-run, validate, re-run, and do the scientific method on those.
HULT:
It's not just the ability to scale up and do these large datasets, but it's also the ability to fine tune and really just improve the products that you are creating.
FRIESZ:
Absolutely. Yup. Another benefit is being able to combine data that really never could be combined when we were working from local machines, because of the restrictions of resources and storage. With that data in the cloud, you can start combining datasets and gaining new insights that were not achievable before. "Drawbacks" may be a little strong, but yes, there are cost factors, of course. It is a pay-as-you-go model, and so on the best of days you could be paying for exactly what you used on that machine. But errors occur, and sometimes people leave machines running. When those machines are running, you're paying for that resource. There is definitely a learning curve. I think more and more applications are making their way to the cloud, so those who are more comfortable working from a graphical user interface will have that ability. But right now, this is really a sweet spot for those working in a scripting environment. The learning curve for those individuals is not necessarily understanding Python or R, but understanding the behind-the-scenes cost calculations.
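To give a feel for the behind-the-scenes cost arithmetic Friesz mentions, here is a rough sketch. The hourly and storage rates are hypothetical; real prices vary by provider, region, instance type, and data egress.

```python
# Back-of-the-envelope cloud cost arithmetic with hypothetical rates.
# Real pricing varies by provider, region, and instance type.
instance_rate = 0.77   # $/hour, hypothetical compute instance
storage_rate = 0.023   # $/GB-month, hypothetical object ("bucket") storage

five_day_run = instance_rate * 24 * 5    # compute for a five-day model run
storage_month = storage_rate * 500       # 500 GB kept for one month
left_running = instance_rate * 24 * 30   # the same machine forgotten for a month

print(f"5-day run:        ${five_day_run:,.2f}")   # $92.40
print(f"storage (500 GB): ${storage_month:,.2f}")  # $11.50
print(f"left running:     ${left_running:,.2f}")   # $554.40
```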
HULT:
Here's one for both of you: what kinds of questions can we ask with cloud computing that wouldn't have been possible 10 or 20 years ago?
KAGONE:
I see the value in doing more long time-series analyses, like the 150 years that we did. We were looking backward in time and also into the future, and we can do that not just for a small area or a small basin, but for regions all around the world. And especially since remote sensing data is globally available, that would be a great benefit. If we studied more of the globe and its different processes, how everything is connected and what happens if we make changes to the environment, we could learn from those global findings and maybe apply them locally.
HULT:
Aaron, what are you hearing about? What are you hearing people talk about?
FRIESZ:
There's now the ability to get more ideas, because of the openness and inclusiveness of the cloud. Anyone who has an Internet connection can access the cloud. So this really opens the door for more ideas to come in, ideas from individuals and groups who never had a seat at the table for these types of scientific research because they didn't have the resources to contribute. It's also easier to share now. The process is more transparent, more open, community-driven at times, and so doing reproducible science is actually achievable. No longer do we have to do the science in our lab, write up an article and publish it, sometimes behind a paywall. We're definitely moving to a more transparent, more open paradigm for doing science. Now I can bundle up my algorithm and the pointers to the datasets that I used, ship that off and make it available to the community, and they won't have to reinvent my procedure. They can just open it up, deploy that container, and make adjustments as they see fit.
HULT:
Any closing thoughts from either of you? Stefanie, why don't we start with you, since we just heard from Aaron.
KAGONE:
Cloud computing really is a great tool in your toolbox, but it shouldn't be seen as the only tool. We sometimes get, like, "oh, we just have to use the cloud," and I would not necessarily agree with that. But together with your local laptop or computer, along with the cloud, or even high-performance computing for the heavy processing, we can really do great science here at EROS and around the world, and help make a positive impact on the pressing issues of our time, including droughts, floods, fires, even land use change.
FRIESZ:
I personally am excited about the prospect of innovation. I think we've seen the tip of the iceberg. I'm thinking it's just going to gain momentum, and we're going to see some really neat science coming out of this migration to the cloud.
HULT:
We've been talking with Stefanie Kagone and Aaron Friesz about using satellite data in the cloud. Stefanie, Aaron, thanks for a fascinating conversation.
KAGONE:
Thank you, John.
FRIESZ:
Thank you. This was great.
HULT:
And thank you to the listeners for joining us as well. You can find all our shows on our website at usgs.gov/eros. That's u-s-g-s dot gov, forward slash e-r-o-s. You can also find EROS on Facebook and Twitter to see the latest episodes, and you can find us on Apple Podcasts or Google Podcasts.
This podcast is a product of the U.S. Geological Survey, Department of the Interior.