Finding Insights in Big Datasets

Dr. Rick Gill shares techniques scientists can use to deal with the overwhelming challenges of big data.

In this webinar, Dr. Rick Gill, an ecologist at Brigham Young University, discusses some of the transitions scientists have been forced to make with the advent of big data and specific techniques scientists can use to deal with the overwhelming challenges of extremely large data sets.

Next steps

Questions?

Our scientists have decades of experience helping researchers and growers measure the soil-plant-atmosphere continuum.

Presenter

Dr. Rick Gill, ecologist at Brigham Young University

Webinars

See all webinars

Hydrology 101: The Science Behind the SATURO Infiltrometer

Dr. Gaylon S. Campbell teaches the basics of hydraulic conductivity and the science behind the SATURO automated dual head infiltrometer.

WATCH WEBINAR

Advances in Lysimeter Technology

Stay current on advances in lysimeter technology.

WATCH WEBINAR

Water Potential 101—Making Use of an Important Tool

Master the basics of soil water potential.

WATCH WEBINAR


Transcript

0:10
Good morning, and welcome to today’s virtual seminar titled, “Finding Insights in Big Datasets: Surviving the Data Deluge,” presented by Dr. Richard Gill.

RICHARD GILL 0:18
Well good morning, I’m excited to be here and to talk a little bit about some of the transitions that we’ve been making in using instrumentation data over the past five or six years. And a little bit about me. I used to be a professor at Washington State University, and interacted on a number of projects with some of the developers here at Decagon. And it’s been a relationship that’s continued since I’ve been at Brigham Young University for the last seven years. And so today, I wanted to tell you about some of the approaches that we took, that I took as a novice in instrumentation. I was trained as a plant ecologist, and at no point in my school training did I get any help in understanding how to use instrumentation. And so I’m going to show you some of the transitions that we made over time and where we’re at now in using big datasets. And we’re at a really exciting time now in terms of data, in that data are becoming cheap, and they’re abundant, and there’s lots of open source data that we can use. But there was a paper that came out in Science in 2011 that talked about the environment we’re in now, and Richard Baraniuk said the data deluge is changing the operating environment of many sensing systems from data poor to data rich—so data rich that we’re in jeopardy of being overwhelmed. And I have certainly felt that. And I feel that some of the most important decisions that we’re making now as scientists are how we’re going to operate in this data rich environment. And so what I wanted to do is tell you about three projects that I’ve been working on with people in my lab, starting with our novice approach to understanding instrumentation and data management, moving into a project that my current PhD student is working on, looking at snowmelt hydrology, where he solved a number of the problems that we identified in the first project. And then finally, one where we’ve been much more deliberate about data management and how to deal with large datasets from the beginning, and where I think we’re finally getting a good handle on how to manage large datasets. And so really, what I wanted to do is use some work that we’ve been doing on the Wasatch Plateau that started in 2002, and show you some of the mistakes that we made, then transition through and show you how we started solving some of the problems that we identified in the first project, and then finally, look at how understanding what the challenges are from the very beginning allows us to anticipate problems that we’re having, or that we’re likely to have, and manage data in a way that I think, in real time, will be able to address some of the questions that we have.

RICHARD GILL 3:44
So when we started on the Wasatch Plateau, we were really interested in looking at how summertime precipitation influenced ecological processes. The Wasatch Plateau is a site where ecologists have been doing research since 1911. It’s really well understood and characterized. But we identified some interesting scientific questions associated with climate change. We thought through, if we were going to measure things continuously in the environment, what sensors we would need. We then went out and bought them or constructed them, installed them, and hooked them up to data loggers. Two or three times a year, we would visit the field site, download those data onto a laptop, export them as an Excel file, and then, as the multiple investigators on the project needed data related to the sensors, we would either email Excel files back and forth or, finally, set up a Dropbox account where we could share data that way. And then individually, we analyzed these data. And I think that this is a pretty typical workflow when we first started doing this type of work, where we really could pay attention to every single datum. As for the questions that we were interested in, this is a simple schematic that looks at water dynamics through the growing season—or through the year—on the Wasatch Plateau. And really what we were interested in is looking at how changes in the amount or the timing of rainfall during the summer in this subalpine system altered ecological processes. And so from a sensing perspective, what we were really interested in is some micrometeorological information, so air temperature, rainfall, but most importantly, looking at soil moisture. And so we put out a number of Decagon sensors and hooked them up to Campbell data loggers to look at the timing of the dry down after snowmelt, and how much soil moisture changed with individual rainfall events.

RICHARD GILL 6:16
And these are the sorts of data that we got. And so we have three years’ worth of data here. We’ve got 2010, 2011, and 2012 during the summer. In the upper three plots, we’ve got air temperature— maximum air temperature, minimum air temperature—and then some bars show individual rainfall events. And then in the lower panel, what we have is water potential that we derived using moisture release curves and measuring volumetric water content in the field. And what we see is that we have three very different years in terms of soil water dynamics. And we see that our treatment effects are what we would expect: when we alter the timing of rainfall, we can increase the dry down periods between individual rain events, and our drought treatments were able to reduce soil moisture significantly. And from these data, we could actually look closely at some ecological phenomena. Because we were interested in the expansion of tree line into the subalpine, we looked at how average water potential influenced the survival of tree seedlings, and we found that average seasonal water potential was a good predictor of seedling survival. And it was a much better predictor when we looked at end-of-season water potential versus the likelihood of these seedlings surviving over the winter. And then we could also pull individual dates out of these datasets. So when undergraduate students went out and measured gas exchange in the seedlings, we could look at carbon acquisition rates, so photosynthetic rates versus the water potential. And so we could use these well resolved datasets to address these sorts of questions. But it was all manually done. It was us going in and identifying the period that we were interested in and averaging it; we did all of the post processing in Excel. And it was kind of clunky.
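
For readers curious about that derivation step, the sketch below shows one way to go from volumetric water content to water potential using a fitted moisture release curve. It assumes a Campbell-type retention model with placeholder parameters; the webinar does not specify which curve form or parameter values were actually used on the Wasatch Plateau.

```python
import numpy as np

def water_potential_campbell(theta, theta_s=0.45, psi_e=-1.5, b=4.9):
    """Convert volumetric water content (m3/m3) to water potential (kPa)
    with a Campbell-type retention curve: psi = psi_e * (theta/theta_s)**(-b).
    Parameter values here are placeholders, not fitted values for the
    Wasatch Plateau soils; in practice they come from a laboratory
    moisture release curve for each soil."""
    theta = np.asarray(theta, dtype=float)
    psi = psi_e * (theta / theta_s) ** (-b)
    # Readings at or above saturation are capped at the air-entry potential
    return np.minimum(psi, psi_e)

# e.g. convert a few field VWC readings; values grow more negative as the soil dries
print(water_potential_campbell([0.32, 0.21, 0.12]))
```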

RICHARD GILL 8:43
And as we went through this process and really reflected on the issues that came with it, some of the most important that we identified is that because we weren’t looking at the data in real time, and oftentimes would go seven or eight months without being able to access our data, we ended up with data gaps. And so for example, here we see in 2011 that right in the heart of the growing season, right when the data were most important to have and to look at, there’s a multi-week gap. And it just happened that we downloaded the data, and something happened as the students were downloading the data, and somehow they turned off the data logger. And so we ended up with a big gap. And we didn’t recognize that gap until we went back and downloaded the data again. The other thing is that we didn’t have a very good protocol for how to manage the files once we had them. And so we had problems with version control, where one investigator would go in and play around with the data and sort them and analyze them, and convert them to the units that they were interested in and save it. And the next person that came along didn’t know if they were using raw data or converted data. And so there were issues with that. The other thing that we ran into is that this project has been going on long enough that the initial students that were working on it have left. And so we have new students coming in, and we found gaps in our metadata as well, in understanding exactly where every sensor is located and what it’s measuring. And especially as we moved to Dropbox, where we could simultaneously access files from both Pullman and Provo, and wherever other people were at, we ended up saving different versions of the file, and so the lower graphic there actually shows what our Dropbox folder looks like. And what you see is that we have multiple files, and this is even with us going in and cleaning it up, that have the same exact data in them. And so we have problems with redundancy, with tracking modifications. And so there were some real issues that arose. The data were valuable, but I think that there’s a better way to handle some of the problems that we ran into.

RICHARD GILL 11:35
All right, so we started another project. And as we did this, I brought in a PhD student named Lafe Connor. And Lafe has been really proactive in dealing with some of these data management problems. What Lafe was interested in initially was the phenomenon of dust on snow. In the western US, one of the more interesting things that has been identified as a potential modifier of hydrology in mountains is spring wind storms that blow dust onto snow. It changes the albedo of the snow, and it changes the rate at which it melts; it accelerates snowmelt. And so he started his PhD project looking at the sensitivity of subalpine systems to snowmelt timing. And he modified the conventional approach in that after he identified what his question was, he determined what his sensing needs were. And he was really fortunate in that he got a Grant Harris Fellowship, so he could get some extra instrumentation to install at his site. But before he installed that, he actually thought through the process of how he was going to manage those data as they came out and established some protocols for version control and other things. And then he went out and he installed data loggers and soil moisture sensors. Again, it was a remote site; the data stayed on the data loggers. He would install them in the fall, they would overwinter under the snowpack, and he would go in and download those data early in the spring as soon as he could access them. But now instead of saving them as individual Excel files, he developed a database that he could store them in. He identified which soil moisture dynamics he was interested in, and I’ll speak to that a little bit, so he could automate some of the process of analysis. And then after he analyzes them, he can also make those data available to the public.

RICHARD GILL 14:03
And so this is Lafe applying one of his treatments. He skied out to one of his plots and he’s adding dust to the snow to change albedo and to change snowmelt timing. And what he was interested in is understanding how advancing snowmelt timing changed the pattern of soil moisture in the system, and he thought about it theoretically. And so really what we have here is typical soil moisture from October to September, so in a water year. And what we see here, at least in theory, is that adding dust to snow should accelerate the point at which you introduce more liquid water into the soils, and so you start the onset of the spring flush earlier. But you would also get rid of the snowpack earlier, so you would start the spring dry down sooner as well, with the potential to increase the duration in which these soils are dry until monsoonal storms come in. And so this was his theoretical understanding. And so he thought about how he could have multiple plots and multiple sensors and identify these important key transition dates without just doing it by hand, not having to look at every single datum. He said, really what we’re looking at here is changes in variance. And so he identified changes in the variance of soil moisture, that is, the variability of the data, which he could easily derive using R. And so what we see is that for most of the winter, soil moisture is invariant. It isn’t changing very much. But as snow begins to melt, you start to substantially increase the variability in soil moisture. The onset of that is the transition from your winter low flux state to the spring high flux state. And you have these pulses that go through the soil profile, and so you end up with that period where the snow is melting and you’ve got massive fluxes through the profile. It then stabilizes a little bit, and you reach the end of the high flux state and move into the dry down. And so during the dry down, from day to day you’re changing, but you’re only changing a small amount. And then finally, you reach some point where you have dried down to the point where it’s relatively stable, and you enter the summer dry phase, until finally you start getting monsoonal storms in July and August, and those individual storm events create high variability.
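
As a rough illustration of that variance approach, the sketch below flags transition dates from a daily soil moisture series by watching when its rolling variability switches state. The original analysis was done in R; this version uses Python for consistency with the other examples here, and the window length and threshold are illustrative assumptions, not the criteria actually used.

```python
import pandas as pd

def flag_transitions(vwc: pd.Series, window: int = 7, threshold: float = 0.005) -> pd.DataFrame:
    """Flag days where the rolling variability of soil moisture switches state,
    a rough proxy for the winter low-flux -> spring high-flux -> dry-down
    transitions described above. The window and threshold are illustrative
    only; vwc must be indexed by timestamp."""
    daily = vwc.resample("D").mean()                       # daily mean VWC
    rolling_sd = daily.rolling(window, center=True).std()  # local variability
    high_flux = rolling_sd > threshold
    prev = high_flux.shift(1, fill_value=False)
    return pd.DataFrame({
        "rolling_sd": rolling_sd,
        "spring_onset": high_flux & ~prev,    # first day of a high-variability run
        "drydown_start": ~high_flux & prev,   # first day after variability subsides
    })
```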

RICHARD GILL 17:16
And so what he did is, he’s actually written this up as a way to share his experience. He would take his EM-50 data files, which are CSV files that have columns for every response variable, and he was able to put together a fairly simple Python script that allowed him to strip out individual sensor data, so that he had a single sensor, a single time series, that matches the protocol for the database that he was using. He created all of these files by hand, and then he would upload them as a batch to the database. And so now instead of worrying about version control and where the raw data are, all of the raw data are now databased, and instead of doing analyses and storage in Excel files, the storage is taken care of in a HydroServer database. And he’s able to pull from that database to do his analyses. This is a database that’s publicly available, and you could actually go in and find individual sites. You can pull all the data from a site, select which response variables you’re interested in and the time period that you’re interested in, and analyze only the set of data that you need for the question that you’re asking. And so he was then able to pull from the HydroServer database and write an R program, an R script, to identify those important inflection points. And from them, he takes these hundreds of thousands of individual data points and extracts the important dates that he needed: the day that he first applied the dust, the timing of the onset of the high flux period in spring, the point at which you are no longer having this high flux period and you’re starting to dry down, and the onset of the summer dry period. And so, by doing this, he’s substantially simplified the analytical process of looking at hundreds of thousands of data points.
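
As an illustration of that file-splitting step, here is a minimal sketch that breaks a multi-column logger export into one time series file per sensor, ready for a batch upload. It assumes a generic export with a single timestamp column; the column names and output layout are hypothetical stand-ins, not the actual EM-50 file format or the HydroServer upload protocol.

```python
import pandas as pd
from pathlib import Path

def split_logger_export(csv_path: str, out_dir: str) -> None:
    """Split a multi-column logger export into one file per sensor column,
    each a single (timestamp, value) time series ready for batch upload.
    The 'timestamp' column name and output layout are hypothetical."""
    df = pd.read_csv(csv_path, parse_dates=["timestamp"], index_col="timestamp")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for column in df.columns:
        series = df[column].dropna().rename("value")
        series.to_csv(out / f"{Path(csv_path).stem}_{column}.csv", header=True)

# e.g. split_logger_export("plot07_spring_download.csv", "upload_batch/")
```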

RICHARD GILL 19:49
So with this modified approach, there are obvious advantages that we found. I think understanding in advance what you’re going to do with your data is absolutely critical. But we still had some of the same problems: the data are all on the data loggers, and when we have sensor failure, or when, in field settings, you run into problems with wild animals and rodents and other things that cause sensor failures, he wasn’t able to see those failures until he showed up later in the year to download the data. The data were downloaded to the laptop, but it takes a significant amount of time to run the Python script to strip out the data and finally push it to the database. And really the big problem, or the thing that is challenging as an investigator, is the interval between when the data are collected and the time that you’re actually analyzing them. It’s not until midsummer that you’re looking at your spring data, and that gap is a little dissatisfying; it would be so much better if we could look at our data in real time and identify problems as they occur.

RICHARD GILL 21:14
And so we’ve finally got a project going on now where we’ve taken many of these shortcomings that we’ve identified, anticipated them, and built solutions into the project itself. And so this is really the point at which, for me, it becomes exciting to see the potential of big datasets. From that same Science article, we read, “Managing and exploiting the data deluge require a reinvention of the sensor system design and signal processing theory. The potential pay-offs are huge as the resulting sensor systems will enable radically new information technologies and powerful new tools for scientific discovery.” And for us, we have an ideal system to look at this in. We moved out of the mountains and into the deserts, and deserts in the West are experiencing a massive increase in wildfires. This is in large part due to the invasion of annual grasses. And so along with a number of other professors at BYU, we set up a large experimental system to look at the ecological factors that control recovery after wildfires in the West. In addition to this, one of the things that we were most interested in is how soil moisture controls the reestablishment of native vegetation, or the increase in invasives, as a result of the amount of water that’s available to them. And so we built rainout shelters in order to manipulate soil moisture, and the real value of a sensor network is being able to measure the effects of our treatments on soil moisture with high precision. And so we’re worried about things like cheatgrass. This is an invasive annual grass that is now found throughout the Great Basin and Columbia Basin. This is what it looks like in late fall. It germinates in late fall. It will overwinter as a green plant. As soon as the snows melt, it bolts, it flowers, it sets seed well before the natives do. It will then dry out and serve as a fuel if there’s an ignition event and if there’s a high amount of litter, a high amount of biomass. The other thing that we have at our Great Basin site is a toxic forb called halogeton. And halogeton has a very different life history in that it germinates in the spring, it will go through most of the summer with a really small rosette, and then it will bolt in late summer, early fall and produce seed then. And so we have these two really bad invaders at our site. And we’re interested in looking at how soil moisture as well as fire influence their abundance.

RICHARD GILL 24:34
And so this time, rather than just jumping in, identifying what we should measure, and putting sensors out, we actually went through a very deliberate planning process, thinking about what we were going to do with the data as they came in. And since we had our initial question related to fire and soil moisture, we knew what sensors were available and what we could use. And we put together a data plan. And the data plan was: what data are we going to collect and how often? How are we going to do quality checks on those data and automate that process? How are we going to link the data that we’re collecting on a data logger to the important metadata so that it becomes useful, not just to us, but to others as well? We thought through the process of saving these in a data repository, and then figuring out tools to discover those data in real time. And then for us, the really exciting thing was being able to then use the sensor data linked with more conventional ecological data for real time analyses. And so we sat down and deliberately thought through what sensors we were going to deploy and where, how often our measurements were going to need to be made, and what was possible with the budget that we were working with. As we sat down to plan, we actually stepped through this process. As we looked at data collection, we were interested in what sort of response variables we could look at, and so we decided we wanted to build in some redundancy and measure both volumetric water content and water potential at the same plots, for at least some subset of them. We wanted to measure reflectance, so real-time, plot-scale NDVI. And we also wanted to have micrometeorological measurements. We didn’t need those at every plot, but we certainly wanted a good characterization for the entire site. We had to think through how redundant we wanted this system to be. We had seen the problems with sensor failure that come with just having sensors in the field. So did we want duplicates everywhere? We realized that was impractical, but for about a fourth of the plots, we could build some redundancy into the system. We also had to think about the data stream itself. How much data could we manage? At what time interval did we need to make these measurements, and did it have to be the same for every measurement we were making? For the soil measurements, we ended up collecting data every six hours; for the micrometeorological data, we averaged hourly instead. And so we decided that the frequency at which we were collecting data was going to differ depending on the actual response variable.
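
One way to capture that last decision in code is a small sampling plan that records the interval chosen for each measurement group and downsamples raw readings accordingly. The variable names and grouping below are illustrative assumptions, not the project’s actual logger configuration or channel names.

```python
import pandas as pd

# Sketch of the measurement plan described above: soil variables every six
# hours, micrometeorological variables averaged hourly. Names are illustrative.
SAMPLING_PLAN = {
    "soil": {"variables": ["vwc", "water_potential"], "interval": "6h"},
    "micromet": {"variables": ["air_temp", "rh", "solar_radiation"], "interval": "1h"},
}

def downsample(raw: pd.DataFrame, group: str) -> pd.DataFrame:
    """Average timestamp-indexed raw readings to the interval chosen for a group."""
    plan = SAMPLING_PLAN[group]
    return raw[plan["variables"]].resample(plan["interval"]).mean()

# e.g. downsample(micromet_readings, "micromet") yields hourly means,
# while downsample(soil_readings, "soil") yields six-hour means.
```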

RICHARD GILL 27:43
In terms of data, we really were dissatisfied with this idea of waiting around, and only when we go out to the site do we collect the data and then analyze it when it gets back to the lab. We decided that, given the investment we were making in instrumentation, we really wanted those data wirelessly, and we wanted to be able to look at them as soon as they were collected. And so we ended up setting up a wireless network in order to push the data to a server using EM-50Gs. And we were pushing that data every single day. For some of the responses that we were looking at, we had the option of either collecting raw data or actually doing the transformations in the data logger prior to uploading the data. We chose to automate some of that. There are also things that we can do post collection. So initially, the data are archived, saved as CSV files on the data loggers themselves. But most of the archiving is actually now occurring as those text files are being pushed to the central server. We then can derive variables and do some calculations. We had to think through at what point we were going to link the sensor data to the metadata. The data loggers themselves don’t collect information about which treatment the sensors are in, or what depth, and so we had to figure out a way to connect those things and put them into the database. We had to figure out what we were going to do when there were problems with sensors, how we were going to gap-fill data, or whether we were going to at all, and then how we were going to synchronize all of our data across a platform. And so we thought through all of this before we put even a single sensor into the field.

RICHARD GILL 29:48
And so finally, we went out to our research site, and it’s been a really productive collaboration for me to be able to work with Colin Campbell here at Decagon. We actually set out all of our sensors, got what we needed, went through the process of the installation, and were able to test that things were working well there in the field. And so we had all of the data streaming to the server. The first real step that wasn’t already done for us, that is, that Decagon hadn’t already thought through and given to us out of the box, was attaching metadata to the sensor data themselves. And so before we could upload anything to the database, we had to put together a lookup table, and this lookup table just matched the data logger ID and the sensor ID to what it was actually measuring. And so what we could do is then, using this lookup table, attach a treatment to every datum from the sensor itself and know what we were measuring. And so using a Python script that was written by a graduate student, we were able to automate the process of pulling data from Decagon’s server, attaching the metadata to it, and then putting it into our own database. We then again used the HydroServer data repository. It’s something that we’ve used in the past. And now we’ve improved upon the initial model: instead of having a graduate student or an undergraduate strip out individual sensor columns, we’ve automated that and programmed it up. So now, as soon as the data show up on the Decagon server, we can run the process, and within minutes, it will be on the HydroServer database.
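
A simplified sketch of that lookup-table step is shown below: each reading pulled from the server is joined to its treatment, depth, and variable before being loaded into the database. The file and column names are hypothetical placeholders, not the actual script written for the project.

```python
import pandas as pd

# Join each pulled reading to its metadata using a lookup table keyed on the
# logger ID and sensor port. Column and file names here are placeholders.
lookup = pd.read_csv("sensor_lookup.csv")       # logger_id, port, treatment, depth_cm, variable
readings = pd.read_csv("daily_pull.csv",        # logger_id, port, timestamp, value
                       parse_dates=["timestamp"])

annotated = readings.merge(lookup, on=["logger_id", "port"], how="left")

# Hold back readings whose logger/port combination is missing from the lookup
# table rather than loading them into the database without metadata.
unmatched = annotated["treatment"].isna()
if unmatched.any():
    print(f"{unmatched.sum()} readings have no metadata match; check the lookup table")

annotated.loc[~unmatched].to_csv("database_load.csv", index=False)
```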

RICHARD GILL 32:26
And the next piece is really one where the open source community has been a huge benefit to us. The CUAHSI network, funded through NSF, has actually built a number of data visualization tools that can be added to websites. These are ODM tools that, if you point them to the database, allow you to interact in real time with time series data. And so that’s what we’ve been using and installing to analyze our data. These are available on the web, and they’re free, and they’re open source, and you can modify them. And so there’s real opportunity there. One of the issues that I’ve dealt with is just my own naivete associated with the work that goes into data management. I didn’t recognize how difficult this was. When I first started this project, I was interacting with some of the bioinformatics people in my department, and I sat down with some really advanced students and said, here it is, it’s a relatively straightforward process: I’ve got sensor data. I want to get them into a database. I then want to visualize them in real time. And I want to be able to analyze them, identify sensor problems, and also look at treatment effects, that sort of thing. And their eyes got really big. And I didn’t understand why they weren’t excited about this. And finally, I sat down with their professor. And he said, what you just described is about a four-year process, right? This isn’t something that you just do casually. And I began to appreciate just how challenging it is to program something like this and do the web design and all the rest of it. But once I recognized that it was difficult, I also realized that other people had done it already, and with every project, we don’t need to reinvent it. These tools are available for us online. And so now we have these tools that are freely available and can be modified to meet individual needs.

RICHARD GILL 34:55
So now we have the possibility of pulling those sensor data out of the database, looking at them in real time, and beginning to develop hypotheses that align with some of the other questions that we have in the project. So for example, here we have an aerial shot of one of the blocks of our experiment. And it’s easy to see that the upper left-hand corner and the lower right-hand corner are two plots that were burned; the absence of shrubs shows you that. But the other thing that you see is that there’s a huge difference in vegetation between those two plots. One is bright green, one is brown. And that’s because one of the treatments in this experiment is the presence or absence of small mammals. And so we begin to ask, oh, there’s a small mammal effect on vegetation after fire, but why is that? What we’re able to do then is pull sensor data, so we pull the time series from the summer that preceded this image, and we can look at water potential between these two treatments. And what we see is that the plot that is missing the green, the one where there isn’t abundant halogeton, was much drier over the course of the summer. And it was much drier because cheatgrass was more abundant. And so we were able to mechanistically link the way that cheatgrass and halogeton interact with one another, even though they’re growing at almost exactly opposite times in these plots. And so now we can realize the full potential of these sensor data by linking them with more conventional ecological data.

RICHARD GILL 36:48
And so I just wanted to finish up with this idea that these are exciting times for sensor system design. As we identify the large issues that come with managing datasets, and as more and more tools are made available online, the full potential of installing a sensor network is being realized. We’re able to measure more, gain insights into processes that otherwise were unknown to us, and address real world problems, but only after we figure out how to deal with the data.
