Big tools for big data

June 9, 2022

Nirav Merchant discusses how resources founded and grown at UArizona, like CyVerse and Data7, support data-driven research, both on campus and across the globe.

Science Talks Podcast Episode 35: Big tools for big data, featuring Nirav Merchant

Some of science's most difficult challenges, like ending world hunger and curing disease, involve more than just traditional science fields such as plant and cell biology. Data science brings together computational, statistical, and mathematical techniques to expand our knowledge from big datasets collected at the bench, in the field, or on the computer. Handling these large datasets can pose a challenge, especially for researchers who may be unfamiliar with big data or lack the infrastructure to handle it. Nirav Merchant is the co-principal investigator for CyVerse and Director of the UArizona Data Science Institute. Nirav is an expert in developing computational platforms and enabling technologies to improve research productivity and collaboration for interdisciplinary teams and virtual organizations.


In 2018, you gave a lecture for the College of Science's series on Humans, Data and Machines. You told the audience that machine-learning-based systems are becoming their new co-workers. Can you tell us a bit more about this - what exactly is machine learning, and how will we be, or are we currently, working alongside these systems?

It's been a while, but for many of the topics we touched on in 2018 I can definitely report progress, and I'll give you some good examples of how machines have started to become our coworkers.

First, let me tell you a little bit about machine learning itself, because people make it more complex than it is. It's really just another way of looking at what's embedded in your data. There are a lot of patterns and knowledge sitting in your data, and you're looking to discern those patterns and then utilize them to make predictions - to look forward - so that when you find something that doesn't match exactly but is close, you can get a reasonable answer.

Machine learning techniques borrow from statistics, computer science, and math, and many of them have been around for decades but never saw popular adoption. What changed in many cases is hardware like GPUs - now very popular thanks to cryptocurrency, which is also what makes them hard and expensive for science to get, because the cryptocurrency market buys them up.

Because of these GPUs and other advances in methods and hardware, people were able to process amounts of data that they could not before, and with that, they are now able to train machine learning models. When we say training in machine learning, you're giving the model prior information and saying, “Build those networks…build that pattern recognition for me, so that when I give you something unknown, you can give me a reasonable answer.”

It's never perfect, but it's close. Keep in mind that machine learning, much like a child or any human being, needs continuing education from time to time. There is no perfect machine learning method. You have to go back, realize it's not performing well on something, and then retrain it - give it more data to learn from. Then it's going to perform better.
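A minimal sketch of that train-then-retrain loop, using scikit-learn on a toy dataset (the library, dataset, and model here are illustrative choices, not anything Merchant names):

```python
# A toy "train, check, retrain" loop with scikit-learn (illustrative only).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First round of training on a small slice of the data.
model = LogisticRegression(max_iter=5000)
model.fit(X_train[:200], y_train[:200])
print("accuracy with little data:", model.score(X_test, y_test))

# "Continuing education": retrain with more examples, and it performs better.
model.fit(X_train, y_train)
print("accuracy with more data:", model.score(X_test, y_test))
```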

I would say the most interesting example of machine learning methods becoming my coworker is from just the last week. I've started noticing more and more services where you can give them a lot of text - a research paper, a newspaper article, or even a website - and they will synthesize all that information and give you a summary. So instead of reading the whole article, you get the key, unique content as a summary statement distilled from a lot of data.
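Merchant doesn't name a specific product; as a stand-in, here is a minimal sketch of this kind of summarization with the open-source Hugging Face transformers library (the model choice is an assumption for illustration):

```python
# A minimal text-summarization sketch with Hugging Face transformers,
# an open-source stand-in for the commercial services described above.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

article = """Data science brings together computational, statistical,
and mathematical techniques to expand our knowledge from big datasets
collected at the bench, in the field, or on the computer..."""

# max_length/min_length bound the size of the generated summary.
print(summarizer(article, max_length=60, min_length=20)[0]["summary_text"])
```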

In the 2018 talk, I gave examples of how attorneys were using this at law firms, and now you and I can get access to it for $6 a month. In a very short time, it has become a commodity. I've been impressed by some of the results I'm seeing - again, they're not perfect, but they're definitely better than me skimming through a paper very quickly and missing some of the key pieces.

As another example, I like to cook and collect recipes, but I'm horrible at organizing them. I find them in different places, and when you look at a recipe, you have to convert it into an ingredient list, and each recipe you find on a different site is organized differently.

Now there is a machine-learning-based company that takes those recipes from different websites and does machine reading - it reads the recipe, extracts the ingredients, and creates a template for you that is consistent regardless of where you got the recipe.
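Merchant doesn't say how that service works under the hood, but one common approach is worth sketching: many recipe pages embed schema.org/Recipe metadata as JSON-LD, which can be mapped onto one consistent template. The URL and parsing below are illustrative only; real pages need more defensive handling.

```python
# Sketch: pull a recipe's schema.org JSON-LD into a consistent template,
# regardless of how the site lays out its page. Illustrative only.
import json
import requests
from bs4 import BeautifulSoup

def extract_recipe(url: str) -> dict:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        data = json.loads(tag.string or "{}")
        # Some sites wrap the recipe in a list or an @graph container.
        candidates = data if isinstance(data, list) else data.get("@graph", [data])
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") in ("Recipe", ["Recipe"]):
                return {
                    "name": item.get("name"),
                    "ingredients": item.get("recipeIngredient", []),
                    "steps": item.get("recipeInstructions", []),
                }
    return {}

print(extract_recipe("https://example.com/some-recipe"))  # placeholder URL
```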

 

You direct CyVerse and the UArizona Data Science Institute, or Data7. Both of these initiatives help researchers use large data sets to ask - and hopefully answer - more complex questions. Talk to us about why large data sets hold the keys to understanding life's mysteries, and how initiatives and resources like CyVerse and Data7 play a role in this innovation and discovery.

In data science, over time it became very evident that we were all becoming experts at producing data, whether from the sensors we were wearing or from the machines we were using, which were producing more sequencing data or larger images.

We are gathering a lot more data than we ever were, and that's not going to change anytime in the near future. What most people were struggling with was how to manage it, and when we say manage data, that doesn't mean just storage - it means managing it from its birth to its demise, or its publication.

The goal is to do that well, so you're able to get the most value from that data. If you just collect it and can't utilize it, you're losing money.

The volume of data was what was overwhelming people, so the first thing we wanted to do was give our researchers an upper hand and say, “If it's a large collection of data and I have collaborators all over the world, I should be able to do that without having to think of how I will upload this data, share it securely with them, and give them access to see live changes if they are in Australia.”

When you think in those terms, you have things like Box or Dropbox - they're very good for documents, spreadsheets, or small files, but they fall short the moment you start thinking of scientific data, which is pretty large. We realized that a lot of our researchers were spending time struggling with this, and because they could not share that data, they were confined to whatever they could do in their own labs.

CyVerse tries to break that barrier down and say that all of us are part of a virtual organization. What I mean is, yes, we are the University of Arizona, but to solve a problem we reach out to a colleague who might be at a different university in a different city, or it might be somebody across the building. If the pandemic has taught us anything, it is that working remotely takes a different kind of skill, but once you get the hang of it, you can very quickly assemble a group of people in multiple time zones and get the work done.

At CyVerse, 10 plus years ago, we really subscribed to that mindset, and we said, “What can we do to create virtual organizations that could bring people in that could help solve something, and then they can leave when that task is done - they don't have to be hanging around?”

That's how we can get subject matter experts to contribute - if joining the team and getting to the point where you can contribute what you know takes a long time, it becomes frustrating for many people on a very complex project. Our goal was to make that easy.

Fast forward to today: at CyVerse we've built many things, including our Data Store, which handles petabytes of information. Regular scientists who don't have any major IT help can use their own laptops to do things that would normally have taken a long time, or would not have been possible for them at all.
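As a rough sketch of what that looks like from a laptop: the CyVerse Data Store is built on iRODS, so a library like the open-source python-irodsclient can upload a large file and share it with a collaborator. The host, zone, paths, and account names below are placeholders, not real CyVerse endpoints.

```python
# Sketch: upload and share data on an iRODS-backed data store from a
# laptop. All connection details and paths here are placeholders.
from irods.access import iRODSAccess
from irods.session import iRODSSession

with iRODSSession(host="data.example.org", port=1247, zone="examplezone",
                  user="alice", password="not-a-real-password") as session:
    # Push a large local result file into a shared project collection.
    session.data_objects.put("sequences.fastq",
                             "/examplezone/home/alice/project/sequences.fastq")

    # Grant a collaborator read access, so they see the live copy
    # immediately, whatever time zone they are in.
    session.permissions.set(iRODSAccess(
        "read", "/examplezone/home/alice/project/sequences.fastq",
        "collaborator", "examplezone"))
```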

We took the lessons from there and applied them to the Data Science Institute, where we noticed we have a lot of talent on campus across different departments, but our scientists were again struggling to connect with that expertise. We asked, “How do we put that together in such a way that people are able to quickly solve a problem - either doing proof-of-concept work they can use to go get a bigger grant, or tackling a bigger project - and how do we cut across different disciplines in cyberspace?”

We were doing a lot that focused on life sciences nationally and internationally with CyVerse, but Data7 focused on doing that for our campus specifically. 

The other big piece was asking how we train people. CyVerse also provides a lot of training, but we realized that many of our faculty colleagues working on this were postdocs and graduate students when these technologies didn't even exist - the cloud itself was not there. Now they are in their labs and have spent a lot of time working on something, but they don't have time to pick up some of these new technologies. Their students want to do it, but they don't have an avenue to do it either, so what can we do for our students and faculty such that it complements the coursework? All of our coursework is so packed - there's no scope for adding yet another class on some of these topics - so how do we bring together people who are going to help us develop a workforce on our campus that is data science aware and literate?

That's how, in collaboration with the library's Dr. Jeff Oliver and engineering's Dr. Vignesh Subbian, we started the Data Science Ambassadors program, where each college provides us with a graduate student to whom we provide training and some funding. The students then collaborate with each other to figure out what is needed in their college and how Data7 can help bring that training to them. They are the vehicles for sharing that knowledge within their college, and across colleges with their peers.

That's now grown further - we also have Data Science Fellows, PhD students working on topics where somebody from another department wants to work with them once they graduate, which opens the door for multiple labs to work together.

Finally, we have the Data Science Educators. They'll be working on different topics and bringing training, so on any given day on campus you're going to have a lot of options to go learn about good software hygiene, good data hygiene, and all the popular data science tools.

The first step to doing data science is having good control of your tools - understanding how to use them, how to collaborate with them, and how to organize your data. Once you do that, you can ask higher-order questions. But if you don't have your data organized, and the first thing you do is seek out a collaborator and they see your data is in shambles, they are not going to be willing to collaborate with you, because they know it's an uphill battle.
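One concrete piece of that hygiene is a consistent, self-describing project layout. A minimal sketch, assuming a generic convention rather than any official Data7 or CyVerse template:

```python
# Sketch: scaffold a consistent research-project layout with the
# standard library. The layout is a generic convention, not a standard.
from pathlib import Path

LAYOUT = [
    "data/raw",         # original files, never edited in place
    "data/processed",   # derived files a script can always regenerate
    "src",              # analysis code, under version control
    "results/figures",  # outputs, safe to delete and rebuild
    "docs",             # README, data dictionary, provenance notes
]

def scaffold(project: str) -> None:
    for sub in LAYOUT:
        (Path(project) / sub).mkdir(parents=True, exist_ok=True)
    (Path(project) / "README.md").touch()

scaffold("my_analysis")
```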

Along the same lines, we've been very fortunate with the BIO5 KEYS program, which is also looking to bring in a lot more data science awareness. With a wonderful organization like BIO5, we are going to work with the team that organizes KEYS so that the Data Science Educators can participate and contribute some of the modules and training, giving the students a background in a lot of these methods.

A lot of the methods we use for this kind of training come from Software Carpentry, an international organization of which we are one of the founding members. It focuses on best practices for scientific computing, and it looks at learners across the full spectrum of ability and resources to ensure the learning material is inclusive.

We put all our students and the ambassadors through the same Software Carpentry training so that they know how to build the material and how to communicate, so that it's a welcoming space to learn. We also have something called Coffee and Code, which happens right here next to the Keating Building, and there's a Hacky Hour around town. These are all community-organized activities that we participate in and support, because our goal as an institute is to build that community so that others can come participate and learn.

 

You’re part of a large group of interdisciplinary researchers at UArizona studying the long-term effects of COVID-19 as part of the NIH RECOVER initiative. Tell us a bit more about the goals of this project and your specific role in it.

We are fortunate that most people recover from COVID in a way that lets them conduct their daily life and go back to some level of normalcy, but there are others who, over time, are not able to get back to the level of energy and health they had before they were infected.

It's an evolving landscape where we have to take everything - from measurements of variables to electronic medical records to cognitive abilities - and look at the early indicators for people who are suffering from this condition, then ask how it can be diagnosed and, more importantly, treated, so that we can help the person recover and come back to better health than where they are today.

As part of that, there are many dimensions of data and data gathering. Dr. Subbian, who is also the Associate Director of the Center for Biomedical Informatics and Biostatistics, is leading the team that works with the multiple sites, much like ours, that NIH set up to send in data. We make sure that data is harmonized correctly, so that when we pool data from across the whole country and, possibly, the whole world, there's power and strength in the analysis and it's not one-off.
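A toy illustration of what harmonization means in practice: mapping each site's local column names and units onto one shared schema so the pooled analysis is valid. The field names and units below are invented for illustration; the real RECOVER data model is far richer.

```python
# Toy data-harmonization sketch with pandas: two sites report the same
# measurement under different names and units; both map onto one schema.
import pandas as pd

site_a = pd.DataFrame({"patient": ["a1", "a2"], "temp_f": [99.1, 101.3]})
site_b = pd.DataFrame({"pid": ["b1", "b2"], "temperature_c": [37.2, 38.5]})

def harmonize_a(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "patient_id": df["patient"],
        "body_temp_c": (df["temp_f"] - 32) * 5 / 9,  # Fahrenheit -> Celsius
        "site": "A",
    })

def harmonize_b(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "patient_id": df["pid"],
        "body_temp_c": df["temperature_c"],  # already in Celsius
        "site": "B",
    })

# One pooled table with one schema, so cross-site analysis has real power.
combined = pd.concat([harmonize_a(site_a), harmonize_b(site_b)],
                     ignore_index=True)
print(combined)
```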

Hopefully, with this patient data, we can put these variables together in a way that gives a good view of what is happening in patients' lives, and then we can use that to help others who are infected and possibly going down the same path, and mitigate it.

 

We like to think of our researchers as superheroes, so if you were a superhero, what would your superpower be?

My superpower would be making people feel relaxed and comfortable when dealing with complexity.


Watch Science Talks on YouTube