Our public drinking water system requires investment and maintenance. There are hundreds of thousands of water main breaks each year, and even when the system works well, it loses billions of gallons of water to leaks annually. In most communities, sections of the water system are at least 100 years old and likely include lead service lines.
This problem is compounded by lack of trustworthy records. Service line records are typically scattered across multiple systems, unorganized, incomplete, often illegible, and frequently contradictory or incorrect. Nonetheless, those historical records hold valuable information for data science teams—if they can get at it. As it stands, creating a usable dataset from these records is a largely manual process, time-intensive, and difficult to accelerate in any meaningful way.
Unfortunately, water utilities are persistently underfunded and underserved. Most won’t have the resources to attract and retain the sort of data science talent that they need to efficiently track down lead contaminant sources in our water systems—let alone afford to waste analyst time puzzling over faded tap cards.
Scaling the Data Science to Accelerate Lead Service Line Projects
According to Jared Webb, Chief Data Scientist for BlueConduit, “The beauty of statistics and machine learning is in how it scales with numbers. If you want to estimate the prevalence of lead in a population of 1000 homes, you need a very similar number of random samples as you need to do the same thing for 100,000 homes.”
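To make that intuition concrete, here is a minimal sketch (not BlueConduit’s methodology) using the textbook sample-size formula for estimating a proportion, with a finite population correction. The number of random inspections needed barely grows as a community goes from 1,000 homes to 100,000.

```python
import math

def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Samples needed to estimate a proportion (e.g. lead prevalence)
    within +/- margin at ~95% confidence, with finite population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2            # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))   # adjust for small towns

for homes in (1_000, 100_000):
    print(f"{homes:>7} homes -> {sample_size(homes)} random samples")

# Roughly 278 samples for 1,000 homes vs. roughly 383 for 100,000 homes.
```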
Aaron Beckett, a software engineer with BlueConduit, agrees. “Having experts use their human eyes to look at all of this stuff doesn’t scale well.” But he goes on to add that, while water utilities need a solution that is cost-effective and scales well, they also need one that’s readily explainable:
“There have been a lot of historical equity gaps around lead service lines in underserved communities. As a result, there’s been a lot of trust lost. So we can’t just make a mystery machine that produces right answers. We have to have one that produces right answers that make sense.”
Aaron Beckett, Software Engineer with BlueConduit
Properly deployed, predictive machine learning can bridge these gaps, accelerating the work of modernizing our water systems in an explainable manner that makes the best use of our scarce resources.
Building a Machine Learning Platform to Scale Data Science
To address the challenges facing our water systems, a machine learning platform needs to be able to bring together a variety of algorithms.
For example, neural networks are able to do some things people do “intuitively,” like reading handwriting despite its inherent variability. This is precisely the task that lies at the heart of any big water service line project: digitizing and sorting paper records.
Over the last several years, neural networks have proven themselves able to replace humans in this tedious, often maddening, task—even when the person writing was in a hurry and failed to cross some “t”s, dot some “i”s, or close the occasional “o.” We can now hand off the “grunt work” of reading and transcribing typed and handwritten municipal tap cards to a deep learning system, which can sort through massive amounts of messy data and unstructured signals to create usable datasets.
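To show the idea in miniature, the sketch below trains a small convolutional network to read handwritten digits from the public MNIST dataset. It is a toy stand-in, not the tap-card pipeline itself, but it shows the kind of model that learns to cope with variable handwriting.

```python
# Toy illustration only: a small convolutional network learning to read
# handwritten digits, standing in for the messier task of reading tap cards.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # add channel dimension, scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),   # one class per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, validation_data=(x_test, y_test))
```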
But, as anyone who has played with DALL-E can attest, if you want explainable answers, a neural network is not the place to go.
After the neural network has slogged through the tedium of making human records into machine-readable datasets, it hands the results off to decision trees and other algorithms. These can then begin to create maps and make easily explainable predictions as to where we should look for lead service lines.
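As a hedged illustration of why tree models are the explainable half of that hand-off, the following sketch fits a small decision tree on made-up features (home age and a digitized record flag, both hypothetical) and prints the resulting if/then rules, the kind of output a utility can walk a resident through.

```python
# Illustrative sketch only: synthetic data and hypothetical feature names,
# not BlueConduit's actual model or features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 500
year_built = rng.integers(1880, 2020, n)
record_says_lead = rng.integers(0, 2, n)   # flag from the digitized tap card

# Synthetic ground truth: older homes with a "lead" record are more likely to have lead.
has_lead = ((year_built < 1950) & (record_says_lead == 1)) | (rng.random(n) < 0.05)

X = np.column_stack([year_built, record_says_lead])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, has_lead)

# The fitted tree can be printed as plain if/then rules.
print(export_text(tree, feature_names=["year_built", "record_says_lead"]))
```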
Partnering with AWS to Create Enterprise-Grade Machine Learning Solutions
Machine learning data analytics is a powerful tool, but it is not yet a “turnkey” solution.
“Neural networks are very complicated,” Aaron Beckett notes. “It takes a lot of expertise to effectively deploy them, especially to a specific problem like tracking down lead water lines. Amazon Web Services (AWS) has already been involved in a lot of this work, and so we can inherit the benefits of that expertise. For example, they have a pre-trained service that can look at images of tap cards or written records and extract data. We can, in turn, pass those results to our own algorithms and machine learning models at BlueConduit. By working with AWS, we get to inherit decades of expertise. We can then focus on what we’re good at—while we benefit from what other people are good at. AWS is a scaling tool for me to do my job, as an engineer, just like I’m a scaling tool for data scientists to do their jobs. I can do more with less—I can build more and build better—because we’re leveraging tools that scale well using AWS. If we can make processes that scale well, then our impact is bigger.”
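The quote doesn’t name the pre-trained service, but Amazon Textract is the kind of AWS OCR offering being described. A minimal sketch, assuming a scanned tap card saved locally as tap_card.png and configured AWS credentials, might look like this:

```python
# Hedged sketch: extracting text from a scanned record with Amazon Textract.
# Assumes AWS credentials are configured and "tap_card.png" exists locally.
import boto3

textract = boto3.client("textract")

with open("tap_card.png", "rb") as f:
    response = textract.detect_document_text(Document={"Bytes": f.read()})

# Pull out the recognized lines of text; a structured version of this output
# is what downstream models (like the decision trees above) would consume.
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
print("\n".join(lines))
```

The extracted text would then be passed to BlueConduit’s own algorithms and machine learning models, as Aaron describes.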
Aaron goes on to explain that “AWS is the innovator, they’re the builders’ platform … and also an ownership platform. They give you the best tools to own and maintain complex systems in-house at scale.” Maintaining those complex systems in-house is extremely important if you want to offer communities clear and explainable recommendations.
“AWS has a partnership mindset, as a company. Part of the way they deliver value to end users is to work with partner organizations that build on top of their platform. They understand how to have effective strategic relationships with people who help both sides, at the emotional and delivery level, not just to make money.”