Why Let a Computer Tell Me Where to Dig? Separating the Benefits of Data Science from “AI” Snake Oil

As an increasing number of non-specialists have gained access to AIs like ChatGPT and DALL-E, many of us have begun to wonder: 

“Is ‘machine learning’ just digital snake oil?” 

At first glance, these machine learning models seem incredible: a simple prompt and a click of a button yield mind-blowing art and ready-to-run articles. Who knew that so many celebrities had advanced degrees! According to the most prominent content-generating AIs, Ryan Reynolds has an MFA in creative writing. (Spoiler alert: He doesn’t.)

But the shine wears off quickly.  

In fact, reading articles written by AI gets to the crux of what has many people worried about rapid AI adoption: more often than not, AI-written articles are flat-out factually incorrect. (Despite what the AI chatbot says, Reynolds dropped out of Kwantlen Polytechnic University without earning a degree or even completing a class.)

With this kind of track record, it’s entirely reasonable for those in charge of a city’s water resources to be doubtful about the benefits of machine learning when it comes to reliably tracking down lead service lines (LSLs).

But, given the scope of the environmental and water quality issues across the United States, we need all the help we can get. We don’t have the resources to spare on digging up a service line only to discover it was replaced 20 years ago. And, as a society, we can’t afford to accidentally leave lead in the ground.  

Good predictive analytics can sift key insights out of big data, helping us choose where to dig to do the most good as efficiently as possible.

Addressing the Shortcomings of Predictive Modeling and Artificial Intelligence

Jared Webb, Chief Data Scientist at BlueConduit, understands the hesitancy among some water management professionals when it comes to machine learning and predictive modeling. 

“When you say, ‘We’re starting in this neighborhood instead of that neighborhood because of an artificial intelligence model’—people are right to be skeptical. It’s not like AI and big data have lived up to their promise in a lot of ways.”

But communities working to address water quality issues (often of unknown severity) need more tools. Efficient lead service line replacement is not just a large physical task; it is also a data mining exercise. It demands the analysis and integration of often very messy datasets: paper records, customer data, building history, demographics, water samples, and more.

As Aaron Beckett, Software Engineer with BlueConduit, notes, “Having experts use their human eyes to analyze all of this stuff doesn’t scale well. Using predictive modeling—and especially a predictive model that gives clarity into why it made the guess it did—that scales better, because it can be automated. If we can make processes that scale well, then our impact is bigger.”

Importantly, machine learning can do this analysis more quickly and less expensively than existing labor-intensive approaches.

Doing Data Science Right

The reason most of the AI people are playing with right now proves disappointing is that it isn’t deployed strategically. When you start chatting with an AI-powered chatbot about Emily Dickinson’s favorite football team, or even her football career, you’re getting relatively raw, unfettered output from a neural network trained to speak confidently, if not accurately.

That’s not how BlueConduit uses AI. BlueConduit carefully applies a collection of technologies and strategies to LSL replacement and water resource problems, matching the right machine learning approach to the right part of the job.

For example, BlueConduit uses neural networks (similar to what’s driving those chatbots and AI “artists”) for “intuitive” tasks, like pre-processing and transcribing messy paper records and hand-written tap cards. Those transcriptions then join other datasets and are fed into regression and classification models that integrate and evaluate all of this structured and unstructured data. From there, the algorithm can begin to make predictions.
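To make that second stage concrete, here is a minimal sketch in Python with scikit-learn. It is an illustration, not BlueConduit’s actual code: every column name and value below is hypothetical, standing in for fields transcribed from tap cards and joined with other city records.

```python
# Hypothetical sketch: predict which parcels likely have lead service lines.
# Column names and values are invented for illustration only.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# A few parcels whose service line material has been verified by digging.
parcels = pd.DataFrame({
    "year_built":     [1905, 1948, 1972, 1995, 1931, 1960],
    "card_says_lead": [1, 1, 0, 0, 1, 0],      # transcribed from paper tap cards
    "assessed_value": [42000, 55000, 80000, 120000, 38000, 67000],
    "material":       ["lead", "lead", "copper", "copper", "lead", "copper"],
})

X = parcels[["year_built", "card_says_lead", "assessed_value"]]
y = (parcels["material"] == "lead").astype(int)   # 1 = lead service line

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# The useful output is a probability per parcel, not a simple yes/no verdict.
print(model.predict_proba(X)[:, 1])
```

The key design point is the last line: ranking parcels by probability of lead, rather than issuing a flat verdict, is what lets a utility prioritize digs.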

These algorithms are then brought together as “ensemble models”: aggregate predictive models built from different combinations of individual models, whose pooled prediction is typically more reliable than any one model’s.
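Here is an equally stylized sketch of the ensemble idea, again hypothetical rather than BlueConduit’s production setup. With scikit-learn’s VotingClassifier, two different model types average their predicted probabilities:

```python
# Toy ensemble: average the probability estimates of two different models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; real inputs would be parcel features like those above.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities instead of hard votes
)
ensemble.fit(X, y)
print(ensemble.predict_proba(X)[:5, 1])  # lead probability for the first 5 parcels
```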

Most importantly, and in contrast to the “novelty” AIs busily making smeary art and writing inaccurate blog posts, no one at BlueConduit treats the result of these initial predictive analytics as the final product. It is the first step in an iterative cycle. Those initial predictions are tested against real-world observation guided by sound statistical methods (i.e., actually digging up service lines in areas where establishing ground truth will give us the most useful new data). Those findings are then continuously fed back into the predictive model by BlueConduit data scientists, so that the model can find new correlations and insights, steadily improving its predictions.
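That feedback loop can be sketched as a simple active-learning cycle. This is a stylized stand-in for the statistical methods described above; the dig-site heuristic used here, uncertainty sampling, is a common textbook choice, not necessarily BlueConduit’s:

```python
# Stylized feedback loop: train, pick the most informative parcels to dig,
# add the ground truth, and retrain. All data here is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, true_material = make_classification(n_samples=500, n_features=6, random_state=0)

rng = np.random.default_rng(0)
verified = list(rng.choice(len(X), size=40, replace=False))  # parcels already dug

for round_num in range(3):
    model = RandomForestClassifier(random_state=0)
    model.fit(X[verified], true_material[verified])

    # Uncertainty sampling: probabilities near 0.5 mean a dig there
    # teaches the model the most.
    lead_proba = model.predict_proba(X)[:, 1]
    uncertainty = np.abs(lead_proba - 0.5)
    candidates = [i for i in np.argsort(uncertainty) if i not in verified]

    verified.extend(candidates[:20])  # field crews establish ground truth
    print(f"round {round_num}: {len(verified)} parcels verified")
```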


About BlueConduit

BlueConduit pioneered the predictive modeling approach to lead service line identification and replacement. Through BlueConduit data science, utilities, municipalities, government agencies, and consultants standardize, predict, report, and communicate key information about lead.