When a Representative Sample of Service Lines Reveals Zero Lead, What Can You Say?

It’s easy to find evidence for a risk of lead exposure from water service lines: once you find a lead service line, you know there is some risk. But how do we show that there isn’t a meaningful risk of lead in our water system?

In some communities, water utility personnel with decades of experience have never come across a lead service line in their water system. They can then go on to complete hundreds of hydrovac inspections and a representative sampling of their tens of thousands of water service lines, and still not find any lead. Such scenarios are more often the case in states with many newer communities, where lead service lines have been less common. Those communities rightfully wonder when it’s safe to stop digging holes and focus their limited resources elsewhere to better protect public health equitably.

Given that BlueConduit’s data science, including its predictive modeling, has been used to help many communities accurately estimate how many lead service lines they have and where these are likely to be located, it’s fair to ask a few questions, which we have heard across water utilities, engineering firms, and state regulators alike, and which we will address here:

How can you prove with 100% confidence that my water system does not have any lead service lines?

You can’t! But that’s not the fault of any method; it’s just a fact. You cannot have 100% certainty about this issue. We prefer to talk about how you can characterize that uncertainty with the data available, in order to make better decisions.

We would prefer to say that data science can help many communities to show how few lead service lines remain. But while we can assist such communities, we should call this a statistical analysis approach, rather than predictive modeling.

The terms “predictive modeling” and “statistical analysis” are often used interchangeably, but they refer to different, though overlapping, sets of methods. Not all statistical analyses are predictive modeling.

Predictive modeling is not the right fit here because it builds a model from examples of each verified service line material you want to predict (e.g., lead and not lead), and the entire problem is that, despite their best efforts, these utilities have not found any evidence of lead service lines at all.

Ultimately, they still need to say something to their state regulator and to the public about the presence or absence of lead. But how many service lines should they dig up in search of that first lead pipe? Where should they dig to build the strongest evidence that there is no lead in their system? Fortunately, BlueConduit has other tools and approaches that can help you determine when it’s likely safe to stop looking for lead.

When a representative sample of service lines reveals zero lead, what can you say? 

We all know that “the absence of evidence is not evidence of an absence.” That truism is the cornerstone of good research, healthy business operations, and fair governance. 

Meanwhile, in your areas of expertise, your lived experience has given you a sense of when you’ve looked for something long enough, or when you should keep digging. 

Consider a water system that has 30,000 service lines. The utility workers say they have never seen a lead service line (which, for the purposes of this piece, includes any service line with a lead portion, a galvanized portion, or a lead gooseneck). Already, 10,000 service lines have reliably verified materials showing that both portions are not lead (including some installed before 1986 and all installed after 1986). The remaining 20,000 service lines, whose materials are not fully known, can be considered “unknowns.”

Workers have a variety of reasons for believing there are no lead service lines, along with some lingering concerns. The following are examples you may encounter. Most development happened in the 1950s and 1960s, when they recall hearing that lead was no longer favored by local contractors, with a new expansion in construction in the 2000s. Some people are worried about a neighborhood that historically had water quality issues in the 1990s. Others point to the older homes built during the 1940s, when they know lead was often used in their state. Assuming law-abiding construction and plumbing installations, you can exclude homes newer than 1986 (or an earlier date in your locality), as long as you have reliable data on the year that public- and private-side service lines were actually installed.

How confident can the utility be, based on these ideas, in telling the public and the regulator that it has nearly no lead service lines? And how confident can it be in saying, “There are fewer than 300 lead service lines among the 20,000 unknowns”?

The best way to answer these questions, based on established practices in statistics ranging from political polling and the Census to market research and medical clinical trials, is to start with a representative sample. The purpose of the representative sample is to reflect the greater mix of service lines that have not yet been verified across your service area’s population.

So, they randomly select 200 unknowns, giving each of the 20,000 unknowns an equal chance of being inspected. They inspect those 200 service lines and find zero examples of lead on any portion. Then the problem is: how can you estimate the amount of lead in a system when it has never been detected?
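The equal-chance selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SL-00000` style IDs, the function name `draw_representative_sample`, and the fixed seed are hypothetical and do not describe BlueConduit's actual tooling.

```python
import random

def draw_representative_sample(unknown_ids, sample_size, seed=None):
    """Simple random sample without replacement.

    Every unknown service line has the same chance of being selected.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible and auditable
    return rng.sample(unknown_ids, sample_size)

# Hypothetical IDs standing in for the 20,000 unknown service lines.
unknown_ids = [f"SL-{i:05d}" for i in range(20_000)]

to_inspect = draw_representative_sample(unknown_ids, 200, seed=2024)
print(len(to_inspect), len(set(to_inspect)))  # 200 200
```

Recording the seed alongside the list of selected addresses is one way to let a regulator reproduce the draw later.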

What can we conclude about the lead service line inventory using data science?

Fortunately, this is not a new problem. One way it has been addressed in other fields (like clinical studies for medical treatments, where it is important to uncover rare side effects or adverse reactions) is to model the count of these rare events with a binomial distribution. This is similar to tossing a coin several times: each inspection, like each coin toss, can have only one of two outcomes, in this case that the service line is lead or that it is not lead. With this approach, we can draw on decades of existing statistical work on how to reason when information is scarcer than we’d like.
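To make the coin-toss analogy concrete, here is a minimal Python sketch of the binomial model for the zero-lead case. It treats inspections as independent draws, which is an approximation when sampling without replacement from a finite pool of 20,000. The function name `prob_zero_lead` and the candidate lead rates are assumptions chosen for illustration.

```python
def prob_zero_lead(true_lead_rate, n_inspections):
    """Binomial probability of finding no lead in n independent inspections."""
    return (1.0 - true_lead_rate) ** n_inspections

# How likely is a lead-free sample of 200 under different hypothetical lead rates?
for rate in (0.001, 0.005, 0.015, 0.03):
    p_zero = prob_zero_lead(rate, 200)
    print(f"true lead rate {rate:.1%}: P(zero lead in 200) = {p_zero:.3f}")
```

Under this model, if 1.5% of the unknowns were lead, a lead-free sample of 200 would turn up a little under 5% of the time, which is why such a result makes rates that high implausible.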

We’ve developed a methodology using the representative sample to estimate the level of risk throughout the system. 

  • First, after evaluating your system and data, we support you in deciding where to inspect and how many investigations should be performed. This recommended number is based on best practices in statistics or set to align with your state’s requirements for the use of statistical methods.
  • Next, we incorporate the results of those field verifications as well as the desired level of confidence (e.g., 95%). The accepted standard confidence level across several states is 95%.
  • As a result, BlueConduit can provide you with statistical documentation to justify the highest possible percentage of lead service lines you could reasonably expect to find in your water system. We can also help you understand what this means for the possibility of lead at the address-level for each of your unknown service lines.
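As a rough illustration of the "how many investigations" question in the first step, the sketch below inverts the binomial zero-lead probability to find the smallest number of lead-free inspections that would support a target claim. It assumes every inspection comes back non-lead and that inspections are independent. The function name `inspections_needed` is hypothetical, and a state's own rules may prescribe a different number.

```python
import math

def inspections_needed(max_lead_rate, confidence=0.95):
    """Smallest n for which n lead-free inspections support the claim
    'the lead rate is below max_lead_rate' at the given confidence.

    Solves (1 - max_lead_rate) ** n <= 1 - confidence for n.
    """
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - max_lead_rate))

for target in (0.015, 0.01, 0.005):
    print(f"to claim < {target:.1%} lead at 95% confidence: "
          f"{inspections_needed(target)} lead-free inspections")
```

The required number grows quickly as the target rate shrinks, which is the trade-off between digging cost and the strength of the claim.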

Returning to our example above, after they perform 200 inspections and find no lead, we can evaluate the data to say what the maximum number of lead service lines among the 20,000 unknowns is likely to be. With 95% confidence we can say there are no more than about 300, but with 99% confidence we can say there are no more than about 460.

As they continue to gain more information (say, another 50 representative service lines are inspected and found to be not lead), they can update those numbers and be 95% confident that fewer than 240 are lead.
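The 95% figures in this example can be reproduced with a standard calculation for the zero-event case. The sketch below is one common way to do it (the exact one-sided binomial bound, alongside the "rule of three" shortcut) and is not necessarily the exact procedure BlueConduit uses. The function name `upper_bound_lead_rate` is an assumption.

```python
def upper_bound_lead_rate(n_inspections, confidence=0.95):
    """One-sided upper confidence bound on the lead rate when
    n independent inspections have all come back non-lead."""
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_inspections)

unknowns = 20_000
for n in (200, 250):
    exact = upper_bound_lead_rate(n)
    rule_of_three = 3 / n  # common shortcut for the 95% bound with zero events
    print(f"n={n}: exact bound {exact:.2%} (~{exact * unknowns:.0f} lines), "
          f"rule of three {rule_of_three:.2%} (~{rule_of_three * unknowns:.0f} lines)")
```

With 200 lead-free inspections the bound is just under 1.5% of the 20,000 unknowns, or roughly 300 lines. With 250 it falls to about 1.2%, or just under 240.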

Data Science Can Inform Action

The absence of evidence of lead alone cannot “prove” an absence of lead altogether—but it’s an important part of the picture.  

When a representative set of unknown service lines is inspected (on both portions of the service line) and not even one of them turns out to be lead, we can use our methods to characterize the uncertainty about that water system’s remaining unknowns. These calculations can guide community and utility action even when it is very likely that there is a very low number of lead service lines in the system.

First, the system should continue its focus on protecting public health through regular water testing at homes and schools, and through coordination with departments of health. Locations where testing reveals elevated water lead levels should have their service lines inspected.

Second, if possible, service line inspections should continue at representative locations, using methods such as hydrovac or potholing. Each additional inspection builds the body of evidence and lowers the uncertainty around the absence of lead.

Third, further service line investigation in communities, even where no lead has been discovered, might also include in-home surveying, where customers are asked to head down to the basement and visually inspect their service line connections entering their homes, perhaps testing with a scratch test. This can be coordinated with public communication and even collected through crowd-sourcing via mobile uploads. 

But even approached holistically, the result is never going to produce complete certainty. No one can “prove” there are no lead service lines in a water system without digging up and visually inspecting every single length of every single service line. Water systems and regulators alike recognize that 100% physical verification is not feasible from a time or cost perspective. 

Several states have adopted these methods to support inventory development, accepting them as sufficient evidence for classifying service lines as non-lead.

Clear Communication, Transparency, and Caution

What we can do is bring together data science, machine learning, good statistical analysis, and a lot of experience in many diverse communities across the United States, and share those best practices with communities.

Communication should involve community partners. Utilities should take care in coordinating with local community organizations to explain what is known, where work is happening, and what is not yet known. Clarifying the remaining uncertainties, inviting community members to help distribute information, and engaging them in collecting more information about service lines have together proven effective in a number of service line inventory and replacement programs to date. This is especially important when there is concern about lead but no lead has been found.

Utilities are best set up for success when they explain to community members and regulators the measures they’ve taken and the underlying assumptions that allow them to conclude something like:

“We have conducted more than 200 hydrovac inspections of a representative sample of service lines and found none had lead. A representative sample means that each service line with an unknown material had an equal chance of being randomly selected to be inspected. Based on those 200 inspections, and a standard statistical calculation, these numbers suggest there is a 95% chance that fewer than 1.5% of the unknowns in the system are lead.”

Once all of the context is clear, if this is an acceptable threshold to their key stakeholders, then they can begin to allocate resources elsewhere, while still remaining prepared to re-examine those assumptions and change course as new data becomes available.

Here’s How BlueConduit Can Help You

  • Analyze your records and associated data
  • Ensure proper and representative field investigations
  • Generate statistical results and predictions for informing your inventory
  • Provide documentation for your regulator/state primacy agency

If you would like to learn more about how BlueConduit can support your water utility, contact us or schedule a demo!

About Alice Berners-Lee, Ph.D.

Alice Berners-Lee is the Director of Data Science at BlueConduit where she designs machine learning models for customers' data. She's passionate about using her scientific training to help communities and the environment. She has a PhD in Neuroscience from Johns Hopkins School of Medicine, and before joining BlueConduit she completed postdoctoral fellowships at both University of California, Berkeley and Harvard University.

About Jared Webb

Jared Webb is BlueConduit’s Chief Data Scientist. His responsibilities include processing and analyzing customer data, managing relationships with technical service partners, and producing machine learning results. He has been a member of Dr. Schwartz and Dr. Abernethy’s team since 2016 and has served as Chief Data Scientist since the formation of BlueConduit. Jared received his undergraduate and master’s degrees in applied mathematics from Brigham Young University, where he focused on the mathematical foundations of machine learning models.

About Eric Schwartz, Ph.D.

Eric Schwartz is a professor of marketing at the Ross School of Business at the University of Michigan. With over 10 years’ experience in data science and predictive modeling, he is a pioneer in the realm of predicting customer behavior. Recently, his focus has been on applying his expertise in data science for public good, beginning with work on water quality and infrastructure issues related to remediation efforts in Flint, MI. His efforts have been recognized by many as the standard for identifying LSLs. Schwartz is a co-founder of BlueConduit.