I interviewed BlueConduit’s VP of Data Science, Alice Berners-Lee, to learn more about how BlueConduit’s expert Data Science team thinks about and defines accuracy in predictive modeling. This interview has been edited and condensed for clarity.
Elana Fox: Tell me about predictive model performance.
Alice Berners-Lee: When we’re training a model, the goal of evaluating the model performance is to try to simulate and test how the model is going to be used when it actually goes out into the real world. So we do a lot of things like training the model on some of the data while holding out a certain amount of the data, and then testing the model on the data we held out. And we do that in a sort of random way where we’re just iterating and looking at how the model performs across a lot of different metrics, including accuracy, recall, and precision. And then we also test the model in more targeted ways on certain features, and sometimes that differs from customer to customer.
Depending on the situation, there might be other things you want to test to make sure your model can generalize: if your model depends really heavily on one or two features, you want to check that those features will actually be available in the real world.
Let’s take a silly hypothetical example. It could be that – and maybe you don’t know this – when field verification was happening, the people doing the verifications also knocked on the homeowner’s door and asked them how many bedrooms they had (and that’s your only way of knowing how many bedrooms there are in the house). Then, you have this good information about how many bedrooms there are in these homes, so you build a model and you say ‘wow, you’re way more likely to have lead if you have fewer than four bedrooms’ or something like that.
And then you go and you bring that to the rest of the community but you don’t have any idea how many bedrooms any of them have. So when you’re training and testing your model, you’d come out with really good accuracy because all of your data had that information. But then when you go out into the real world, that information is actually not useful at all. You need to make sure you have everything that you need to be able to generalize across different situations in the rest of the population.
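To make that hold-out idea concrete, here is a minimal sketch of this kind of evaluation in Python. The file name, feature columns, and choice of model are hypothetical stand-ins rather than BlueConduit’s actual pipeline; the point is simply that the model is scored on verified records it never saw during training.

```python
# A minimal sketch of a hold-out evaluation on verified service line records.
# The file name, feature columns, and model choice are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# One row per verified service line, with a label: 1 = lead, 0 = not lead.
df = pd.read_csv("verified_service_lines.csv")                # hypothetical file
features = ["home_age", "assessed_value", "pipe_diameter"]    # hypothetical features
X, y = df[features], df["is_lead"]

# Hold out a quarter of the verified data and score the model only on that.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy: ", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
```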
Elana Fox: You mentioned 3 metrics – recall, precision, and accuracy. Can you define them for me?
Alice Berners-Lee: Recall says, out of all of the lead lines, how many did you find? Recall is really important for lead service lines because we don’t want to unintentionally leave lead in the ground.
Precision then asks, out of all the times you said there was lead, how often was there actually lead? This gets at how many false positives you are producing.
It is important to look at precision and recall side by side because it is easy to optimize one if you ignore the other. For example, if you have a water system with 100 service lines and 10 of them are lead, you could optimize recall by predicting that every service line is lead. In this example, you found all 10 of the lead service lines, so your recall is perfect. But your precision will be terrible – only 10 of the 100 lines you called lead actually were lead – because you predicted a lot of lead that didn’t exist. And that has big costs in the real world when you think about the expense and time needed for digging and replacements. When we ship our models, we work with the customer to understand what matters most to them and to optimize across both precision and recall.
For me, when I think about accuracy, the real test is when you actually start digging and testing the predictions in the real world. We measure this using hit rate; hit rate is precision out in the real world, and it measures how often a lead pipe is actually found, in the ground, when it was predicted to be there.
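Worked out in code, the 100-line example above looks like this; the dig counts in the hit-rate calculation at the end are made-up numbers purely for illustration.

```python
# Toy numbers from the example above: 100 service lines, 10 of them lead,
# and a strategy of predicting that every line is lead.
n_lead = 10
predicted_lead = 100       # every line flagged as lead
true_positives = 10        # all 10 real lead lines are among the flagged ones

recall = true_positives / n_lead              # 10 / 10  = 1.0 -> perfect recall
precision = true_positives / predicted_lead   # 10 / 100 = 0.1 -> terrible precision
print(f"recall={recall:.2f}, precision={precision:.2f}")

# Hit rate from the field (hypothetical dig results): of the excavations at
# predicted-lead addresses, what fraction actually turned out to be lead?
digs_at_predicted_lead = 40   # hypothetical number of excavations
lead_found = 34               # hypothetical number that really were lead
print(f"hit rate={lead_found / digs_at_predicted_lead:.2f}")
```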
Elana Fox: When you are building the models for each customer and testing model performance, how do you decide which model to use?
Alice Berners-Lee: It’s a combination of a lot of things. A lot of times, in a very early stage of the modeling, we are comparing a few models and asking, “Why aren’t these performing in a certain way?” We then go back to looking at the features and the dataset, and we try to improve the data in some way. And then, based on that adjustment, we get our next batch of models. And we continue to compare the next batch to the last batch.
And then when you are looking at models within the same batch – same underlying data, no new features that you’ve engineered or anything like that – it comes down to differences in the parameters of the models and differences in the metrics. Often it is pretty obvious that one of the models is performing better than the others across all or most of the metrics. But sometimes that’s not true and there are a handful of models that are doing really similar things. Honestly, then it depends on what might matter more in terms of the tradeoff between precision and recall, or maybe what the models are being used for in the real world, etc. Or maybe we don’t care that much about the calibration right now…
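As a rough sketch of what comparing a batch of candidate models across several metrics can look like, here is one way to do it with cross-validation. The candidate models and their parameters are purely illustrative, not BlueConduit’s, and X and y are the feature matrix and labels from the earlier hold-out sketch.

```python
# Compare a batch of candidate models on the same data across several metrics.
# The candidates and parameters below are illustrative only.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# X, y: the feature matrix and lead/not-lead labels from the earlier sketch.
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=["accuracy", "precision", "recall"])
    print(f"{name:18s} "
          f"acc={scores['test_accuracy'].mean():.2f} "
          f"prec={scores['test_precision'].mean():.2f} "
          f"rec={scores['test_recall'].mean():.2f}")
```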
Elana Fox: What is calibration?
Alice Berners-Lee: Calibration is the idea that the probability that you are assigning is actually representative of the true probability. One of the important takeaways from any model that we deliver is that you can list the probabilities from highest to lowest and then that’s the customer’s priority list, assuming they have no other concerns beyond probability of lead. And that list could be from 0.98 to 0.02, but it could also be from 0.5 to 0.2 and you wouldn’t care because it is just the highest priority to the lowest priority. But if you care about what that probability means, then you care whether those top ones are a 98% chance versus a 50% chance. So you can see how calibration is really relevant for our customers, especially related to compliance for their Service Line Inventories.
For example, say you take 10 houses that each have a probability of 0.1; only 1 in 10 of them should have a lead service line. Not 0 in 10, and not 2 in 10. Similarly, if you take a group of 50 houses that have a 0.5 probability of lead, 25 of them should have lead. So we look at these curves to see if the model is over-predicting or under-predicting, and you can have non-linearities in there, too. When you have a good model, you should see a good match across lots of iterations of this type of sampling: the probability you assign to a single point turns out, on average, to be true across a lot of points.
But you have to look at the average. If a home has a 0.6 likelihood of lead and then you dig and it isn’t lead, that prediction can still be very accurate if you look at 9 more homes that have a 0.6 predicted likelihood of lead and, across all 10, 6 of them are actually lead. So you have to look at large groups of predictions, across lots of different samples, to really assess calibration.
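Here is a minimal sketch of that kind of grouping check: bucket the predictions by their predicted probability and compare the predicted rate of lead with the observed rate in each bucket. The probabilities and outcomes below are simulated (so the model is well calibrated by construction); on real verification data, the gap between the two columns is what tells you whether the model is over- or under-predicting in each range.

```python
# Check calibration by grouping predictions into probability buckets and
# comparing predicted vs. observed rates of lead. Data here is simulated.
import numpy as np

rng = np.random.default_rng(0)
predicted_prob = rng.uniform(0, 1, size=5000)        # model's predicted probability of lead
actual_lead = rng.random(5000) < predicted_prob      # simulated dig outcomes

bins = np.linspace(0, 1, 11)                         # buckets 0.0-0.1, 0.1-0.2, ..., 0.9-1.0
bucket = np.digitize(predicted_prob, bins) - 1
for b in range(10):
    in_bucket = bucket == b
    if in_bucket.any():
        print(f"predicted ~{predicted_prob[in_bucket].mean():.2f}  "
              f"observed {actual_lead[in_bucket].mean():.2f}  "
              f"(n={in_bucket.sum()})")
# For a well-calibrated model the predicted and observed columns track each
# other; a consistent gap means over- or under-prediction in that range.
```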
Elana Fox: This leads me right to my next question. What do you wish water systems better understood about accuracy in our work?
Alice Berners-Lee: It is helpful when customers come to us with specific concerns, because when we hear a concern from them, we can usually translate it into a metric.
Recall, to me, seems like a metric that a lot of customers care about because we don’t want to leave lead in the ground. But sometimes customers will say something that makes me realize, oh, they’re actually really worried about false positives; they don’t want to dig up anything that isn’t lead. So then you know to keep a close eye on that.
The more specifically a customer can talk about what they are going to use the model for and what their priorities are, the better we understand their needs and the better we can give them metrics that show where they stand on the things they are specifically worried about. This is a much more meaningful way for us to engage with our customers than when they just ask, “What is the accuracy of the model going to be?”
Learn more about using predictive modeling for Service Line Inventory material classifications and LCRR compliance. Schedule a consultation today.