Everyone Into the (Data) Pool
More data in your models isn’t always better
December 2023
In 1998, John R. Mashey—then the chief scientist at Silicon Graphics—gave a presentation in which he became one of the first people to use the term “big data” the way we use it today. What counted as “big” back then now seems quaint: by most estimates, we store at least 1,000 times more data than we did two decades ago. And we don’t just have more data at our disposal today—we also have more powerful tools to make use of it. For example, with predictive analytics, it is often possible to spot patterns in high-dimensional datasets that no human could discern.1
Often, data scientists build a predictive model to exploit an existing dataset. But as machine learning (ML) becomes ubiquitous, we are increasingly likely to incorporate additional features from new datasets into existing models. There are good reasons to do this. Sometimes, adding a dataset provides insight into associations that would be undetectable if data had remained siloed.2
With our general enthusiasm for more data and those potential synergies in mind, it’s easy to think that if one dataset is good, two (or more) datasets would be even better. But if we keep accumulating data, will there come a point at which the essential risk-pooling principle of insurance breaks down?3
This probably won’t happen. However, we may have to change how we conceptualize risk pooling in the insurance domain. Traditionally, we have considered a risk pool to be a set of people who share common characteristics and are subject to a specific risk. That is, there may be risk rating within the pool, but it is performed over a set of homogeneous subpopulations, broadly categorized by gender or age, for example. Within each subpopulation, the individuals are homogeneous with respect to risk classification.
With enough predictors, this notion inevitably breaks down. Every individual in the pool ends up with a unique risk assessment (i.e., a tailored prediction) because their particular combination of features differs from everyone else’s. Each person is unique from a risk perspective. That does not mean risk is no longer pooled across common features, however. Indeed, pooling is the only way we can determine the impact of any attribute: even though no two individuals in the risk pool share the exact same set of features, the contribution of each feature to the final prediction is estimated by pooling many observations together. That’s how insurance works, and that’s how data science works. We may keep getting finer predictions, but there is a practical limit.
Likewise, adding data to a model doesn’t always improve its performance or justify the costs, financial and otherwise. As we incorporate more data types, each addition tends to improve the predictions less, as though we are approaching a limit of predictability. So, our pools should continue to get finer and our predictions better (this seems baked in at this point), but we are not destined for a future in which an actuary or underwriter shakes a magic eight ball and declares the day you will die.
I’ve encountered this diminishing marginal utility of data frequently in my career. I ran across it again recently when performing a research project with a team of data scientists and actuaries to test whether adding consumer data or credit attributes to an existing model (one that uses prescription and medical claims history to predict group health claims) would improve the model’s ability to predict claims costs. While good reasons exist to incorporate new datasets into existing models, as mentioned previously, adding more data doesn’t necessarily improve predictive performance.4
The kinds of consumer and credit data I examined have proven useful in insurance products that one might intuitively think are adjacent to group health. For example, in long-term care insurance, carriers use consumer data-driven models (along with other data elements) to identify candidates for health interventions. The use of credit data is common in life insurance, and consumer data has been shown to predict hospital inpatient stays and emergency department use. So, it was not far-fetched to imagine that enriching our model with these new data types might improve its performance or yield more granular insights than claims data alone. Yet, after adding hundreds of features and retraining our predictive model, the performance improvement was almost too small to measure.
You might conclude that those variables were simply not predictive of the outcome. (Admittedly, they are not as predictive as medical data.) But when we looked at consumer or credit data in isolation, each was at least somewhat predictive of the risk. So, if a model using only consumer or credit data can help predict morbidity and costs in group health insurance, why wouldn’t adding these datasets to an existing model offer at least some marginal benefit?
Logically, we know that the more accurate a model is, the less room there is to improve its predictive power. Still, it may not be quite as simple as diminishing marginal utility. Often, a better explanation comes down to redundancy among the datasets. Not superficially—the datasets have completely different attributes—but in a more fundamental sense.5
It’s Raining Information
Suppose I want to predict whether it’s currently raining. I could build a model with two features. The first feature is whether water is rolling down my window; we’ll call that W. The second is whether my spouse returns home with a wet umbrella; let’s call that U. With these two features, I can predict the variable, “Is it raining?” with almost 100% accuracy (of course, there are other, much less likely reasons both the window and the umbrella are wet). But how much of that prediction is attributable to either feature?
You could argue that if I just used the umbrella data but not the window data, I would still have almost 100% accuracy, so W must not be contributing anything as a feature. However, that cannot be right because if I saw only the window but not the umbrella, my accuracy would be unchanged, suggesting that feature U is superfluous. As a modeler, the best course of action is to recognize that both features derive from the same underlying concept; the latent variable is whether water is falling from the sky in my front yard (which is almost always rain but also could have come from a sprinkler or a kid with a Super Soaker). Knowing the two features are redundant, you can arbitrarily choose either feature, kick the other out of your model, and get on with your life.
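If you’d like to see that redundancy in miniature, here’s a minimal sketch in Python. The event probabilities, the “sprinkler” nuisance term and the choice of a logistic model are my own illustrative assumptions, nothing more:

```python
# Minimal sketch of the window/umbrella example: two features that are copies of the
# same latent signal ("water falling in my front yard"). All numbers are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 10_000
rain = rng.random(n) < 0.3                       # target: is it raining?
water_in_yard = rain | (rng.random(n) < 0.01)    # latent driver: rain or a rare sprinkler
W = water_in_yard                                # wet window
U = water_in_yard                                # wet umbrella

for label, X in [("W only", W[:, None]),
                 ("U only", U[:, None]),
                 ("W and U", np.column_stack([W, U]))]:
    model = LogisticRegression().fit(X, rain)
    print(label, round(accuracy_score(rain, model.predict(X)), 3))
# All three accuracies are identical: each feature carries the same latent signal,
# so neither adds anything once the other is in the model.
```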
If you’re with me so far, let’s examine a less simplistic example. In real life, we often have multiple features that each contribute new information and some redundant information in ways that may not be obvious. This comes up when adding new datasets to an existing model because even if the new data is valuable in isolation, it may not add value to a model that already works. You can think of this quasi-mathematically: Each variable contains shared latent concept(s) + independent information + error.
Imagine a plausible scenario in which a high credit score and routine dental visits predict health outcomes. The latent concept here is hard to define, but we could label it “high conscientiousness.” A person who manages both their oral and credit health probably also manages their physical health. The key here is that although there is no causal relation between these activities and health, they can correlate with the target variable in similar ways.
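Here’s a hedged illustration of that decomposition: a small simulation in which a made-up latent “conscientiousness” variable drives a credit-score proxy, a dental-visits proxy and the outcome. The coefficients are arbitrary, and this is not the group health model discussed earlier; it simply shows how a second proxy that is predictive on its own can add only a modest lift.

```python
# Illustrative sketch of "shared latent concept + independent information + error."
# The latent variable and every coefficient below are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 20_000
conscientiousness = rng.normal(size=n)                                    # shared latent concept
credit_score  = conscientiousness + rng.normal(scale=0.5, size=n)         # proxy 1 + own noise
dental_visits = conscientiousness + rng.normal(scale=0.5, size=n)         # proxy 2 + own noise
claims_cost   = -2.0 * conscientiousness + rng.normal(scale=2.0, size=n)  # outcome + error

X1 = credit_score[:, None]
X2 = np.column_stack([credit_score, dental_visits])
X1_tr, X1_te, X2_tr, X2_te, y_tr, y_te = train_test_split(X1, X2, claims_cost, random_state=0)

print("R^2, credit only:  ", round(LinearRegression().fit(X1_tr, y_tr).score(X1_te, y_te), 3))
print("R^2, credit+dental:", round(LinearRegression().fit(X2_tr, y_tr).score(X2_te, y_te), 3))
# Each proxy is predictive on its own, but the second adds only a modest lift, because
# most of what it "knows" about the outcome is already carried by the first.
```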
These ideas underpin a commonly used technique in ML called SHapley Additive exPlanations (SHAP) values, as presented in “A Unified Approach to Interpreting Model Predictions” by Scott Lundberg and Su-In Lee. A SHAP value measures the predictive value of a model feature by averaging its marginal contribution to the prediction across all subsets of the other features. This approach addresses the ordering problem we observed in the earlier example: the credit assigned to a feature is averaged over every order in which features could be added.
For instance, to calculate the SHAP values in our first example, there are three non-empty subsets of features to consider: U alone, W alone, and U and W together. U and W would receive identical SHAP values because the prediction is equally precise in all three scenarios, so the credit is split evenly between them. Our second, more realistic example involving oral care and credit scores would be slightly less clear-cut. Still, to the extent those two variables represent a shared latent concept (conscientiousness), they might divvy up the SHAP credit between them without improving performance overall.6
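To make the arithmetic concrete, here is a small, hand-rolled Shapley calculation for the two-feature rain example. The subset “values” are illustrative accuracies (0.70 for a featureless baseline that always predicts “no rain,” 0.99 whenever the model sees a wet surface), not output from an actual SHAP library run.

```python
# Exact Shapley values from a table of subset values v(S), to show how the credit
# for a perfectly redundant pair of features gets split. Numbers are illustrative.
from itertools import combinations
from math import factorial

def shapley(values, players):
    """Exact Shapley values given v(S) for every subset S (keys are frozensets)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (values[S | {p}] - values[S])
        phi[p] = total
    return phi

# Accuracy achievable with each subset of features (empty set = always guess "no rain").
v = {frozenset(): 0.70, frozenset({"W"}): 0.99,
     frozenset({"U"}): 0.99, frozenset({"W", "U"}): 0.99}
print({name: round(val, 3) for name, val in shapley(v, ["W", "U"]).items()})
# -> {'W': 0.145, 'U': 0.145}: the 0.29 improvement over the baseline is split evenly.
```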
So, SHAP values may be a great way to measure contribution. But, in isolation, they are not a good tool for measuring the efficacy of new data. You can add a bunch of new data to a model and, even if many of the new features rank high in importance, your model metrics might not improve. In that case, the existing variables simply have their importance reduced; all that’s really happening is that the importance of the same latent concept is being split between the old and new features.
The nature of complex models makes it hard to know in advance whether adding any particular dataset to a working model will improve its performance significantly. Typically, the only way to find out is to do the hard work of adding the new data to your full dataset, retraining, and running a suite of model validation metrics.
That’s not always easy. Running a test might involve purchasing data, and it will always take time and effort to preprocess the data and retrain the model. Additionally, since data scientists are in short supply, the work involved in adding a new data source often comes with the significant opportunity cost of delaying other important projects—not to mention the employment costs. And although you might expect some economies of scale, in my experience, adding a second data source takes about twice as long because another data source doubles the number of things that can go wrong and can seriously complicate data pipelines.
Since those costs are substantial, it’s important to have a framework for evaluating whether the addition was worth it. Sometimes, improving a correlation from 0.79 to 0.81 is significant; in other cases, that improvement may not be substantial enough to warrant the investment. So, it’s useful to have a way of translating absolute model performance into a metric relevant to the business in question.
Opportunity Costs Matter
A common problem in my work is evaluating the predictive and economic value of data from external sources for use in life and health underwriting. Since these data come with significant costs, the return on investment (ROI) is a primary consideration. I am sometimes called upon to demonstrate the data’s ability to drive substantial savings in claims costs through improved underwriting accuracy, or to reduce overhead by improving underwriting efficiency. Determining ROI may also involve placing a value on less quantifiable factors, such as ease of use.
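As one possible framing (not a method from any particular project of mine), here is a back-of-envelope sketch that converts a measured lift into dollars and nets it against data and build costs. Every number and parameter name below is a hypothetical placeholder.

```python
# Back-of-envelope ROI framing for a new data source. All inputs are hypothetical
# placeholders; a real analysis would be far more granular.
def data_roi(annual_claims: float,
             savings_per_point_of_lift: float,  # fraction of claims saved per +0.01 of correlation
             lift: float,                       # measured improvement, e.g., 0.81 - 0.79 = 0.02
             annual_data_cost: float,
             one_time_build_cost: float,
             years: int = 3) -> float:
    """Net value of adding the dataset over a planning horizon."""
    annual_savings = annual_claims * savings_per_point_of_lift * (lift / 0.01)
    return annual_savings * years - (annual_data_cost * years + one_time_build_cost)

# Example: a $100M block, 0.05% of claims saved per 0.01 of correlation lift, a 0.02 lift,
# $250K/year for the data, and $400K to build and validate the pipeline.
print(data_roi(100_000_000, 0.0005, 0.02, 250_000, 400_000))
# -> -850000.0: in this scenario, the lift does not pay for the data.
```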
A failure to offset the costs or provide an acceptable ROI is not the only potential downside to adding new datasets, even if the new data offer a nominal improvement in model performance. One downside is reduced interpretability: it’s almost a truism that every time you increase the number of features in a model, you make it harder to explain the model’s decisions. This aspect of predictive modeling is already on regulators’ radar and is increasingly a concern for the general public.
Another potential downside is that adding features can affect your model’s fairness. Said another way, if the model’s predictions lead to systematically biased decisions toward certain groups of people, then the model’s fairness is compromised.
Bias in models can be hard to measure. It’s another topic of interest to both regulators and the public. One approach to minimizing bias is to look at each feature in isolation for potential explicit bias. Obviously, the more features your model considers, the more time-consuming this prescreening will be. Furthermore, while features in isolation may not be explicitly biased, their combination with other features may create latent features that are proxies for bias.
Another approach is to run the model and analyze the predictions for bias. Eradicating bias then may involve a time-consuming process of removing one feature at a time and rerunning the model to see whether doing so reduces or eliminates the bias.
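Here is a hypothetical sketch of that leave-one-feature-out check. The gradient-boosting model, the simple demographic-parity gap metric and the 0/1 group encoding are all placeholder assumptions; in practice, you would evaluate on holdout data and use whichever fairness metrics your regulator and use case call for.

```python
# Hypothetical leave-one-feature-out bias check. Model choice, metric, and feature
# names are placeholders, not a prescribed method.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between two groups (0 = parity).
    `group` is a 0/1 numpy array marking membership in the group of interest."""
    return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())

def leave_one_out_bias(X, y, group, feature_names):
    """Retrain without each feature in turn and report how the parity gap changes."""
    gaps = {"<all features>": parity_gap(
        GradientBoostingClassifier().fit(X, y).predict(X), group)}
    for i, name in enumerate(feature_names):
        X_drop = np.delete(X, i, axis=1)              # drop feature i and retrain
        y_pred = GradientBoostingClassifier().fit(X_drop, y).predict(X_drop)
        gaps[f"without {name}"] = parity_gap(y_pred, group)
    return gaps  # features whose removal shrinks the gap are candidate drivers of bias
```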
Thus, while it’s often possible to improve the predictive performance of a model by adding new features and feeding in more data, you should not automatically assume that adding data to an existing model will improve it. Even if adding features improves a predictive model’s performance, the improvement might not justify the cost or effort of adding them. Additional features will almost certainly make it harder for you to explain your model’s output and may add bias to your results, raising the risk of regulatory and reputational problems. That’s why you need a framework to evaluate the ROI in data, time and effort, as well as the potential downsides vis-à-vis interpretability and fairness.
If you already have a rich data source with many features, it may be more fruitful to focus on additional domain-specific feature engineering to extract more value from the existing data. By doing so, you can improve the accuracy and performance of your model rather than simply adding more data that may not provide new insights.
In the end, more data is not always better where predictive models are concerned. It comes down to whether the new data provides truly new information and whether it’s worth the cost.
Statements of fact and opinions expressed herein are those of the individual authors and are not necessarily those of the Society of Actuaries or the respective authors’ employers.
References:
- 1. What Is Predictive Analytics? Google Cloud.
- 2. Train a Supervised Machine Learning Model. OpenClassrooms.
- 3. Big Data. National Association of Insurance Commissioners, October 20, 2023.
- 4. Supra note 1.
- 5. Brownlee, Jason. Model Prediction Accuracy Versus Interpretation in Machine Learning. MachineLearningMastery.com, August 15, 2020.
- 6. Bloch, Louise, and Christoph M. Friedrich. 2021. Data Analysis With Shapley Values for Automatic Subject Selection in Alzheimer’s Disease Data Sets Using Interpretable Machine Learning. Alzheimer’s Research & Therapy 13, no. 155.
Copyright © 2023 by the Society of Actuaries, Chicago, Illinois.