Data mining is statistical deja-vu

Statistical analysis is an integral part of many fields, including sports. Horse racing analysts have been using data mining to predict future outcomes for decades. Data mining is a process that involves searching through large data sets to identify patterns, trends, and relationships. However, as effective as data mining can be, it can also lead to “statistical deja vu.”

Using historical data

Let’s consider a hypothetical scenario to understand what statistical deja vu means in horse racing.

Suppose we have a data set of historical horse races. We can use this data to calculate various statistics such as finishing times, distances, jockey win percentages, how often the horse has won, etc. We can then use these statistics to predict the outcome of future races.

For example, we might see that a horse generally trades at a much lower price in play and has traded 50% lower in five of its last six races.

At first glance, this approach seems reasonable. After all, if we know how well a horse has performed in the past, we should be able to make an informed guess about how well they will perform in the future. However, this approach assumes that the past is a reliable predictor of the future, and that’s where things can get complicated.

In many cases, taking a broad-brush approach just isn’t good enough. For instance, was the horse running on the same ground, at the same course, against the same opponents and over the same distance? If not, the data you are looking at doesn’t contain much information. You are just looking at something that has happened.
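To make that concrete, here is a minimal sketch in Python of checking the “traded 50% lower in five of its last six races” pattern against comparable races only. Everything in it is an assumption for illustration: the `Run` and `Conditions` records, the field names, the one-furlong distance tolerance and the six invented runs are not real data, and opponents are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conditions:
    course: str
    going: str
    distance_furlongs: float


@dataclass(frozen=True)
class Run:
    conditions: Conditions
    start_price: float   # decimal price at the off
    in_play_low: float   # lowest decimal price traded in running


def traded_much_lower(run: Run, drop: float = 0.5) -> bool:
    """True if the in-play low was at least `drop` below the starting price."""
    return run.in_play_low <= run.start_price * (1 - drop)


def comparable(run: Run, today: Conditions, tolerance: float = 1.0) -> bool:
    """Same course and going, and a distance within `tolerance` furlongs."""
    c = run.conditions
    return (
        c.course == today.course
        and c.going == today.going
        and abs(c.distance_furlongs - today.distance_furlongs) <= tolerance
    )


def drop_rate(runs: list[Run]) -> tuple[int, int]:
    """Return (runs that traded much lower, total runs)."""
    return sum(traded_much_lower(r) for r in runs), len(runs)


# Invented history for one hypothetical horse.
history = [
    Run(Conditions("Ascot", "Good", 8.0), 10.0, 4.0),
    Run(Conditions("York", "Soft", 10.0), 8.0, 3.5),
    Run(Conditions("Ascot", "Good", 8.0), 6.0, 5.0),
    Run(Conditions("Newbury", "Good", 7.0), 12.0, 5.0),
    Run(Conditions("York", "Heavy", 12.0), 9.0, 4.0),
    Run(Conditions("Kempton", "Standard", 8.0), 5.0, 2.5),
]

today = Conditions("Ascot", "Good", 8.0)
similar = [r for r in history if comparable(r, today)]

print("All runs:", drop_rate(history))
print("Comparable runs:", drop_rate(similar))
```

With these invented numbers the headline pattern holds in five of six runs, but only two of those runs took place under conditions like today’s, and only one of them shows the pattern. The broad-brush figure looks strong mostly because it counts races that tell us little about this one.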

However, there are also situations where the past may not predict the future well. For example, if a horse sustains an injury, its performance could decline even if its historical statistics suggest otherwise. A horse’s age, whether it is stepping up or down in trip, and where it is in its career can also matter.

This is where statistical deja vu comes into play. Relying too heavily on historical data to predict future outcomes can lead to a repetition of the past. We assume that what worked in the past will work in the future, even when circumstances have changed.

Overconfidence

One of the risks of statistical deja vu is overconfidence. If we’ve successfully predicted outcomes using historical data, we may assume that we’ll continue to be successful in the future. This can lead us to overlook important factors that could influence future outcomes.

For example, let’s say we’re trying to predict the outcome of a horse race. We may analyze historical data to determine which horse has a better finishing time. However, we may overlook other factors such as the track condition, jockey experience, race-specific competition and weather conditions. These factors can significantly influence the outcome of a race, but we may not consider them if we’re too focused on historical data.

However, this doesn’t mean that data mining is useless for predicting the future in horse racing. Data mining can be a powerful tool to identify patterns and relationships that may not be visible otherwise. To make accurate predictions about the future, we must consider various factors and be willing to adjust our approach as circumstances change.

It’s important to remember that historical data is just one piece of the puzzle.

Conclusion

When I use data, I use it to model.

So if I’m trying to predict the outcome of a horse race, I will mine data, but only to find the relationship between factors that I can use in my model. Each factor makes up a percentage of that model. Trainer form is a certain percentage, as are current form, going preference and many other factors.

In football, for example, I’ve identified 19 factors that describe 95% of the difference between the two teams. Each one is weighted according to what appears in various bits of data. When I have all that, I can use some maths to project forward and give myself a percentage. Of course, I’ll look at historical data, but that’s only part of the model and is carefully weighted.
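As a rough illustration of what a weighted model like this might look like, here is a minimal Python sketch. It is not my actual model: the three factor names come from the horse racing example above, but the weights, the 0-to-1 ratings, the horses and the simple weighted-sum-then-normalise maths are all assumptions made up for the example.

```python
# Hypothetical weights: the share of the model given to each factor.
# They are invented for illustration and sum to 1.
WEIGHTS = {
    "trainer_form": 0.20,
    "current_form": 0.50,
    "going_preference": 0.30,
}


def score(ratings: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of a runner's factor ratings (each rating assumed 0 to 1)."""
    return sum(weight * ratings[name] for name, weight in weights.items())


def win_percentages(
    field: dict[str, dict[str, float]],
    weights: dict[str, float] = WEIGHTS,
) -> dict[str, float]:
    """Scale each runner's score so the whole field adds up to 100%."""
    scores = {horse: score(ratings, weights) for horse, ratings in field.items()}
    total = sum(scores.values())
    return {horse: 100 * s / total for horse, s in scores.items()}


def fair_decimal_odds(percentage: float) -> float:
    """Decimal price implied by a win percentage, with no margin added."""
    return 100 / percentage


# Invented ratings for three hypothetical runners.
field = {
    "Horse A": {"trainer_form": 0.6, "current_form": 0.8, "going_preference": 0.5},
    "Horse B": {"trainer_form": 0.5, "current_form": 0.4, "going_preference": 0.9},
    "Horse C": {"trainer_form": 0.3, "current_form": 0.3, "going_preference": 0.4},
}

percentages = win_percentages(field)
for horse, pct in percentages.items():
    print(f"{horse}: {pct:.1f}% (fair price {fair_decimal_odds(pct):.2f})")
```

The point of the sketch is the shape rather than the numbers: historical data feeds the ratings, but each factor only counts for its weighted share, and the output is a percentage that can be compared with the price on offer.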

So data is helpful, but only when used correctly. My experience of data services and advice is that the data is generally not used correctly; otherwise they would generate percentages or prices and assign weighted values to aspects of the data. That’s a flaw, and it guides people down the simple but useless and lazy route.

In conclusion, statistical deja vu is a real danger in data mining. You wouldn’t drive your car forward by looking in the rearview mirror, would you?

When we rely too heavily on historical data to predict the future, we risk repeating the past and overlooking important factors that could impact future outcomes. However, this doesn’t mean data mining is useless for predicting the future. By considering a wide range of factors and being willing to adjust and weight that data, we can use data mining to make accurate predictions and gain insights that might not otherwise be possible.