Predicting sprint performance through data modeling

One of the “Holy Grails” in sport is the ability to predict, with accuracy, whether someone has the potential to become an elite athlete. I’ve covered this in previous articles and papers in terms of genetics, discussing whether we can test for it and how we might think of talent in terms of the ability to respond to training. However, at present, predicting future performance remains very difficult. But we keep trying, and a recent paper in Biology of Sport took a novel approach to predicting sprint performance. The researchers recruited 104 Croatian sprinters and collected a wide variety of data points relating to anthropometric, genetic, and psychological traits to create a rich data set for analysis.

First, the fine print

Before we discuss the findings of the research, there are a few caveats that need to be addressed. In the male sprinters, the average 100m personal best was 10.98 seconds, and, for the females, it was 12.28 seconds. If we’re being honest, these are not elite sprinters. This is important to keep in mind, because it’s likely that the “predictors” of performance differ from one performance level to the next. I’ll touch on this again later on.

Secondly, what the authors did in this study was attempt to develop a model that explained the relationships between various data types and 100m performance. This is not the same as prediction; for a true prediction to occur, you’d need to then test this model on a hold-out data set (athletes whose data played no part in building the model), to see how well it truly predicted sprint performance, an approach I explored in this article.
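
To make the difference between explaining and predicting concrete, here is a minimal sketch of a hold-out check in plain Python. Everything in it is an assumption for illustration: the athletes are synthetic (not the study's data), the single predictor is a made-up standing long jump distance, and the 78/26 split is arbitrary. It only shows the shape of the procedure, which is to fit on one group and score on another.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(xs, ys, intercept, slope):
    """Root-mean-square error of the line's predictions, in seconds."""
    errors = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic athletes (NOT the study's data): standing long jump in metres,
# and a 100m time that loosely improves with jump distance, plus noise.
rng = random.Random(42)
athletes = []
for _ in range(104):
    jump = rng.uniform(2.4, 3.2)
    time_100m = 14.0 - 1.0 * jump + rng.gauss(0, 0.3)
    athletes.append((jump, time_100m))

# Hold out a quarter of the athletes before any fitting happens.
rng.shuffle(athletes)
train, test = athletes[:78], athletes[78:]

# Fit on the training athletes only, then score on both groups.
intercept, slope = fit_line(*zip(*train))
train_rmse = rmse(*zip(*train), intercept, slope)
test_rmse = rmse(*zip(*test), intercept, slope)

print(f"fitted on {len(train)} athletes, tested on {len(test)}")
print(f"in-sample RMSE: {train_rmse:.2f} s")
print(f"hold-out RMSE:  {test_rmse:.2f} s")
```

The in-sample error describes how well the model explains the athletes it was built from; only the hold-out error says anything about athletes it has never seen.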

Thirdly, the authors quantified relationships using correlations from linear regression. To calculate the r-value (or correlation coefficient, to give it its proper name), you essentially plot the data point and 100m personal best time on a graph and draw a straight line of best fit. The sign of the r-value tells you whether that line slopes up or down, and its size tells you how tightly the points cluster around the line, which is the strength of the correlation. An r-value of 0 means no linear relationship between the data points; 1 means a perfect positive relationship (as one goes up or down, so does the other), and -1 means a perfect negative relationship (as one goes up, the other goes down, or vice versa). Of course, in real life, you don’t get correlations of 1 or -1, and, in general, a strong correlation is one where r is around 0.7 or above (or -0.7 or below). Naturally, you don’t need reminding that correlation doesn’t equal causation, but it might be worth pointing out that correlation coefficients also don’t tell us the direction of the relationship, that is, which variable is influencing which.
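
For readers who want to see what an r-value actually measures, the sketch below computes it by hand alongside the slope of the line of best fit. The numbers are invented for illustration (a hypothetical power clean 1RM in kilograms against 100m time in seconds) and are not taken from the paper. The point is that two data sets can share the same r with very different slopes, because r reflects how tightly the points hug the line rather than how steep the line is.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def best_fit_slope(xs, ys):
    """Slope of the least-squares line of best fit (y units per x unit)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: power clean 1RM (kg) against 100m time (s).
clean_kg = [60, 70, 80, 90, 100]
steep_times = [12.0, 11.8, 11.6, 11.4, 11.2]         # perfectly on a steep line
shallow_times = [11.64, 11.62, 11.60, 11.58, 11.56]  # perfectly on a shallow line
noisy_times = [12.0, 11.5, 11.9, 11.3, 11.4]         # same trend, more scatter

for label, times in [("steep", steep_times),
                     ("shallow", shallow_times),
                     ("noisy", noisy_times)]:
    r = pearson_r(clean_kg, times)
    slope = best_fit_slope(clean_kg, times)
    print(f"{label:8s} r = {r:+.2f}, slope = {slope:+.4f} s per kg")
```

The steep and shallow sets both give a perfect negative correlation despite a tenfold difference in slope, while the noisy set has a slope in between but a noticeably weaker r.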

Finally, linear regression, as the name suggests, explores linear relationships. Not all relationships are linear. For example, there is a relationship between sprinting and bodyweight: very light athletes and very heavy athletes tend to run slower. But if you plot out the relationship in a linear manner, the line of best fit comes out close to flat and the correlation close to zero, since the light and heavy athletes essentially cancel each other out. This shows the limitations of linear models in working with the many non-linear situations in the real world.
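
The bodyweight example can be sketched in a few lines. The figures below are invented: I am assuming, purely for illustration, a U-shaped curve with an optimum bodyweight of 80 kg and a symmetric spread of athletes either side. With that setup the linear correlation is essentially zero even though bodyweight completely determines sprint time, and the relationship only shows up once you correlate against the squared distance from the optimum.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical U-shaped relationship: athletes far from an assumed
# optimum bodyweight of 80 kg, in either direction, run slower.
OPTIMUM_KG = 80
weights_kg = [60, 65, 70, 75, 80, 85, 90, 95, 100]
times_s = [10.8 + 0.001 * (w - OPTIMUM_KG) ** 2 for w in weights_kg]

# A straight-line analysis sees (almost exactly) nothing.
r_linear = pearson_r(weights_kg, times_s)

# Correlating against squared distance from the optimum reveals the curve.
squared_distance = [(w - OPTIMUM_KG) ** 2 for w in weights_kg]
r_curved = pearson_r(squared_distance, times_s)

print(f"r, bodyweight vs time:            {r_linear:+.3f}")
print(f"r, squared distance from optimum: {r_curved:+.3f}")
```

In real data the optimum is not known in advance, which is why a non-linear method (a quadratic fit, for instance) would be needed rather than this hand-picked transformation.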

What we can learn

Those caveats aside, I really like this study, not necessarily because of what it found, but because of its approach to the problem. The authors have taken quite a rich data set and attempted to explore relationships, and they should be congratulated for that.

What did they find? For both men and women, there was no relationship between sprint times and genotype for two genes, ACTN3 and ACE. These two genes are the most well-researched genes in terms of sprint performance, with ACTN3 often labelled as the speed gene. Elite sprinters are far more likely to have the RR or RX version of this gene, with very few elite sprinters having the XX genotype. So, for there to be no relationship between these genes and sprint performance might seem surprising, but, if we think about it, it isn’t. The overwhelming majority of genetic variants have only a very small impact in and of themselves. In “normal” people, this impact is lost in noise. However, in extreme outliers, such as elite 100m runners (i.e. those that can run under 10 seconds), who have long training histories and broadly similar training programs, the noise is much less. As a result, the effect of the given genetic variant can be more pronounced. So, for 11-second 100m runners, ACTN3 likely has no real measurable effect, but in elite 100m runners, it might make a difference of a couple of hundredths of a second.
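
The “lost in noise” argument can be put into rough numbers. The sketch below is a deliberately simplified model and every figure is an assumption: the variant is treated as a simple carrier/non-carrier switch worth 0.02 seconds (the “couple of hundredths” above), half of the athletes are assumed to carry it, and the two noise levels (0.5 s for a mixed group, 0.05 s for a tightly matched elite group) are illustrative guesses rather than measured values. Under that model, the correlation you would expect to observe follows directly from the ratio of signal to noise.

```python
import math


def expected_r(effect_s, carrier_freq, noise_sd_s):
    """Expected correlation between carrying a variant (0 or 1) and 100m time.

    Simplified model: time = baseline - effect_s * carrier + noise, where
    the noise has standard deviation noise_sd_s and is unrelated to genotype.
    """
    # Standard deviation of a 0/1 variable with the given frequency.
    genotype_sd = math.sqrt(carrier_freq * (1 - carrier_freq))
    signal_sd = effect_s * genotype_sd
    # Negative sign: carriers are assumed to run slightly faster (lower times).
    return -signal_sd / math.sqrt(signal_sd ** 2 + noise_sd_s ** 2)


EFFECT_S = 0.02      # assumed benefit of the variant, in seconds
CARRIER_FREQ = 0.5   # assumed share of athletes carrying it

# Assumed spread in 100m time from everything other than the variant.
r_mixed = expected_r(EFFECT_S, CARRIER_FREQ, noise_sd_s=0.5)
r_elite = expected_r(EFFECT_S, CARRIER_FREQ, noise_sd_s=0.05)

print(f"expected r in a mixed-level group: {r_mixed:+.3f}")
print(f"expected r in a homogeneous group: {r_elite:+.3f}")
```

The same 0.02-second effect produces a correlation that is practically invisible in the noisy group and roughly ten times larger in the homogeneous one, which is the logic behind expecting a gene like ACTN3 to matter only at the very top.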

In terms of the other data points, there are some interesting findings. For both genders, there weren’t any strong correlations between any variable and 100m performance. For men, the strongest correlation was with bicristal diameter (pelvic width), with a wider pelvis weakly correlated with faster sprint times. Faster men also tended to (although, again, with only a weak correlation) have shorter legs and feet, and higher feet. Power clean 1RM and standing long jump were moderately negatively correlated with sprint times (i.e. a better clean or a longer jump meant faster times). For females, the faster sprinters tended to have shorter legs and feet, were more likely to report lower levels of state anxiety, and tended to be more self-confident. Like the men, females who had better power clean 1RM and standing long jump scores tended to be faster.

These results are somewhat confusing; the fastest males had wider hips, and yet we know that one of the reasons males are faster than females is that they have narrower hips, so, in general, we would expect narrower hips to be associated with better performance. Similarly, we know that perhaps the main differentiator between elite and non-elite sprinters is their stride length. Whilst not a perfect relationship, in general we’d expect to see that longer legs equate to a longer stride, and yet this study found that the better sprinters had shorter legs. The reasonably good correlations between strength/power and 100m time do suggest that strength and power are decent determinants of sprint performance at this level, but it would be interesting to see if and how this changes in higher-level sprinters; is there a general threshold above which further improvements in strength and power don’t necessarily correlate with improvements in sprint performance?

The next step in predicting performance

In summary, I like the idea of this study. I think that it takes a novel approach to attempting to understand sprint performance, and we get a broad idea that anthropometrics are important (which, given that anthropometrics are partially genetically determined, suggests that genetics do matter), as are strength, power, and some psychological aspects. It tells us that, at present, genetic information likely doesn’t assist in the identification of high-level athletes. But, as I mentioned earlier, there are a number of caveats to keep in mind.

Future research will hopefully build on this study by recruiting truly elite sprinters (which is hard, as there aren’t many of them), and employing non-linear analysis methods to better understand the relationships between variables and performance. Then, we will start to get a better understanding of the true determinants of elite sprint performance, which might assist in talent identification processes in the future.