Key questions in data science for sports

Over the last decade or so, there has been a Big Data revolution. This is true of everyday life (a good example is how Cambridge Analytica collected, both legally and potentially illegally, the data of Facebook users to target campaign advertisements), and it is also true of sport, where, thanks in part to advances in technology, a vast amount of data is now available to teams.


Even more recently, teams have started to wake up to the potential power of this data, and have begun to bring on board individuals with expertise in data science so that they can better collect, handle, and analyze it. These data scientists often come from outside sport; for example, Arsenal, a football club based in the UK, hired a data scientist last year who had previously worked on the launch of Candy Crush, the popular video game. Further examples include the Irish RFU, who recently advertised for a data scientist to assist in the analysis of GPS data in order to reduce their risk of injuries, and the New South Wales Institute of Sport (NSWIS), who employ a data science team that analyzes their own data and could potentially build future prediction models.

I’m becoming increasingly interested in how this field is developing, particularly because I see huge potential in the area of prediction modeling and future forecasting. If, for example, you can utilize data to predict the workload of a training session (such as in this study), then you can modify the training session parameters to achieve the specific goal you’re after, before the training itself takes place. Other research has suggested promise in the use of data science techniques, including machine learning, in the accurate quantification of fatigue, prediction of muscle fiber type, and the prediction of injuries and post-injury recovery time. Recently, Mladen Jovanovic and Ivan Jukic used data science techniques to simulate optimal training methods, which could have important implications for training program design. And this is just the tip of the iceberg.
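To make the idea of predicting a session’s workload before it happens a little more concrete, here is a minimal sketch. It is not the method used in any of the studies mentioned above; it simply fits a straight line (ordinary least squares) to past sessions and then uses that line in both directions. The session data, the single predictor (planned duration), and the function names are all hypothetical and chosen purely for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict_load(model, planned_minutes):
    """Predicted session load for a planned session duration."""
    intercept, slope = model
    return intercept + slope * planned_minutes


def minutes_for_target(model, target_load):
    """Invert the fitted line: duration expected to produce a target load."""
    intercept, slope = model
    return (target_load - intercept) / slope


# Hypothetical past sessions: planned duration (minutes) and the
# load that was actually recorded (arbitrary units).
past_minutes = [40, 50, 60, 70, 80, 90]
past_load = [250, 310, 380, 430, 500, 560]

model = fit_line(past_minutes, past_load)

print("Predicted load for a 75-minute session:", round(predict_load(model, 75)))
print("Minutes needed for a load of 400:", round(minutes_for_target(model, 400), 1))
```

In practice a single predictor is almost certainly too simple, and the published work in this area uses richer inputs and models; the point is only that once you have a fitted model, you can adjust the session plan until the predicted load matches your goal.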

What questions can data help answer?

Clearly, this is an area of huge promise. Recently, I listened to a podcast with Sam Robertson, one of the leaders in the field of data science in sport. Robertson holds a dual role, both as an Associate Professor at Victoria University, and as Head of Research and Innovation at the Western Bulldogs, an AFL team. Although a traditionally trained sports scientist/coach, towards the back end of his PhD he saw that the use of data in sport held huge promise, and so upskilled himself to be able to take advantage of this.

In the podcast, Robertson identified three key questions where he sees data science as a potential game changer for enhancing performance:

  • How do we prevent injury?
  • How do we optimize a game schedule and periodization to balance injury and performance?
  • How do we quantify and understand a given player’s performance within a team, and then enhance this?

This last point is especially interesting for team sports; here, players interact with one another tactically, but are also influenced by the opposition’s tactics, by physical and psychological factors such as fatigue and pressure, and by the weather. By collecting quality data over time, data scientists should be able to model various outcomes, so that a football manager who knows the opposing team’s tactics, the physical and psychological condition of his own players, and the forecast weather can make a better decision about which tactics to use (for a good open access review on this, check out this paper). Furthermore, data scientists can analyze match play in real time and give the manager constant feedback on which decisions might be worth making. Finally, by utilizing predictive modeling of match play, it should be possible to design more realistic training sessions, better preparing the athletes for upcoming matches.

Subjective vs. objective data

Another area raised by Robertson in the podcast is that of comparing human, subjective data with more objective data. This is potentially important in sports where there is a centralized draft system, such as the AFL in Robertson’s case, but also the big four US sports. Are the subjective decisions made by the team’s scouts backed by the objective data? And if not, why not? If it turns out that scouting decisions are affected by aspects such as mood or fatigue, then clearly there is a good case for including the objective data in whatever model is produced and/or utilized; conversely, if the talent scouts perform better than the data-driven model (which would likely only happen due to poor quality data), then greater emphasis can be placed on the scouts’ decisions.

Again, research in this area is only just emerging, but it is a fast-growing field. This can also come into play from the perspective of talent identification. In the podcast, Robertson gives the example of two promising players who appear very similar, where the team can only afford to offer a contract to one; can data science provide an avenue for identifying future talented performers, and, related to the previous point, does this outperform subjective data from scouts and coaching staff?
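One simple way to start asking whether scouts or a model do better is to score both against what actually happened later. The sketch below is only an illustration under strong assumptions: the ratings, the 0 to 10 scale, and the idea of a single “later outcome” number per player are all hypothetical, and mean absolute error is just one of many possible comparison metrics, not something taken from the podcast.

```python
from statistics import mean


def mean_absolute_error(predicted, actual):
    """Average size of the gap between a rating and the later outcome."""
    return mean([abs(p - a) for p, a in zip(predicted, actual)])


# Hypothetical data for five drafted players, all on a 0-10 scale.
scout_rating = [8, 6, 7, 5, 9]    # subjective rating at draft time
model_rating = [7, 8, 6, 4, 8]    # model's rating at draft time
later_outcome = [7, 8, 6, 4, 7]   # performance rating a few seasons on

scout_error = mean_absolute_error(scout_rating, later_outcome)
model_error = mean_absolute_error(model_rating, later_outcome)
better = "model" if model_error < scout_error else "scouts"

# Players where scouts and model disagreed by 2 or more points are the
# interesting cases to review: why did the two views differ?
disagreements = [
    i for i, (s, m) in enumerate(zip(scout_rating, model_rating))
    if abs(s - m) >= 2
]

print("Scout error:", scout_error)
print("Model error:", model_error)
print("Closer to the later outcome on this sample:", better)
print("Player indices worth reviewing:", disagreements)
```

With a handful of players, a comparison like this says very little; it would need many draft classes, and a defensible definition of “outcome”, before it could shift how much weight is given to either source.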

End-to-end data processes

Robertson also outlined the end-to-end process of data science in the podcast. It is important to understand the whole process, because then you can start to see the key areas teams must focus on:

  1. Data capture – collecting the data that you are going to analyze. It’s important to ensure that the data actually measures what you claim it does (i.e. it is valid), and that it remains stable across time (i.e. it is reliable). If data is not valid (e.g. you’re using inaccurate sensors), or not reliable (e.g. the wellness scores provided by a player vary significantly, whilst his actual wellness does not), then the data is of too poor quality to use in your analysis and modeling; you will get conflicting results. As such, one role of the sporting data scientist is likely to be conducting internal validity and reliability studies on all technologies used by teams and coaches.
  2. Data management and storage – important aspects here include the speed at which raw data can be transformed into analyzable data (which often involves data cleaning), and how quickly this can then be analyzed in order to inform decisions. This becomes harder the more data that is collected, providing a potential argument for sports teams and bodies to collect less data overall, but ensure that the data they do collect is of the highest quality.
  3. Data analysis – what patterns and conclusions can we pull from the data? This stage can range from simpler analysis, such as regression, to more complex machine learning models such as neural or Bayesian networks.
  4. Implementation – this is the most important step; making sure your data informs decisions. This is where the human element gets introduced; can you report your data in a way that is easily understandable and actionable by the coaching staff, and ensure buy-in from the players and athletes? Data visualization also becomes very important here – if you’re interested in designing better visualizations in a sporting context, here is a good paper on that.
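The first two steps above can be sketched in a few lines. This is a minimal illustration, not Robertson’s process: it cleans a set of hypothetical test-retest measurements (two jump-height trials per athlete, in cm) and then estimates reliability using one common approach, the typical error, calculated as the standard deviation of the trial-to-trial differences divided by the square root of two, also expressed as a percentage of the mean. The data, the plausibility limits, and the function names are all assumptions made for the example.

```python
from math import sqrt
from statistics import mean, stdev


def clean(pairs, low, high):
    """Keep only pairs where both trials are present and plausible."""
    return [
        (a, b) for a, b in pairs
        if a is not None and b is not None
        and low <= a <= high and low <= b <= high
    ]


def typical_error(pairs):
    """Typical error: SD of the trial differences divided by sqrt(2)."""
    diffs = [b - a for a, b in pairs]
    return stdev(diffs) / sqrt(2)


def cv_percent(pairs):
    """Typical error as a percentage of the overall mean."""
    grand_mean = mean([value for pair in pairs for value in pair])
    return 100 * typical_error(pairs) / grand_mean


# Hypothetical raw data: (trial 1, trial 2) jump heights in cm.
raw = [
    (35.0, 36.0),
    (40.0, 39.0),
    (None, 42.0),    # missing trial
    (38.0, 38.5),
    (45.0, 44.0),
    (390.0, 39.0),   # implausible value, e.g. a sensor or entry glitch
    (42.0, 43.0),
]

# The 10-80 cm plausibility window is an assumption for this example.
cleaned = clean(raw, low=10, high=80)

print("Pairs kept after cleaning:", len(cleaned))
print("Typical error (cm):", round(typical_error(cleaned), 2))
print("Typical error as % of mean:", round(cv_percent(cleaned), 2))
```

What counts as “reliable enough” depends on the size of the changes you hope to detect; if the typical error is as large as the improvement you are looking for, the measure cannot tell you whether that improvement is real.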

Moving forward in data science

Overall, I’m curious as to how this area will develop. One potential issue I see is that of ethics and privacy; for example, who owns the data that is collected, and what happens if a player refuses the collection of a certain type of data (such as sleep data)? What if the algorithms that we develop are wrong, and incorrectly identify someone as lacking the ability to be a future elite performer?

We also cannot ignore the human factors in data science. Cathy O’Neil’s Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy is an interesting introduction to some issues with Big Data. Nevertheless, given the likely benefits of data science within sport, it seems likely that such an approach is here to stay, and will be refined over time. Again, this represents another developmental challenge to today’s coach, and some level of upskilling will be required, but the potential benefits are huge.