How we grouped 1,602 tennis players to find out who is most similar to Stefanos Tsitsipas and to predict his career path over the next decade, by applying simple machine learning techniques and statistical methods.
Which players the 22-year-old international is most similar to, and what the predictions are for his career path over the next decade
Primary data was collected in March 2020 from the Ultimate Tennis Statistics website, which is licensed under a Creative Commons license (CC BY-NC-SA 4.0) and is based on open source software available on GitHub.
More specifically, we collected annual data for the period 2000 to March 2020 from the Association of Tennis Professionals (ATP) world rankings, as published on the Ultimate Tennis Statistics website. We then proceeded to collect profile data published on the same website, regarding 3,912 tennis players included in these rankings.
For example, as Stefanos Tsitsipas’ stats profile on the website shows, each athlete’s “profile data” includes, among other things, the player’s age, the year he turned professional, the seasons he has played, his “backhand”, his favorite surface, the amount of prize money he has earned, the titles he has won, the best rank he has achieved, as well as his current rank, his Elo rank, etc.
The aim was to estimate the progress of a specific player (Stefanos Tsitsipas), using data on the progress of “similar” players. We used the available data to create a single dataset with a set of variables for each player, which we could analyze statistically in order to group athletes by their degree of similarity to Tsitsipas and make predictions about his potential career path.
The constructed dataset can be described as a matrix with p columns and n rows. Each row corresponds to a player and the columns contain the values of the variables that we aim to utilize in order to obtain the desired predictions.
Before conducting any statistical analysis, we computed the proportion of missing values in each of the dataset’s rows and columns. We found that in several cases missing values accounted for more than 50% of a row’s or column’s contents. We therefore deleted those rows and columns, since the information they would provide would be negligible. The resulting dataset consisted of 1,602 rows (players) and 20 variables (player names and characteristics). The missing values that remained in this dataset after the deletion were imputed using machine learning techniques and the statistical programming language R.
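The filtering step can be sketched as follows. This is a minimal Python illustration with NumPy, not the authors’ R code: the function name `drop_sparse`, the toy matrix, and the choice to evaluate rows and columns at the same time on the original matrix are all assumptions made for the example.

```python
import numpy as np

def drop_sparse(data, threshold=0.5):
    """Drop rows and columns whose share of missing values (NaN) exceeds `threshold`."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Proportion of missing cells per row and per column of the original matrix.
    keep_rows = missing.mean(axis=1) <= threshold
    keep_cols = missing.mean(axis=0) <= threshold
    return data[np.ix_(keep_rows, keep_cols)], keep_rows, keep_cols

nan = np.nan
# Hypothetical 4 x 4 dataset: row 1 and column 2 are more than 50% missing.
toy = np.array([
    [1.0, 2.0, nan, 4.0],
    [nan, nan, nan, 8.0],
    [9.0, 1.0, nan, 3.0],
    [5.0, nan, 7.0, 6.0],
])
cleaned, keep_rows, keep_cols = drop_sparse(toy)
print(cleaned.shape)
```

A column that is exactly 50% missing is kept here, following the “more than 50%” wording; the one missing value left in `cleaned` would be handled by the imputation step.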
Dataset sample to be analyzed
|Guillermo Garcia Lopez||470796.06||23||0.46||0.42||0.5||0.48||501.0||132.0||115.0||9.0||18.0||36.0||3||2||16||0.11764705882352902||0.17647058823529396||0.294117647058824||13.062180285599199|
|Jo Wilfried Tsonga||1379538.44||5||0.68||0.68||0.64||0.69||399.0||106.0||231.0||8.0||18.0||34.0||3||3||6||1.0625||0.0625||1.125||14.137259537419801|
|Daniel Gimeno Traver||227631.36||48||0.36||0.25||0.42||0.1||619.0||12.0||75.0||9.0||18.0||34.0||3||3||4||12.3354827573315|
|Yen Hsun Lu||282170.35||33||0.42||0.42||0.24||0.44||351.0||2.0||103.0||9.0||17.0||36.0||3||3||6||12.5502662455528|
|Rogerio Dutra Silva||166942.0||63||0.32||0.29||0.36||0.01||138.0||144.0||351.0||14.0||19.0||36.0||3||2||15||12.025401725685198|
|Ruben Ramirez Hidalgo||167582.5||50||0.34||0.19||0.39||0.01||179.0||13.0||61.0||8.0||20.0||42.0||3||3||4||12.029231046304|
|Daniel Munoz De La Nava||138074.38||68||0.24||0.3||0.22||205.0||241.0||276.0||17.0||17.0||38.0||2||3||10||11.835547804446099|
|Adrian Menendez Maceiras||199829.83||111||0.25||0.29||0.17||0.25||298.0||76.0||208.0||10.0||19.0||34.0||3||3||10||12.2052214333519|
|Teodor Dacian Craciun||218||141.0||253.0||51.0||9.0||17.0||39.0||3||3||1|
|Paul Henri Mathieu||370534.88||12||0.48||0.46||0.49||0.42||125.0||114.0||47.0||9.0||17.0||38.0||3||3||3||0.11764705882352902||0.23529411764705901||12.822702862337|
|Marc Fornell Mestres||236||0.01||0.01||94.0||178.0||66.0||38.0||3||1||1|
Handling missing values
In order to impute the remaining missing values of the dataset, the missForest R package was used (see MissForest —non-parametric missing value imputation for mixed-type data). It is a machine learning method based on the Random Forest algorithm, which can briefly be described as a bunch of “decision trees” bundled together.
Random Forest can be explained with the following simple example: suppose we have a sample of people whose gender is unknown to us (missing value) and we wish to predict it from their height and weight, which we already know. We construct a “decision tree” with two “nodes” for this simple working hypothesis. The first node asks a question about each person’s height: if the height is greater than 1.80 m, the person is classified as “man”. Otherwise, we move on to the second node, which asks a question about the person’s weight: if the person weighs less than 70 kg, they are classified as “woman”; otherwise, they are classified as “man”. In more complicated examples, the number of nodes increases, and a large number of decision trees can be combined to make the desired prediction.
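The toy tree above can be written down in a few lines. The sketch below is in Python and is only an illustration: the extra trees, their thresholds and the function names are hypothetical, and the final leaf (70 kg or more classified as “man”) is our reading of the example. In a real Random Forest the thresholds are learned from data rather than chosen by hand.

```python
from collections import Counter

def make_tree(height_cut_m, weight_cut_kg):
    """Build a two-node decision tree with the given (hypothetical) thresholds."""
    def tree(height_m, weight_kg):
        # Node 1: a question about height.
        if height_m > height_cut_m:
            return "man"
        # Node 2: a question about weight.
        if weight_kg < weight_cut_kg:
            return "woman"
        return "man"  # assumption: this leaf is not spelled out in the example
    return tree

# The single tree from the example: 1.80 m and 70 kg.
example_tree = make_tree(1.80, 70)

# A tiny "forest": three trees with slightly different, made-up thresholds.
forest = [example_tree, make_tree(1.75, 65), make_tree(1.85, 75)]

def forest_predict(trees, height_m, weight_kg):
    """Majority vote over the individual trees' answers."""
    votes = Counter(tree(height_m, weight_kg) for tree in trees)
    return votes.most_common(1)[0][0]

single = example_tree(1.78, 68)
voted = forest_predict(forest, 1.78, 68)
```

With three trees and two classes a tie cannot occur, which keeps the vote well defined in this small example.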
In the case of missing values, the algorithm is trained on the available data and produces estimated values, which replace the missing ones. In this dataset, on which the statistical analysis for this story is based, we have included a common list of characteristics for each player (ranking positions, prize money, “backhand”, favorite surface, age turned professional, etc.). Each characteristic is a variable, i.e. a column in the dataset.
Let us assume we lack data on a player’s prize money. We have:
- missing values for prize money received by player a
- available values for other characteristics of player a
- available values for some of the characteristics on the list, in the case of other players
- missing values for other characteristics on the list, in the case of other players
We sort the players’ characteristics (variables) in ascending order of the number of missing values they contain, starting with the most complete variable and finishing with the most incomplete one.
Based on this ordering, the missing values of each player x in each of his characteristics (variables) are imputed using the Random Forest algorithm, which draws on: a) the available values of player x’s other characteristics and b) the available values of other players for the characteristic in which player x has a missing value. The same process is followed for each variable containing missing values and is repeated several times, until a stopping criterion is met.
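The loop described above can be sketched in Python with NumPy. This is not the missForest implementation: to keep the example short and dependency-free, ordinary least squares stands in for the Random Forest, the stopping rule is a simple tolerance on the change between passes, and the function name `iterative_impute` and the toy data are hypothetical.

```python
import numpy as np

def iterative_impute(data, max_iter=10, tol=1e-6):
    """Variable-by-variable imputation loop; least squares stands in for Random Forest."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = data.copy()
    # Initial guess: replace each missing cell with its column's observed mean.
    col_means = np.nanmean(data, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    # Visit variables from the fewest to the most missing values.
    counts = missing.sum(axis=0)
    order = [j for j in np.argsort(counts, kind="stable") if counts[j] > 0]
    for _ in range(max_iter):
        previous = filled.copy()
        for j in order:
            observed, to_fill = ~missing[:, j], missing[:, j]
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where variable j is observed, predict where it is missing.
            coef, *_ = np.linalg.lstsq(design[observed], filled[observed, j], rcond=None)
            filled[to_fill, j] = design[to_fill] @ coef
        # Simplified stopping rule (assumption): stop once imputed values settle.
        if np.sum((filled[missing] - previous[missing]) ** 2) < tol:
            break
    return filled

# Hypothetical data in which the third column equals 2 * first column + 1.
toy = np.array([
    [1.0, 5.0, 3.0],
    [2.0, 3.0, 5.0],
    [3.0, 8.0, 7.0],
    [4.0, 1.0, np.nan],
    [5.0, 4.0, 11.0],
])
imputed = iterative_impute(toy)
```

Because the toy relationship is exactly linear, the stand-in model recovers the hidden value; with real, mixed-type player data a tree-based learner such as the one the article describes is the more natural choice.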
Finding similarities between players
After the missing values had been replaced, the dataset of 1,602 rows (one row per player) and 20 columns (one column per player characteristic) was complete. For each possible pair of players, a similarity index was calculated using the “radial basis function” (RBF kernel). This similarity index is a positive number that tells us how similar a player is to any other athlete in the dataset.
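An RBF similarity between two players x and y has the form exp(-gamma * ||x - y||^2). The Python sketch below computes it for every pair at once. The article does not state how the variables were scaled or which bandwidth was used, so standardizing each column and setting gamma to one over the number of variables are assumptions, as are the function name and the toy figures.

```python
import numpy as np

def rbf_similarity(features, gamma=None):
    """Pairwise RBF similarities exp(-gamma * squared Euclidean distance)."""
    features = np.asarray(features, dtype=float)
    # Assumption: put all variables on a comparable scale first.
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    if gamma is None:
        gamma = 1.0 / z.shape[1]  # assumption: bandwidth not given in the article
    sq_norms = (z ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (z @ z.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative rounding errors
    return np.exp(-gamma * sq_dists)

# Hypothetical players described by two variables (best rank, share of matches won).
players = np.array([
    [10.0, 0.60],
    [12.0, 0.62],
    [200.0, 0.20],
    [190.0, 0.25],
])
similarity = rbf_similarity(players)
```

Each entry lies between 0 and 1, with 1 on the diagonal (a player compared with himself), and the matrix is symmetric, which matches the description of one value per pair of players.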
Subsequently, a “similarity matrix” is created, i.e. a matrix that holds the similarity index for every pair of players. It is a 1,602 x 1,602 table whose rows and columns correspond to the study’s players, thus forming all possible pairs of players: each cell of the similarity matrix contains the similarity index between the two tennis players it refers to.
Next, in order to categorize players who are similar to each other into groups, we adopt cluster analysis and, in particular, the technique of spectral clustering: the similarity matrix is quite large (1,602 x 1,602), so it is easier to work with a matrix of smaller dimensions that retains as much information as possible from the original similarity matrix.
We follow these steps:
- Transformation of the original similarity matrix into a normalized Laplacian matrix
- Principal component analysis (spectral decomposition) of the Laplacian matrix
- Grouping of the rows (players) of the original similarity matrix by applying the k-means algorithm to the result of the spectral decomposition
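The three steps can be sketched in Python with NumPy as follows. This is an illustration under assumptions, not the authors’ code: the symmetric normalization I - D^(-1/2) S D^(-1/2), the row normalization of the embedding, the deterministic k-means initialization, the function names and the toy similarity matrix are all choices made for the example.

```python
import numpy as np

def normalized_laplacian(similarity):
    """Step 1: L = I - D^(-1/2) S D^(-1/2), where D holds the row sums of S."""
    inv_sqrt = 1.0 / np.sqrt(similarity.sum(axis=1))
    scaled = inv_sqrt[:, None] * similarity * inv_sqrt[None, :]
    return np.eye(len(similarity)) - scaled

def spectral_embedding(similarity, k):
    """Step 2: keep the eigenvectors of the k smallest eigenvalues of L."""
    _, eigenvectors = np.linalg.eigh(normalized_laplacian(similarity))
    embedding = eigenvectors[:, :k]  # eigh returns eigenvalues in ascending order
    # Assumption: rescale each row to unit length before clustering.
    return embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

def kmeans(points, k, n_iter=100):
    """Step 3: plain k-means with a deterministic farthest-point start."""
    centroids = [points[0]]
    for _ in range(1, k):
        nearest = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(nearest)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        updated = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Hypothetical similarity matrix with two obvious groups of three players each.
toy_similarity = np.array([
    [1.00, 0.90, 0.80, 0.05, 0.02, 0.01],
    [0.90, 1.00, 0.85, 0.03, 0.04, 0.02],
    [0.80, 0.85, 1.00, 0.02, 0.01, 0.05],
    [0.05, 0.03, 0.02, 1.00, 0.90, 0.80],
    [0.02, 0.04, 0.01, 0.90, 1.00, 0.85],
    [0.01, 0.02, 0.05, 0.80, 0.85, 1.00],
])
embedding = spectral_embedding(toy_similarity, k=2)
labels, centroids = kmeans(embedding, k=2)
```

The embedding has one row per player but only k columns, which is the “matrix of smaller dimensions” that k-means then works on.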
We end up with 150 clusters of similar players and we choose the one Stefanos Tsitsipas belongs to, in order to work on predictions for the player’s ranks.
In fact, the cluster to which Tsitsipas belongs includes players with at least 15 years of experience, whereas his own professional career spans only five years so far. This is crucial for the prediction method, precisely because our prediction technique is based on the existing career paths of similar players.
Considering that the research question is “what world ranking positions Stefanos Tsitsipas might occupy in the future” and given that he has been a professional player for the last five years, we created a prediction model over a ten-year horizon by applying the statistical technique of the sample mean.
We predict the player’s performance over the next ten years, based on the average performance of similar players, who, however, have longer career spans.
In other words, we use the actual ranking data of the players Tsitsipas is most comparable with, as recorded ten and 15 years into their careers, to create estimates of Tsitsipas’ performance over the next decade.
To keep the sample mean (that is, our prediction) from being affected by outliers, and even though the Tsitsipas cluster consists of only ten members (himself and nine more players), we look for sub-groups of similar athletes within the cluster Stefanos Tsitsipas was assigned to. By computing the Euclidean distance of each player from the centroid of the cluster (as calculated by the k-means algorithm itself), we find:
- sub-group a, which is located farther from the centroid of the cluster: these are players who tend to rank slightly better than the average position held by the cluster’s players in each season. Stefanos Tsitsipas also appears in sub-group a.
- sub-group b, which is close to the centroid of the cluster: these are players whose world ranking positions are close to the average position held by all players of the cluster in a given season.
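The distance computation behind the two sub-groups can be sketched in Python with NumPy. The article does not say where the line between “farther” and “close” was drawn, so splitting at the median distance is an assumption, as are the function name and the toy coordinates.

```python
import numpy as np

def split_by_centroid_distance(points):
    """Euclidean distance of each cluster member from the cluster centroid."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)  # a k-means centroid is the mean of its members
    distances = np.linalg.norm(points - centroid, axis=1)
    # Assumption: members farther than the median distance form the "far" sub-group.
    is_far = distances > np.median(distances)
    return distances, is_far

# Hypothetical coordinates of six players belonging to one cluster.
cluster_points = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [4.0, 4.0],
    [5.0, 5.0],
])
distances, is_far = split_by_centroid_distance(cluster_points)
```

Here `is_far` marks the members that would fall into the sub-group farther from the centroid, and the remaining members form the sub-group close to it.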
In this context, players belonging to the Tsitsipas cluster are selected from these two different sub-groups for test applications of the prediction model: by using data on the first five years of their career, we make “pseudo predictions” for their career path over the next years, which we compare to the actual data on their world ranking. In this way, the margin of error of our predictions emerges.
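A “pseudo prediction” of this kind can be sketched in Python with NumPy: the held-out player’s later seasons are forecast as the mean of the other cluster members’ ranks and then compared with what he actually achieved. The player labels and rankings below are invented for illustration, and the absolute error is only one possible way of expressing the margin of error.

```python
import numpy as np

def pseudo_prediction(cluster_ranks, test_player):
    """Forecast a held-out player's ranks as the mean of his peers' ranks per season."""
    peers = [ranks for name, ranks in cluster_ranks.items() if name != test_player]
    forecast = np.nanmean(np.array(peers, dtype=float), axis=0)
    actual = np.array(cluster_ranks[test_player], dtype=float)
    errors = np.abs(forecast - actual)  # assumption: absolute error per season
    return forecast, errors

# Hypothetical year-end ranks for three consecutive seasons (made-up numbers).
cluster_ranks = {
    "Peer A": [12, 8, 6],
    "Peer B": [20, 10, 8],
    "Peer C": [16, 12, 4],
    "Test player": [14, 9, 7],
}
forecast, errors = pseudo_prediction(cluster_ranks, "Test player")
mean_abs_error = errors.mean()
```

Leaving the test player out of the average keeps his own results from leaking into the forecast he is judged against.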
More specifically, for the purposes of testing the model, the following are selected:
- Alexander Zverev, who, despite having only turned professional seven years ago, belongs to sub-group a (like Stefanos Tsitsipas). He also has a very similar trajectory to Tsitsipas and is one of his major rivals today.
- Stan Wawrinka, David Ferrer and Tomas Berdych, who have more than 15 years of professional experience and belong to sub-group b
For Zverev (sub-group a), the test is applied using a sample mean computed only from the positions held by the cluster’s four top-ranked players in a given season. For Wawrinka, Ferrer and Berdych (sub-group b), the “pseudo predictions” are made by calculating the sample mean of the data for all cluster players in each season.
As shown in the graph below, the actual ranking of players is usually within our model’s prediction intervals.
For the predictions of Stefanos Tsitsipas’ world ranking over the next decade, we chose the prediction method that was applied in the very similar case of Zverev: when the sample mean is calculated, players ranked below the cluster median are excluded, so that in each season the model takes into account only the positions held by the top-performing players.
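The effect of excluding the lower-ranked players can be sketched in Python with NumPy. The rankings are invented, the function name is hypothetical, and treating “at or better than the season’s median rank” as the cut-off is our reading of the description.

```python
import numpy as np

def top_half_mean(peer_ranks):
    """Per-season mean rank of the players at or better than the season's median."""
    ranks = np.asarray(peer_ranks, dtype=float)
    median = np.median(ranks, axis=0)
    # A smaller number is a better rank, so keep ranks at or below the median.
    kept = np.where(ranks <= median, ranks, np.nan)
    return np.nanmean(kept, axis=0)

# Hypothetical ranks: rows are cluster players, columns are two future seasons.
peer_ranks = np.array([
    [5.0, 4.0],
    [10.0, 8.0],
    [20.0, 30.0],
    [40.0, 50.0],
])
plain_forecast = peer_ranks.mean(axis=0)
top_forecast = top_half_mean(peer_ranks)
```

In this toy case the plain mean predicts ranks of about 19 and 23, while the top-half mean predicts 7.5 and 6, which shows how strongly the choice of reference players moves the forecast.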
Taking a closer look at Zverev’s case and the extent to which the chosen method of calculating the predictions affects the estimated trajectory of Tsitsipas, it is worth noting the following:
Although Zverev finished his sixth professional year ranked No. 4, the model had predicted that he would finish 11th (mean prediction) in that season, which is the first season of prediction. In the following seasons, by contrast, the prediction model is “aligned” with Zverev’s actual achievements. The “failure” of the first prediction can be explained if one takes into account that Zverev’s trajectory is similar to that of the other players in the group only after the third season of his career. The rapid ascent in the world rankings that characterized both Zverev and Tsitsipas in the first years of their careers is not observed in the otherwise similar athletes whose data “feeds” the model and affects the estimated value for the first prediction season.
It should be noted that a multivariate modelling strategy would be useful for making even more realistic predictions, which is why we suggest applying this strategy in future studies: in tennis, an athlete’s trajectory depends significantly on his opponents’ progress. In the present model, the working hypothesis is that a player (p) will follow a trajectory similar to that of his peers (n), but it is not assumed that his opponents will perform similarly to the opponents of the peers (n) considered here.
We chose to visualize the Tsitsipas cluster and the similarities between the cluster’s players using a radar plot: radar plots are two-dimensional charts suitable for the comparative visualization of multiple (quantitative) variables. When more than one case is visualized, it is the shape of each radar plot that makes the visual comparison possible: the more similar the players’ shapes are, the more similar the players themselves are to each other. However, radar plots can become too complex and illegible when they include too many overlapping shapes, as in our case, where we needed to visualize data for at least ten athletes (the Tsitsipas cluster). For this reason, we chose an interactive design, so that users can deselect players or select specific players for comparison.
To visualize the distribution of the 1,602 players in the dataset based on, for example, their winnings in relation to the seasons they have played, the “hexbin plot” was selected: the hexbin plot is essentially an alternative to the scatterplot, which is usually chosen to visualize the relationship between two quantitative variables. When there are so many data points that overlapping markers hide information, the hexbin plot is appropriate, as the data is binned on a hexagonal grid: in this case, the darker the blue color of a hexbin, the more players are concentrated at that point.
This is the result of a collaboration between iMEdD Lab and the AUEB Sports Analytics Group research team, which aims to promote robust quantitative analysis in sports on both an academic and a professional level. The team works in the field of “Sports Analytics”, which includes the creation of statistical models and the production of predictions regarding sports results, draws on sports economics, and uses tools such as performance analysis and the visualization and measurement of competitive balance.
Translation: Anatoli Stavroulopoulou