Analysis of Stefanos Tsitsipas’ professional trajectory

How we grouped 1,602 tennis players to find out who is most similar to Stefanos Tsitsipas and to predict his career path over the next decade, by applying simple machine learning techniques and statistical methods.

Data collection

Primary data was collected in March 2020 from the Ultimate Tennis Statistics website, which is licensed under a Creative Commons license (CC BY-NC-SA 4.0) and is based on open source software available on GitHub.

More specifically, we collected annual data for the period 2000 to March 2020 from the Association of Tennis Professionals (ATP) world rankings, as published on the Ultimate Tennis Statistics website. We then proceeded to collect profile data published on the same website, regarding 3,912 tennis players included in these rankings.

For example, as can be seen from Stefanos Tsitsipas’ stats profile on the website, each athlete’s “profile data” includes, among other things, the player’s age, the year he turned professional, the seasons he has played, his “backhand”, his favorite surface, the prize money he has earned, the titles he has won, the best rank he has achieved, his current rank and his Elo rank.

The aim was to estimate the progress of a specific player (Stefanos Tsitsipas) using data on the progress of “similar” players. We used the available data to create a single dataset with a set of variables for each player, which we could analyze statistically in order to group athletes by their degree of similarity to Tsitsipas and make predictions about his potential career path.

Data cleaning

The constructed dataset can be described as a matrix with p columns and n rows. Each row corresponds to a player and the columns contain the values of the variables that we aim to utilize in order to obtain the desired predictions.

Before conducting any statistical analysis, we computed the proportion of missing values across the dataset’s rows and columns. We found that, in several cases, missing values accounted for more than 50% of a row’s or a column’s contents. We therefore deleted those rows and columns, since the information they would provide would be negligible. The resulting dataset consisted of 1,602 rows (players) and 20 variables (player names and characteristics). The missing values that remained in this dataset after the deletion were imputed using machine learning techniques and the statistical programming language R.
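The filtering rule described above can be sketched as follows. This is a minimal illustration in Python rather than the R used for the analysis; the function names, the toy data and the order of operations (rows first, then columns) are assumptions made for the example, not details taken from the study.

```python
def missing_fraction(values):
    """Share of entries in a list that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse(rows, threshold=0.5):
    """Drop rows, then columns, whose share of missing values exceeds `threshold`.

    Assumption: rows are filtered before columns; the text does not state the order.
    """
    kept_rows = [r for r in rows if missing_fraction(r) <= threshold]
    if not kept_rows:
        return []
    n_cols = len(kept_rows[0])
    kept_cols = [
        j for j in range(n_cols)
        if missing_fraction([r[j] for r in kept_rows]) <= threshold
    ]
    return [[r[j] for j in kept_cols] for r in kept_rows]


# Hypothetical toy data: 4 players x 3 variables, None marks a missing value.
toy = [
    [1.0, None, 3.0],
    [None, None, 2.0],  # 2 of 3 values missing, so this row is dropped
    [4.0, None, 1.0],
    [5.0, 2.0, 6.0],
]
cleaned = drop_sparse(toy)
print(cleaned)  # the second row and the second column are removed
```

Values that are still missing after this step would then be passed on to the imputation stage.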

Dataset sample to be analyzed

Feliciano Lopez777932.27120.520.510.490.65112.
Nicolas Mahut513113.95370.440.410.30.62172.053.0175.
Tommy Robredo636963.5750.60.560.660.54101.
Paolo Lorenzi351977.79330.380.360.420.299.0415.0134.
Ivo Karlovic468826.9140.520.520.420.6393.08.0128.
Roger Federer5618777.8710.820.830.760.8716.
Guillermo Garcia Lopez470796.06230.460.420.50.48501.0132.0115.
Jo Wilfried Tsonga1379538.4450.680.680.640.69399.0106.0231.
Fernando Verdasco913358.4770.570.540.610.55291.
Andreas Seppi587859.78180.480.460.50.57444.0113.
Philipp Kohlschreiber674084.05160.560.540.570.6512.039.0120.
Teymuraz Gabashvili291151.07430.370.360.390.2771.0596.
Rafael Nadal6294819.010.830.780.920.78611.0151.
Jurgen Melzer519159.480.510.510.520.53190.
Dustin Brown264191.91640.390.340.40.45198.0293.0198.
Stan Wawrinka1877237.7230.640.640.670.5489.03.0114.
Richard Gasquet953971.4270.630.620.620.6768.
David Ferrer1749106.1730.660.640.70.63197.0150.
Go Soeda135495.27470.380.370.310.24249.0133.0200.035.033311.816692010943502
Carlos Berlocq296433.93370.410.330.460.25292.0364.
Marcel Granollers772940.57190.450.410.480.45445.0119.0110.
Gilles Muller315361.79210.520.530.460.56305.0280.
Dudi Sela259822.33290.420.450.190.48214.049.0138.
Daniel Gimeno Traver227631.36480.360.250.420.1619.
Julien Benneteau530930.11250.480.490.410.47149.018.0115.
Novak Djokovic10.830.840.80.84493.0108.
Yen Hsun Lu282170.35330.420.420.240.44351.02.0103.
Viktor Troicki575913.2120.520.530.510.52181.0449.0126.
Florian Mayer485266.13180.480.440.50.59480.0143.0215.
Rogerio Dutra Silva166942.0630.320.290.360.01138.0144.0351.
Janko Tipsarevic506824.9480.530.540.520.53453.
Ruben Ramirez Hidalgo167582.5500.340.190.390.01179.
Mischa Zverev403370.93250.40.410.320.535.026.0444.
Mikhail Youzhny713222.580.540.550.510.5755.
Gael Monfils1054753.8860.640.660.610.59686.0209.
Andy Murray4102933.810.770.780.70.84129.0347.
Marco Chiudinelli134908.0520.350.350.260.416.0109.
Marcos Baghdatis557432.3180.560.570.430.58376.038.0104.
Gilles Simon869037.8860.580.580.580.57310.
Sergiy Stakhovsky314649.47310.450.460.380.48198.0151.
Simone Bolelli360726.07360.430.360.480.52354.019.0123.
Stephane Robert218865.73500.340.360.310.33573.0103.
Radek Stepanek597024.4280.560.560.550.6265.0479.
Jaroslav Pospisil179194.671030.120.330.010.01157.099.0338.039.031112.096228035777099
Lukasz Kubot590473.08410.430.360.470.5147.
Jan Mertl300402.51630.670.990.587.052.0229.
Daniel Munoz De La Nava138074.38680.240.30.22205.0241.0276.
Frank Dancevic117914.62650.380.380.050.48233.
Lamine Ouahab38605.171140.50.380.53361.076.0114.
Giovanni Lapentti49029.01100.340.460.240.33505.0142.0149.
Fabio Fognini841913.3890.540.480.590.51415.058.0152.
Flavio Cipolla203031.25700.350.380.360.12319.
Filippo Volandri263308.73250.440.130.540.1548.060.0106.
Santiago Giraldo61.57280.450.390.510.43390.0152.
Adrian Menendez Maceiras199829.831110.
Jan Hernych143583.67590.40.420.320.4893.
Maximo Gonzalez203559.91580.320.190.370.0150.0250.0265.
Michal Przysiezny103031.92570.
Albert Montanes345078.82220.470.30.530.32109.
Nicolas Almagro672014.6290.590.470.660.47106.0582.
Konstantin Kravchuk114634.22780.
Lleyton Hewitt1043996.710.70.70.640.766.
Tomas Berdych1734784.040.650.650.630.68285.
Igor Sijsling216377.5520.360.350.30.44101.0481.
Teodor Dacian Craciun218141.0253.
Denis Istomin372861.12330.470.460.450.54662.
Steve Darcis220764.13380.470.470.470.46114.0215.0330.
Kevin Anderson1163846.7150.590.610.530.59101.0249.0296.
Dmitry Tursunov394675.0200.510.520.40.57146.0146.0222.
Michael Berrer212823.15420.380.410.30.29127.0152.
Tobias Kamke216612.27640.380.370.360.4693.0271.0235.
Paul Henri Mathieu370534.88120.480.460.490.42125.0114.
Frederico Gil145393.3620.470.390.550.12344.0125.
Malek Jaziri294278.0420.410.420.430.3168.0201.0158.
Ernests Gulbis453547.06100.510.50.520.38277.
Sam Querrey794143.47110.560.560.450.64485.
Blaz Kavcic163640.09680.390.370.420.2458.060.0106.
Toshihide Matsui98480.672610.670.6237.072.0223.041.033111.4976155642471
Juan Monaco577459.79100.560.460.630.39315.0146.0251.
Robin Haase508178.29330.460.440.510.4502.
Michael Russell156858.0600.340.370.260.368.070.0345.
Jimmy Wang87971.23850.460.460.290.39516.0197.
Lukas Lacko232969.07440.40.420.20.42190.092.0186.
Tommy Haas680499.3520.630.640.590.6315.
Pablo Andujar411965.85320.410.290.510.12611.0146.
Adrian Mannarino531070.77220.460.460.260.59457.0122.0190.
Matthias Bachinger142133.73850.360.390.370.22316.0159.
Lukas Rosol300705.64260.440.390.520.4328.
Jeremy Chardy577229.53250.50.480.530.48303.069.0117.
Matteo Viola123883.81180.240.330.010.01142.0154.0402.
Marc Fornell Mestres2360.010.0194.0178.066.038.0311
Pablo Cuevas530023.94190.530.410.60.41480.0124.0117.
Potito Starace315379.17270.460.310.530.0884.0200.
Victor Hanescu306932.21260.450.340.540.41265.040.0102.
Ivo Klec68838.171840.220.20.25282.0309.0264.
Leonardo Mayer499417.46210.480.430.530.46453.
Carlos Salamanca55554.831370.380.010.75197.048.0446.
Donald Young308544.0380.40.420.220.3859.0394.
Alejandro Falla193833.19500.40.390.430.39291.0148.0104.

Handling missing values

Imputation of missing values

#clear the R environment
rm(list = ls())
#load the missForest R library to impute the dataset
library(missForest)
#read data for imputation
cleaned_data = read.csv("cleaned_raw_data.csv")
#remove some unused variables (keep columns 2 to 6 and 8 to 10)
mydata = cleaned_data[, c(2:6, 8:10)]
#log and logit transformations when needed
mydata[, 1:2] = log(mydata[, 1:2])
mydata[, 3:5] = log(mydata[, 3:5] / (1 - mydata[, 3:5]))
#impute missing values
mydata.imp <- missForest(mydata)$ximp

In order to impute the remaining missing values of the dataset, the missForest R package was used (see “MissForest – non-parametric missing value imputation for mixed-type data”). It is a machine learning method based on the Random Forest algorithm, which can briefly be described as a large number of “decision trees” combined into a single predictor.

Random Forest can be explained with the following simple example: suppose we have a sample of people whose gender is unknown to us (missing value) and we wish to predict it from their height and weight, which we already know. We construct a “decision tree” with two “nodes” for this simple working hypothesis: the first node consists of a question regarding the height of each person and, if the height is greater than 1.80 m, the person is classified as “man”. Conversely, if the person’s height is less than 1.80 m, we move on to the second node, which is made up of a question about the weight of the person, in order to decide: if the person weighs less than 70 kg, they are classified as “woman”. In more complicated examples, the number of nodes is increased and a large number of decision trees are combined in order to make the desired prediction.
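The toy tree from this example, and the idea of combining several trees, can be written down in a few lines. This is only an illustrative Python sketch: the final “man” branch, the treatment of a height of exactly 1.80 m, and the two extra trees with their thresholds are assumptions added for the example, since the text does not specify them.

```python
from collections import Counter


def toy_tree(height_m, weight_kg):
    """The two-node decision tree from the example."""
    if height_m > 1.80:   # first node: question about height
        return "man"
    if weight_kg < 70:    # second node: question about weight
        return "woman"
    return "man"          # assumption: this branch is not stated in the text


def tree_b(height_m, weight_kg):
    """Hypothetical second tree that only looks at height."""
    return "man" if height_m > 1.75 else "woman"


def tree_c(height_m, weight_kg):
    """Hypothetical third tree that only looks at weight."""
    return "man" if weight_kg >= 75 else "woman"


def forest_vote(trees, height_m, weight_kg):
    """Combine several trees by majority vote, as a forest does for classification."""
    votes = Counter(tree(height_m, weight_kg) for tree in trees)
    return votes.most_common(1)[0][0]


print(toy_tree(1.85, 80))
print(forest_vote([toy_tree, tree_b, tree_c], 1.78, 72))
```

In a real Random Forest the trees are not written by hand: each one is learned from a random subset of the data and of the variables, which is what makes their combined vote more robust than any single tree.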

In the case of missing values, the algorithm is trained on the available data and produces estimated values, which replace the missing ones. In this dataset, on which the statistical analysis for this story is based, we have included a common list of characteristics for each player (ranking positions, prize money, “backhand”, favorite surface, age turned professional, etc.). Each characteristic is a variable, i.e. a column in the dataset.

Let us assume we lack data on a player’s prize money. We have:

  • missing values for prize money received by player a
  • available values for other characteristics of player a
  • available values for some of the characteristics on the list, in the case of other players
  • missing values for other characteristics on the list, in the case of other players

We sort the players’ characteristics (variables) in ascending order of the number of missing values they contain, starting with the most complete variable and finishing with the most incomplete one.

Based on this ordering, missing values for each player x and for each of his characteristics (variables) are imputed by using the Random Forest algorithm, which relies on: a) the available values for other characteristics of player x and b) the available values of other players for the characteristic for which player x has a missing value. A similar process is followed for each variable containing missing values and is repeated several times, until it meets specific performance criteria.
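The overall loop described above (order the variables, start from a rough guess, then repeatedly re-estimate each missing value from the other variables) can be sketched as follows. To keep the example short and self-contained, a nearest-neighbour lookup stands in for the Random Forest predictor, so this is not the missForest algorithm itself; the function name, the stopping rule and the toy data are likewise assumptions made for illustration.

```python
def iterative_impute(rows, max_iter=10, tol=1e-9):
    """Iteratively fill None entries, visiting variables from fewest to most missing."""
    n, p = len(rows), len(rows[0])
    missing = [[rows[i][j] is None for j in range(p)] for i in range(n)]
    data = [list(r) for r in rows]

    # Initial guess: replace each missing entry with the mean of its column.
    for j in range(p):
        observed = [rows[i][j] for i in range(n) if not missing[i][j]]
        col_mean = sum(observed) / len(observed)
        for i in range(n):
            if missing[i][j]:
                data[i][j] = col_mean

    # Variables with missing values, sorted by how many values they lack.
    order = sorted(
        (j for j in range(p) if any(missing[i][j] for i in range(n))),
        key=lambda j: sum(missing[i][j] for i in range(n)),
    )

    for _ in range(max_iter):
        change = 0.0
        for j in order:
            donors = [i for i in range(n) if not missing[i][j]]
            for i in range(n):
                if not missing[i][j]:
                    continue
                # Stand-in predictor: copy variable j from the donor that is
                # closest on all the other variables (not a Random Forest).
                nearest = min(
                    donors,
                    key=lambda d: sum(
                        (data[i][k] - data[d][k]) ** 2 for k in range(p) if k != j
                    ),
                )
                change += abs(data[i][j] - data[nearest][j])
                data[i][j] = data[nearest][j]
        if change < tol:  # simplified stopping rule: estimates no longer move
            break
    return data


# Hypothetical toy data: two variables, one missing value in each.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [None, 10.0]]
imputed = iterative_impute(toy)
print(imputed)
```

The point of the sketch is the structure of the iteration: every pass uses the estimates from the previous pass as inputs, which is why the process has to be repeated until the imputed values settle.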

Finding similarities between players

Clustering players by similarity and selecting Tsitsipas’ cluster

#clear the R environment
rm(list = ls())
#load R libraries
library(missForest) #to impute missing values
library(KRLS) #to compute RBF
library(irlba) #to compute partial eigen decomposition
library(spam) #to facilitate linear algebra computations
library(spam64) #to facilitate linear algebra computations

#read and clean the data
cleaned_data = read.csv("cleaned_raw_data.csv")
names = cleaned_data$names
#keep columns 2 to 6 and 8 to 10
mydata = cleaned_data[, c(2:6, 8:10)]

#log and logit transformations
mydata[, 1:2] = log(mydata[, 1:2])
mydata[, 3:5] = log(mydata[, 3:5] / (1 - mydata[, 3:5]))

#impute missing values
mydata.imp <- missForest(mydata)$ximp

#compute similarity matrix
#(sparsity can be introduced here, to facilitate linear algebra computations, by setting low similarities equal to 0)
mysimils = gausskernel(mydata.imp, sigma = 1)

#compute regularized Laplacian of similarity matrix
diag(mysimils) = 0
N = dim(mysimils)[1]
DD = rep(NA, N)
for(i in 1:N) DD[i] = sum(mysimils[i,])

#regularize the degrees by adding their mean
DD = DD + mean(DD)
myd = diag.spam(1/sqrt(DD))
tSimils = myd %*% mysimils %*% myd

#compute partial spectral decomposition of Laplacian
#K (the number of eigenvectors to compute) must be defined before this call; its value is not set in this excerpt
myeigen = partial_eigen(as.dgCMatrix.spam(tSimils), n = K, symmetric = TRUE)
U = myeigen$vectors

#normalize each row of the eigenvector matrix to have unit length
scalar1 <- function(x) {x / sqrt(sum(x^2))}
Ustar = matrix(NA, dim(U)[1], dim(U)[2])
for(i in 1:dim(U)[1]){
  Ustar[i,] = scalar1(U[i,])
}

#run k-means
km = kmeans(Ustar, centers = 150, iter.max = 1000)

#find Tsitsipas cluster
size_clsts = km$size
clsts = which(size_clsts >= 2)
same_players = list()
tsitsipas_clst = rep(0, length(clsts)) #flags the cluster that contains Tsitsipas
for(i in 1:length(clsts)){
  same_players[[i]] = which(km$cluster == clsts[i])
  if(length(intersect(same_players[[i]], which(names == "Stefanos Tsitsipas"))) > 0)
    tsitsipas_clst[i] = 1
}
select_tsitsip_clst = same_players[[which(tsitsipas_clst > 0)]]
#we exclude the fifth member of the cluster because he seems to be an outlier: he had two injuries, stopping and starting again
select_tsitsip_clst = select_tsitsip_clst[-5]
tsitsipas_clst_data = cleaned_data[select_tsitsip_clst,]
tsitsipas_clst_names = names[select_tsitsip_clst]

The whole script, predictions included, is available on GitHub.

After replacing the missing values, the dataset of 1,602 rows (one row per player) and 20 columns (one column per player characteristic) was complete. For each possible pair of players, a similarity index was calculated using the radial basis function (RBF) kernel. This similarity index is a positive number that indicates the degree of similarity between a player and any other athlete in the dataset.

Subsequently, a “similarity matrix” is created, i.e. a matrix that holds this similarity index for every pair of players. It is a 1,602 x 1,602 table in which rows and columns correspond to the study’s players, thus forming all possible pairs of players: each cell of the similarity matrix contains the similarity index between the two tennis players it refers to.
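To make the similarity index concrete, here is a small Python sketch of an RBF similarity matrix. It uses the form exp(-||x - y||² / sigma), which is one common parameterisation; whether this matches the exact scaling of the R function used in the analysis is an assumption, and the three toy “players” are hypothetical.

```python
import math


def rbf_similarity(x, y, sigma=1.0):
    """RBF similarity: 1 for identical vectors, approaching 0 as they move apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / sigma)


def similarity_matrix(players, sigma=1.0):
    """Square matrix whose cell (i, j) holds the similarity of players i and j."""
    return [[rbf_similarity(p, q, sigma) for q in players] for p in players]


# Hypothetical feature vectors for three players (two variables each).
players = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]]
S = similarity_matrix(players)

for row in S:
    print([round(value, 4) for value in row])
```

The matrix is symmetric, its diagonal equals 1, and the first two players, whose features are almost identical, receive a similarity close to 1, while the third is practically at 0 from both.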

Next, in order for players who are similar to each other to be categorized in groups, we adopt the cluster analysis technique and, in particular, the technique of spectral clustering – the similarity matrix is quite large (1,602 x 1,602) and it would be easier to work with a matrix of smaller dimensions, which, however, would contain as much information as possible from the original similarity matrix.

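We follow these steps:

  • we compute a regularized Laplacian of the similarity matrix
  • we compute a partial spectral (eigen) decomposition of this Laplacian, keeping only a small number of eigenvectors
  • we normalize the rows of the resulting eigenvector matrix to unit length
  • we run the k-means algorithm on these normalized rows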

We end up with 150 clusters of similar players and we choose the one Stefanos Tsitsipas belongs to, in order to work on predictions for the player’s ranks.
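Two of the operations in the R script above, the regularized normalization of the similarity matrix and the scaling of rows to unit length, are easy to follow on a tiny example. The Python sketch below mirrors those two steps only; the eigen decomposition and k-means are left out, and the function names and toy matrices are assumptions made for illustration.

```python
import math


def regularized_normalize(S):
    """Return D^(-1/2) A D^(-1/2), where A is S with a zeroed diagonal and
    D holds the row sums of A, each increased by the mean row sum."""
    n = len(S)
    A = [[0.0 if i == j else S[i][j] for j in range(n)] for i in range(n)]
    degrees = [sum(row) for row in A]
    tau = sum(degrees) / n  # regularization term: the mean degree
    scale = [1.0 / math.sqrt(d + tau) for d in degrees]
    return [[scale[i] * A[i][j] * scale[j] for j in range(n)] for i in range(n)]


def unit_rows(U):
    """Scale every row of U to unit Euclidean length (all-zero rows are kept as is)."""
    result = []
    for row in U:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm for v in row] if norm > 0 else list(row))
    return result


# Hypothetical 3 x 3 similarity matrix.
S = [[1.0, 0.5, 0.1],
     [0.5, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
L = regularized_normalize(S)
rows = unit_rows([[3.0, 4.0], [0.0, 2.0]])
print(L)
print(rows)
```

Adding the mean degree before taking the inverse square root keeps players with very few strong similarities from dominating the normalized matrix, which is the usual motivation for this kind of regularization.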

In fact, the cluster to which Tsitsipas belongs includes players who have had at least 15 years of experience, whereas his own professional career spans five years so far. This is crucial for the prediction method, precisely because our prediction technique is based on the pre-existing career paths of similar players.

Predictions method

Considering that the research question is “what world ranking positions Stefanos Tsitsipas might occupy in the future” and given that he has been a professional player for the last five years, we created a prediction model over a ten-year horizon by applying the statistical technique of the sample mean.

We predict the player’s performance over the next ten years, based on the average performance of similar players, who, however, have longer career spans.

In other words, we use the actual ranking data of the players Tsitsipas is most comparable with, based on their performance after ten and 15 years of their careers, in order to create estimates for Tsitsipas’ performance over the next decade.
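The sample-mean prediction itself is a simple computation: for each future career season, average the ranks that the similar players held in that same season of their own careers. A minimal Python sketch, with a hypothetical function name and made-up ranks:

```python
def predict_by_sample_mean(peer_ranks):
    """Average the peers' ranks season by season.

    peer_ranks: one list per similar player, aligned by career season.
    """
    n_seasons = min(len(ranks) for ranks in peer_ranks)
    return [
        sum(ranks[s] for ranks in peer_ranks) / len(peer_ranks)
        for s in range(n_seasons)
    ]


# Hypothetical ranks of three similar players in career seasons 6, 7 and 8.
peers = [
    [12, 8, 5],
    [20, 15, 9],
    [10, 10, 7],
]
forecast = predict_by_sample_mean(peers)
print(forecast)  # [14.0, 11.0, 7.0]
```

Each value in the forecast is the predicted rank for the corresponding future season of the player of interest.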

In order for the sample mean (that is, our prediction) not to be affected by any outliers, and despite the fact that the Tsitsipas cluster consists of only ten members (himself and nine more players), we look for sub-groups of similar athletes within the cluster Stefanos Tsitsipas was assigned to. By computing the Euclidean distance of each player from the centroid of the cluster (as calculated by the k-means algorithm itself), we find:

  • sub-group of players a, which is located farther from the centroid of the cluster – these are players who tend to rank slightly better compared to the average positions held by the players of this category during each season. Stefanos Tsitsipas also appears in sub-group a.  
  • sub-group of players b, which is close to the centroid of the cluster – that is, players whose world ranking positions are close to the average positions held by all players of the category in a given season
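The split into a “far” and a “near” sub-group can be illustrated as follows. The text does not say where the line between the two sub-groups was drawn, so the median distance used here is purely an assumption, as are the function names and the toy coordinates.

```python
import math
from statistics import median


def centroid(points):
    """Coordinate-wise mean of a set of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def split_by_distance(points):
    """Return (far, near): indices of points farther than, or within,
    the median distance from the centroid."""
    center = centroid(points)
    distances = [math.dist(p, center) for p in points]
    cut = median(distances)  # assumption: the text gives no cut-off value
    far = [i for i, d in enumerate(distances) if d > cut]
    near = [i for i, d in enumerate(distances) if d <= cut]
    return far, near


# Hypothetical cluster members in a two-dimensional feature space.
members = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0], [6.0, 6.0]]
sub_group_a, sub_group_b = split_by_distance(members)
print(sub_group_a, sub_group_b)  # [0, 4] [1, 2, 3]
```

Here the centroid is (2, 2), so the first and last points land in the “far” sub-group and the remaining three in the “near” one.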

In this context, players belonging to the Tsitsipas cluster are selected from these two different sub-groups for test applications of the prediction model: by using data on the first five years of their career, we make “pseudo predictions” for their career path over the next years, which we compare to the actual data on their world ranking. In this way, the margin of error of our predictions emerges.
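The “pseudo prediction” test amounts to hiding the later seasons of a player whose career is already known, predicting them from the peers, and measuring the gap. The sketch below summarises that gap as a mean absolute error; this particular error measure, the function name and all the ranks are assumptions for illustration, since the text does not specify how the margin of error was computed.

```python
def pseudo_prediction_error(actual_ranks, peer_ranks, known_seasons=5):
    """Mean absolute gap between a player's actual ranks after `known_seasons`
    and the sample mean of the peers' ranks in the same career seasons."""
    errors = []
    for s in range(known_seasons, len(actual_ranks)):
        peers_at_s = [ranks[s] for ranks in peer_ranks if len(ranks) > s]
        if not peers_at_s:
            break
        predicted = sum(peers_at_s) / len(peers_at_s)
        errors.append(abs(predicted - actual_ranks[s]))
    return sum(errors) / len(errors) if errors else None


# Hypothetical career ranks, one value per season.
actual = [300, 150, 80, 40, 20, 10, 8]
peers = [
    [250, 120, 60, 30, 15, 12, 6],
    [400, 200, 90, 50, 25, 16, 10],
]
error = pseudo_prediction_error(actual, peers)
print(error)  # 2.0
```

In this toy case the peers’ mean is 14 in season six (actual rank 10) and 8 in season seven (actual rank 8), giving an average miss of two ranking places.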

The evolution of ranking for four selected players “similar” to Tsitsipas over the first five seasons of their career.

More specifically, for the purposes of testing the model, the following are selected:

  • Alexander Zverev, who, despite having only turned professional seven years ago, belongs to sub-group a (like Stefanos Tsitsipas). He also has a very similar trajectory to Tsitsipas and is one of his major rivals today.
  • Stan Wawrinka, David Ferrer and Tomas Berdych, who, on the one hand, have more than 15 years of professional experience and, on the other, belong to sub-group b

For Zverev (sub-group a), the test is applied using a sample mean that is computed only from the positions held by the four top-ranked players of the cluster in a given season. For Wawrinka, Ferrer and Berdych (sub-group b), the “pseudo predictions” are made by calculating the sample mean of the data of all cluster players in each season.

As shown in the graph below, the actual ranking of players is usually within our model’s prediction intervals.

Testing our prediction model on four selected players from Tsitsipas’ cluster

For the predictions of Stefanos Tsitsipas’ world ranking performance in the next decade, the prediction method that was applied in the very similar case of Zverev is chosen: for the calculation of the sample mean, lower ranked players compared to the cluster median are excluded and the model takes into account the positions held by the top performing players each time.
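The variant of the sample mean used for this case can be sketched in Python as follows: in each season, players ranked worse than the cluster median are dropped before averaging. Keeping a player whose rank equals the median is an assumption, and the function name and ranks are hypothetical.

```python
from statistics import median


def top_half_mean(ranks_in_season):
    """Mean rank of the players ranked at or better than the cluster median.

    A numerically smaller rank is a better rank.
    """
    cut = median(ranks_in_season)
    top = [r for r in ranks_in_season if r <= cut]  # assumption: ties are kept
    return sum(top) / len(top)


# Hypothetical ranks of six cluster members in one career season.
season_ranks = [4, 6, 8, 20, 45, 60]
prediction = top_half_mean(season_ranks)
print(prediction)  # 6.0
```

Compared with the plain mean of all six ranks (about 23.8), restricting the average to the better half gives a noticeably more optimistic prediction, which is why the choice of method matters for a fast-rising player.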

Taking a closer look at Zverev’s case and the extent to which the chosen method of calculating the predictions affects the estimated trajectory of Tsitsipas, it is worth noting the following:

Although Zverev finished his sixth professional year ranked No. 4, the model had predicted he would finish 11th (mean prediction) in that season, which is the first season of prediction. In the following seasons, by contrast, the prediction model is “aligned” with Zverev’s actual achievements. The “failure” of the first prediction can be explained if one takes into account that Zverev’s trajectory resembles that of the other players in the group only after the third season of his career. The rapid ascent in the world rankings that characterized both Zverev and Tsitsipas in the first years of their careers is not observed in the otherwise similar athletes whose data “feeds” the model and affects the estimated value for the first prediction season.

It should be noted that a multivariate modelling strategy would be useful for making even more realistic predictions, which is why we suggest this strategy is applied in future studies: in tennis, an athlete’s trajectory significantly depends on their opponents’ progress. In the present model, the working hypothesis is that a player (p) will follow a similar trajectory as his peers’ (n), but it is not assumed that his opponents will have a similar performance to his peers’ (n) opponents considered here.

Data visualization

Charts that visualize statistical data analysis were created using the Python programming language and the statistical programming language R.

More specifically, the Plotly Python Open Source Graphing Library was used to create interactive charts, while static graphs were created using the ggplot2 package in R.

Radar plot representing Tsitsipas’ cluster and similarity between players. The more similar the shapes are, the more similar the players are.

We specifically chose to visualize the Tsitsipas cluster and the similarities between the cluster’s players using a radar plot: radar plots are two-dimensional charts suitable for the comparative visualization of multiple (quantitative) variables. Especially when more than one case is visualized, it is the shape of the radar plot that serves the purpose of the visual comparison. In this context, the more similar the players’ shapes are, the more similar the players themselves are to each other. However, radar plots can become too complex and illegible when they include too many overlapping shapes, as in our case, where we needed to visualize data for at least ten athletes (Tsitsipas cluster). For this reason, an interactive design was chosen, so that users can deselect players or select specific players for comparison.

Hexbin plot showing players’ prize money vs number of seasons played; the color of each hexbin reflects the number of players

To visualize the distribution of the 1,602 players in the dataset based on, for example, their winnings in relation to seasons played, the “hexbin plot” was selected: the hexbin plot is essentially a different approach to the scatterplot, which is usually chosen to visualize the relationship between two quantitative variables. When there are so many data points that they overlap and hide part of the information, the “hexbin plot” is appropriate, as the data is plotted on a grid of hexagons: in this case, the darker the blue color of each hexbin, the more players are concentrated at that point.

This is the result of a collaboration between iMEdD Lab and AUEB Sports Analytics Group research team aiming to promote robust quantitative analysis in sports, on both an academic and a professional level. The team works in the field of “Sports Analytics”, which includes the creation of statistical models and the production of predictions regarding sports results, draws on sports economics, and uses such tools as performance analysis, and visualization and measurement of competitive balance.

Translation: Anatoli Stavroulopoulou

Creative Commons Non Commercial International License logo