Methodology

Analysis of Stefanos Tsitsipas’ professional trajectory


How we grouped 1,602 tennis players to find out who is most similar to Stefanos Tsitsipas, and how we predicted his career path over the next decade by applying simple machine learning techniques and statistical methods.

Data collection

Primary data was collected in March 2020 from the Ultimate Tennis Statistics website, which is licensed under a Creative Commons license (CC BY-NC-SA 4.0) and is based on open source software available on GitHub.

More specifically, we collected annual data for the period 2000 to March 2020 from the Association of Tennis Professionals (ATP) world rankings, as published on the Ultimate Tennis Statistics website. We then proceeded to collect profile data published on the same website, regarding 3,912 tennis players included in these rankings.

For example, as Stefanos Tsitsipas’ stats profile on the website shows, each athlete’s “profile data” includes, among other things, the player’s age, the year he turned professional, the seasons he has played, his “backhand”, his favorite surface, the prize money he has earned, the titles he has won, the best rank he has achieved, his current rank and his Elo rank.

The aim was to estimate the progress of a specific player (Stefanos Tsitsipas), using data on the progress of “similar” players. We used the available data to create a single dataset with a set of variables for each player, which we could statistically analyze in order to group athletes by their degree of similarity to Tsitsipas and to make predictions about his potential career path.

Data cleaning

The constructed dataset can be described as a matrix with p columns and n rows. Each row corresponds to a player and the columns contain the values of the variables that we aim to utilize in order to obtain the desired predictions.

Before conducting any statistical analysis, we computed the proportion of missing values in each of the dataset’s rows and columns. In several cases, more than 50% of a row’s or column’s contents were missing. We deleted these rows and columns, since the information they would provide would be negligible. The resulting dataset consisted of 1,602 rows (players) and 20 variables (player names and characteristics). The missing values that remained in this dataset after the deletion were imputed using machine learning techniques in the statistical programming language R.
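
The filtering rule described above can be illustrated with a minimal Python sketch. It is not the script used for the analysis (which was written in R); the function names, the toy table and the choice to drop columns before rows are assumptions made purely for illustration.

```python
from typing import List, Optional

Row = List[Optional[float]]


def missing_fraction(values: Row) -> float:
    """Share of entries in a row or column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse(table: List[Row], threshold: float = 0.5) -> List[Row]:
    """Drop columns, then rows, in which more than `threshold` of the values are missing.

    Assumption: columns are filtered first and rows second; the article
    does not say in which order the deletions were made.
    """
    n_cols = len(table[0])
    keep_cols = [
        j for j in range(n_cols)
        if missing_fraction([row[j] for row in table]) <= threshold
    ]
    reduced = [[row[j] for j in keep_cols] for row in table]
    return [row for row in reduced if missing_fraction(row) <= threshold]


# Hypothetical toy table: 4 players (rows) and 4 variables (columns).
toy = [
    [1.0, None, 3.0, None],
    [2.0, None, None, None],
    [4.0, 5.0, 6.0, None],
    [7.0, None, 9.0, None],
]
cleaned = drop_sparse(toy)
```

In this toy table the second and fourth columns are removed because more than half of their values are missing. The second player is kept, since exactly half (not more than half) of his remaining values are missing.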

Dataset sample to be analyzed

names, prize_money, best_rank, overall_surface_pct, hard_surface_pct, clay_surface_pct, grass_surface_pct, ranksDiff1, ranksDiff2, ranksDiff3, best_rank_std, age_turned_pro, age, plays, backhand, favorite_surface, hard_titles_std, clay_titles_std, titles_std, prize_money_std
Feliciano Lopez777932.27120.520.510.490.65112.097.034.018.015.038.02290.09090909090909090.04545454545454550.3181818181818179513.5643947428156
Nicolas Mahut513113.95370.440.410.30.62172.053.0175.014.018.038.03290.213.1482532242432
Tommy Robredo636963.5750.60.560.660.54101.01.09.08.015.037.03240.04761904761904760.5238095238095240.571428571428571113.364467742966001
Paolo Lorenzi351977.79330.380.360.420.299.0415.0134.014.021.038.033150.07142857142857140.071428571428571412.7713233559987
Ivo Karlovic468826.9140.520.520.420.6393.08.0128.08.021.041.03290.190476190476190.04761904761904760.3809523809523810413.057988896144801
Roger Federer5618777.8710.820.830.760.8716.07.04.06.016.038.03293.086956521739130.47826086956521714.478260869565219515.5416247373677
Guillermo Garcia Lopez470796.06230.460.420.50.48501.0132.0115.09.018.036.032160.117647058823529020.176470588235293960.29411764705882413.062180285599199
Jo Wilfried Tsonga1379538.4450.680.680.640.69399.0106.0231.08.018.034.03361.06250.06251.12514.137259537419801
Fernando Verdasco913358.4770.570.540.610.55291.064.073.08.017.036.02340.105263157894736990.2631578947368420.36842105263157913.724883711215199
Andreas Seppi587859.78180.480.460.50.57444.0113.094.011.018.036.03390.05555555555555560.05555555555555560.16666666666666713.2842437290547
Philipp Kohlschreiber674084.05160.560.540.570.6512.039.0120.011.017.036.032140.05263157894736840.3157894736842110.42105263157894713.421110085383699
Teymuraz Gabashvili291151.07430.370.360.390.2771.0596.023.015.015.034.0331212.5815975523401
Rafael Nadal6294819.010.830.780.920.78611.0151.02.07.014.033.02341.15789473684211023.10526315789474034.4736842105263215.655237472068698
Jurgen Melzer519159.480.510.510.520.53190.077.012.012.017.038.03320.20.050.2513.159966244087999
Dustin Brown264191.91640.390.340.40.45198.0293.0198.014.017.035.033912.484431049859701
Stan Wawrinka1877237.7230.640.640.670.5489.03.0114.012.016.034.032150.50.388888888888888950.888888888888889114.4453119564572
Richard Gasquet953971.4270.630.620.620.6768.014.091.05.015.033.03290.4210526315789470.1578947368421050.789473684210525913.7683889919104
David Ferrer1749106.1730.660.640.70.63197.0150.012.013.017.037.03340.6666666666666670.72222222222222211.514.3746154554174
Go Soeda135495.27470.380.370.310.24249.0133.0200.035.033311.816692010943502
Carlos Berlocq296433.93370.410.330.460.25292.0364.010.011.018.037.03240.142857142857143020.1428571428571430212.5995796395367
Marcel Granollers772940.57190.450.410.480.45445.0119.0110.09.016.033.033160.07142857142857140.2142857142857140.2857142857142860313.5579574423371
Gilles Muller315361.79210.520.530.460.56305.0280.060.016.017.036.02360.05263157894736840.1052631578947369912.661475798423199
Dudi Sela259822.33290.420.450.190.48214.049.0138.07.016.034.032612.4677533302564
Daniel Gimeno Traver227631.36480.360.250.420.1619.012.075.09.018.034.033412.3354827573315
Julien Benneteau530930.11250.480.490.410.47149.018.0115.014.018.038.033313.182385671975801
Novak Djokovic10.830.840.80.84493.0108.062.08.015.032.03363.47058823529412040.8235294117647064.64705882352941
Yen Hsun Lu282170.35330.420.420.240.44351.02.0103.09.017.036.033612.5502662455528
Viktor Troicki575913.2120.520.530.510.52181.0449.0126.05.020.034.03320.20.213.263712233878
Florian Mayer485266.13180.480.440.50.59480.0143.0215.010.017.036.03390.06666666666666670.13333333333333313.0924527410764
Rogerio Dutra Silva166942.0630.320.290.360.01138.0144.0351.014.019.036.0321512.025401725685198
Janko Tipsarevic506824.9480.530.540.520.53453.022.044.010.017.035.03320.176470588235293960.05882352941176470.2352941176470590113.1359209369523
Ruben Ramirez Hidalgo167582.5500.340.190.390.01179.013.061.08.020.042.033412.029231046304
Mischa Zverev403370.93250.40.410.320.535.026.0444.012.017.032.02390.071428571428571412.9076118394366
Mikhail Youzhny713222.580.540.550.510.5755.026.011.09.016.037.03230.30.150.513.477548712426401
Gael Monfils1054753.8860.640.660.610.59686.0209.016.012.017.033.033100.5294117647058820.05882352941176470.58823529411764713.8688180085766
Andy Murray4102933.810.770.780.70.84129.0347.047.011.017.032.03392.26666666666666970.23.0666666666666715.227212836758499
Marco Chiudinelli134908.0520.350.350.260.416.0109.033.010.018.038.033311.812348343624999
Marcos Baghdatis557432.3180.560.570.430.58376.038.0104.03.017.034.03360.18750.2513.2310963579044
Gilles Simon869037.8860.580.580.580.57310.053.079.07.017.035.03320.5294117647058820.2941176470588240.82352941176470613.675141993631199
Sergiy Stakhovsky314649.47310.450.460.380.48198.0151.011.07.017.034.03260.176470588235293960.2352941176470590112.659214504542401
Simone Bolelli360726.07360.430.360.480.52354.019.0123.06.017.034.0321612.7958741404096
Stephane Robert218865.73500.340.360.310.33573.0103.027.015.020.039.033212.2962137157501
Radek Stepanek597024.4280.560.560.550.6265.0479.017.010.017.041.03320.2631578947368420.26315789473684213.299713296060801
Jaroslav Pospisil179194.671030.120.330.010.01157.099.0338.039.031112.096228035777099
Lukasz Kubot590473.08410.430.360.470.5147.013.069.08.019.037.0331613.288679325096
Jan Mertl300402.51630.670.990.587.052.0229.05.020.038.031112.6128785210745
Daniel Munoz De La Nava138074.38680.240.30.22205.0241.0276.017.017.038.0231011.835547804446099
Frank Dancevic117914.62650.380.380.050.48233.030.017.04.018.035.032311.6777160822304
Lamine Ouahab38605.171140.50.380.53361.076.0114.07.017.035.033410.561141484307901
Giovanni Lapentti49029.01100.340.460.240.33505.0142.0149.03.019.037.0331010.8001672387612
Fabio Fognini841913.3890.540.480.590.51415.058.0152.015.016.032.03340.06250.50.562513.643432413823302
Flavio Cipolla203031.25700.350.380.360.12319.051.070.09.019.036.0321512.2211151870629
Filippo Volandri263308.73250.440.130.540.1548.060.0106.010.015.038.03240.1333333333333330.13333333333333312.4810825010305
Santiago Giraldo61.57280.450.390.510.43390.0152.034.08.018.032.03344.12017473892312
Adrian Menendez Maceiras199829.831110.250.290.170.25298.076.0208.010.019.034.0331012.2052214333519
Jan Hernych143583.67590.40.420.320.4893.035.034.011.018.040.033911.8746732104669
Maximo Gonzalez203559.91580.320.190.370.0150.0250.0265.07.018.036.033412.2237156385726
Michal Przysiezny103031.92570.290.280.210.3350.0135.0107.013.017.036.032311.542794122114401
Albert Montanes345078.82220.470.30.530.32109.013.03.011.018.039.03340.35294117647058790.352941176470587912.7515281336877
Nicolas Almagro672014.6290.590.470.660.47106.0582.053.08.017.034.03240.81250.812513.418035375220999
Konstantin Kravchuk114634.22780.260.220.50.228.0295.02.012.019.035.0331411.649501642529
Lleyton Hewitt1043996.710.70.70.640.766.01.016.03.017.039.03391.00.11.513.8585668865002
Tomas Berdych1734784.040.650.650.630.68285.068.021.013.016.034.033140.5294117647058820.117647058823529020.764705882352940914.3663934679356
Igor Sijsling216377.5520.360.350.30.44101.0481.083.09.017.032.032912.284779846426801
Teodor Dacian Craciun218141.0253.051.09.017.039.0331
Denis Istomin372861.12330.470.460.450.54662.04.030.08.017.033.03390.06250.12512.8289612968533
Steve Darcis220764.13380.470.470.470.46114.0215.0330.014.018.035.03220.06666666666666670.06666666666666670.13333333333333312.3048501254777
Kevin Anderson1163846.7150.590.610.530.59101.0249.0296.011.020.033.03360.428571428571428940.4285714285714289413.9672412061614
Dmitry Tursunov394675.0200.510.520.40.57146.0146.0222.06.017.037.03330.333333333333333040.4666666666666670612.8858179203999
Michael Berrer212823.15420.380.410.30.29127.0152.026.011.018.039.0221012.2682168181267
Tobias Kamke216612.27640.380.370.360.4693.0271.0235.07.017.033.033912.285864260144
Paul Henri Mathieu370534.88120.480.460.490.42125.0114.047.09.017.038.03330.117647058823529020.2352941176470590112.822702862337
Frederico Gil145393.3620.470.390.550.12344.0125.010.08.017.034.033411.8871977632399
Malek Jaziri294278.0420.410.420.430.3168.0201.0158.016.019.036.0331512.5922801777746
Ernests Gulbis453547.06100.510.50.520.38277.080.08.010.015.031.03330.31250.06250.37513.024854313826099
Sam Querrey794143.47110.560.560.450.64485.067.024.012.018.032.03390.53333333333333290.06666666666666670.66666666666666713.5850194166015
Blaz Kavcic163640.09680.390.370.420.2458.060.0106.07.018.033.0331512.005424722031
Toshihide Matsui98480.672610.670.6237.072.0223.041.033111.4976155642471
Juan Monaco577459.79100.560.460.630.39315.0146.0251.010.017.035.03340.07142857142857140.57142857142857110.642857142857142913.2663940912483
Robin Haase508178.29330.460.440.510.4502.053.02.07.017.032.03340.142857142857143020.1428571428571430213.1385876295539
Michael Russell156858.0600.340.370.260.368.070.0345.09.019.041.0331011.9630962164622
Jimmy Wang87971.23850.460.460.290.39516.0197.026.05.016.035.033311.3847651081883
Lukas Lacko232969.07440.40.420.20.42190.092.0186.08.017.032.033612.358660976955099
Tommy Haas680499.3520.630.640.590.6315.03.06.06.017.041.03220.550.10.7513.430582145893199
Pablo Andujar411965.85320.410.290.510.12611.0146.069.011.018.034.03340.3076923076923080.30769230769230812.9286957365467
Adrian Mannarino531070.77220.460.460.260.59457.0122.0190.014.015.031.02390.076923076923076913.1826505681797
Matthias Bachinger142133.73850.360.390.370.22316.0159.052.06.017.032.0331511.8645236539685
Lukas Rosol300705.64260.440.390.520.4328.017.089.010.018.034.03340.07142857142857140.07142857142857140.1428571428571430212.613887125036198
Jeremy Chardy577229.53250.50.480.530.48303.069.0117.08.018.033.03320.06666666666666670.066666666666666713.2659952653493
Matteo Viola123883.81180.240.330.010.01142.0154.0402.09.016.032.0331011.727099308463302
Marc Fornell Mestres2360.010.0194.0178.066.038.0311
Pablo Cuevas530023.94190.530.410.60.41480.0124.0117.012.018.034.03240.3750.37513.1806774543195
Potito Starace315379.17270.460.310.530.0884.0200.033.06.019.038.033412.6615309082103
Victor Hanescu306932.21260.450.340.540.41265.040.0102.010.017.038.03240.07142857142857140.071428571428571412.634382187854
Ivo Klec68838.171840.220.20.25282.0309.0264.07.018.039.033111.1395136665904
Leonardo Mayer499417.46210.480.430.530.46453.067.064.012.015.032.03240.1538461538461540.15384615384615413.121197618171001
Carlos Salamanca55554.831370.380.010.75197.048.0446.09.018.037.023110.9251257399828
Donald Young308544.0380.40.420.220.3859.0394.038.08.014.030.023612.639619737765301
Alejandro Falla193833.19500.40.390.430.39291.0148.0104.012.016.036.023212.1747532228056

Handling missing values

Imputation of missing values

 
#clear the R environment
rm(list = ls())
#load the MissForest R library to impute the dataset
library(missForest)
#read data for imputation 
cleaned_data = read.csv("cleaned_raw_data.csv")
#remove some unused variables: keep columns 2-6 and 8-10 (c() is needed to combine the two index ranges)
mydata = cleaned_data[, c(2:6, 8:10)]
#log and logit transformations when needed
mydata[,c(1:2)] = log(mydata[,1:2])
mydata[,c(3:5)] =log(mydata[,c(3:5)]/(1-mydata[,c(3:5)]))
#impute missing values
mydata.imp <- missForest(mydata)$ximp

To impute the remaining missing values in the dataset, we used the missForest R package (see “MissForest: non-parametric missing value imputation for mixed-type data”). It is a machine learning method based on the Random Forest algorithm, which can be briefly described as an ensemble of “decision trees” whose individual predictions are combined.

Random Forest can be explained with the following simple example: suppose we have a sample of people whose gender is unknown to us (missing value) and that we wish to predict it from their height and weight, which we already have. We construct a “decision tree” with two “nodes” for this simple working hypothesis. The first node asks a question about the height of each person: if the height is greater than 1.80 m, the person is classified as “man”. If the person’s height is 1.80 m or less, we move on to the second node, which asks a question about the person’s weight: if the person weighs less than 70 kg, they are classified as “woman”; otherwise, they are classified as “man”. In more complicated examples, the number of nodes increases and a large number of decision trees are combined to make the desired prediction.
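
The toy example above can be written down as a few lines of Python. This is only a sketch of the idea, not of the missForest package: the branch for people of 70 kg or more, the two extra trees and their thresholds are hypothetical, and a real Random Forest learns its thresholds from data instead of having them hard-coded.

```python
from collections import Counter
from typing import Callable, List

Tree = Callable[[float, float], str]


def make_tree(height_cut: float, weight_cut: float) -> Tree:
    """Build a two-node decision tree with the given thresholds."""
    def tree(height_m: float, weight_kg: float) -> str:
        # Node 1: a question about height.
        if height_m > height_cut:
            return "man"
        # Node 2: a question about weight.
        if weight_kg < weight_cut:
            return "woman"
        # Assumption: the remaining branch is classified as "man".
        return "man"
    return tree


# The first tree uses the thresholds from the text; the other two are hypothetical.
forest: List[Tree] = [make_tree(1.80, 70), make_tree(1.75, 65), make_tree(1.85, 75)]


def forest_predict(height_m: float, weight_kg: float) -> str:
    """Combine the trees by majority vote."""
    votes = Counter(tree(height_m, weight_kg) for tree in forest)
    return votes.most_common(1)[0][0]
```

For a person who is 1.78 m tall and weighs 72 kg, two of the three trees vote “man” and one votes “woman”, so the forest answers “man”.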

In the case of missing values, the algorithm is trained on the available data and produces estimates that replace the missing values. The dataset on which the statistical analysis for this story is based includes a common list of characteristics for each player (ranking positions, prize money, “backhand”, favorite surface, age turned professional, etc.). Each characteristic is a variable, i.e. a column in the dataset.

Let us assume we lack data on a player’s prize money. We have:

  • missing values for prize money received by player a
  • available values for other characteristics of player a
  • available values for some of the characteristics on the list, in the case of other players
  • missing values for other characteristics on the list, in the case of other players

We sort the players’ characteristics (variables) in ascending order of the number of missing values each one contains, starting with the most complete variable and finishing with the most incomplete one.

Based on this ordering, the missing values of each characteristic (variable) are imputed for each player x using the Random Forest algorithm, which draws on: a) the available values of player x’s other characteristics and b) the values that other players have for the characteristic that player x is missing. The same process is followed for each variable containing missing values and is repeated several times, until specific performance criteria are met.
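
The iterative procedure can be sketched in Python as follows. This is a deliberately simplified illustration and not the missForest implementation: a nearest-neighbour lookup stands in for the Random Forest predictor, the loop stops when the imputed values no longer change, and the function name and toy data are hypothetical.

```python
from typing import List, Optional


def impute(table: List[List[Optional[float]]], max_iter: int = 10) -> List[List[float]]:
    """Iteratively fill in missing values (None), one variable at a time."""
    n_rows, n_cols = len(table), len(table[0])
    missing = [[v is None for v in row] for row in table]

    # Initial guess: the mean of the observed values in each column.
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in table if row[j] is not None]
        col_means.append(sum(observed) / len(observed))
    filled = [
        [col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in table
    ]

    # Visit the variables from the fewest to the most missing values.
    order = sorted(range(n_cols), key=lambda j: sum(m[j] for m in missing))

    for _ in range(max_iter):
        previous = [row[:] for row in filled]
        for j in order:
            donors = [i for i in range(n_rows) if not missing[i][j]]
            for i in range(n_rows):
                if not missing[i][j]:
                    continue

                def distance(d: int) -> float:
                    # Squared distance between rows i and d on all other variables.
                    return sum(
                        (filled[i][k] - filled[d][k]) ** 2
                        for k in range(n_cols) if k != j
                    )

                # Stand-in predictor (assumption): copy the value of the most
                # similar player who has this variable observed.
                nearest = min(donors, key=distance)
                filled[i][j] = filled[nearest][j]
        if filled == previous:  # simplified stopping rule
            break
    return filled


# Hypothetical toy data: 4 players, 3 variables, 2 missing values.
toy = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, None],
    [1.1, None, 110.0],
]
completed = impute(toy)
```

Observed values are never overwritten; only the two gaps are filled, each with the value of the player who looks most similar on the remaining variables.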

Finding similarities between players

Clustering players by similarity and selecting Tsitsipas’ cluster

 
#clear the R environment
rm(list = ls())
#load R libraries
library(missForest)
library(KRLS)#to compute RBF
library(irlba) #to compute partial eigen decomposition
library(spam)#to facilitate linear algebra computations
library(spam64)#to facilitate linear algebra computations

#read and clean the data
cleaned_data = read.csv("cleaned_raw_data.csv") 
names = cleaned_data$names
#keep columns 2-6 and 8-10 (c() is needed to combine the two index ranges)
mydata = cleaned_data[, c(2:6, 8:10)]

#log and logit transformations
mydata[,c(1:2)] = log(mydata[,1:2])
mydata[,c(3:5)] =log(mydata[,c(3:5)]/(1-mydata[,c(3:5)]))

#impute missing values
mydata.imp <- missForest(mydata)$ximp

#compute similarity matrix
mysimils = gausskernel(mydata.imp, sigma=1)
#introduce sparsity to facilitate linear algebra computations by setting low similarities equal to 0
mysimils[which(mysimils<=10^(-5))]=0 

#compute regularized Laplacian of similarity matrix
diag(mysimils) =0
N= dim(mysimils)[1]
DD = rep(NA,N)
for(i in 1:N) DD[i] = sum(mysimils[i,])

DD = DD + mean(DD)
myd =diag.spam(1/sqrt(DD))
tSimils =myd%*%mysimils%*%myd

#compute partial spectral decomposition of Laplacian
K=150
myeigen = partial_eigen(as.dgCMatrix.spam(tSimils), n =K, symmetric = TRUE)
U = myeigen$vectors

#normalize eigenvectors to have unit length
scalar1 <- function(x) {x / sqrt(sum(x^2))}
Ustar = matrix(NA,dim(U)[1],dim(U)[2])
for(i in 1:dim(U)[1]){
  Ustar[i,] = scalar1(U[i,])
}

#run k-means
km = kmeans(Ustar,centers = 150,iter.max = 1000)

#find Tsitsipas cluster
size_clsts = km$size
clsts = which(size_clsts>=2)
same_players =list()
tsitsipas_clst=rep(0,length(clsts))
for(i in 1:length(clsts)){
  same_players[[i]] =  which( km$cluster==clsts[i] ) 
  if(length(  intersect(same_players[[i]],which(names=="Stefanos Tsitsipas")))>0)
  {
    tsitsipas_clst[i] = 1
  }
}
select_tsitsip_clst = same_players[[which(tsitsipas_clst>0)]]
select_tsitsip_clst=select_tsitsip_clst[-5] #exclude the fifth member of the cluster: he appears to be an outlier, because two injuries interrupted his career before he resumed it
tsitsipas_clst_data = cleaned_data[select_tsitsip_clst,]
tsitsipas_clst_names = names[select_tsitsip_clst]
The whole script, predictions included, is available on GitHub.

After the missing values were replaced, the dataset of 1,602 rows (one row per player) and 20 columns (one column per player characteristic) was complete. For each possible pair of players, a similarity index was calculated using the “radial basis function” (RBF kernel). This similarity index is a positive number that tells us how similar a player is to any other athlete in the dataset.

Subsequently, a “similarity matrix” is created, i.e. a matrix that holds the similarity index for every pair of players. It is a 1,602 x 1,602 table whose rows and columns both correspond to the study’s players, so that each cell contains the similarity index between the player of its row and the player of its column.
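
As a rough illustration of how such a matrix can be built, the Python sketch below computes an RBF similarity for every pair of rows and zeroes out very small values, as the R script does. The exact form of the kernel, exp(-squared distance / sigma), is our assumption about the bandwidth convention, and the function names and toy data are hypothetical.

```python
import math
from typing import List, Sequence


def rbf_similarity(x: Sequence[float], y: Sequence[float], sigma: float = 1.0) -> float:
    """RBF similarity: 1 for identical players, approaching 0 as they differ more."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / sigma)


def similarity_matrix(rows: List[Sequence[float]], sigma: float = 1.0,
                      cutoff: float = 1e-5) -> List[List[float]]:
    """Symmetric matrix of pairwise similarities; values up to `cutoff` become 0."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = rbf_similarity(rows[i], rows[j], sigma)
            if value <= cutoff:
                value = 0.0
            matrix[i][j] = matrix[j][i] = value
    return matrix


# Hypothetical toy data: two near-identical players and one very different player.
players = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
S = similarity_matrix(players)
```

The first two players receive a similarity close to 1, while the third is so far from both that his similarity to them is rounded down to 0.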

Next, to categorize players who are similar to each other into groups, we adopt cluster analysis and, in particular, the technique of spectral clustering. The similarity matrix is quite large (1,602 x 1,602), and spectral clustering lets us work with a matrix of smaller dimensions that still contains as much information as possible from the original similarity matrix.

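We follow these steps:

  • we set very low similarities (up to 0.00001) to zero and remove each player’s similarity with himself, so that the similarity matrix becomes sparse
  • we compute a regularized, normalized version (Laplacian) of the similarity matrix
  • we compute the 150 leading eigenvectors of this matrix, which yields a much smaller matrix of 1,602 rows and 150 columns
  • we rescale each row of this smaller matrix to unit length
  • we run the k-means algorithm on these rows, asking for 150 clusters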

We end up with 150 clusters of similar players and we choose the one Stefanos Tsitsipas belongs to, in order to work on predictions for the player’s ranks.
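
Two of the steps in the R script above, the regularized normalization of the similarity matrix and the rescaling of rows to unit length, can be sketched in plain Python. The eigendecomposition and the k-means step are left out, since they need a linear algebra library; the function names and the toy matrix are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalized_affinity(S: Matrix) -> Matrix:
    """Compute D^(-1/2) A D^(-1/2), where A is S with a zero diagonal and
    D holds each row sum of A plus the mean row sum (the regularization)."""
    n = len(S)
    A = [[0.0 if i == j else S[i][j] for j in range(n)] for i in range(n)]
    degrees = [sum(row) for row in A]
    tau = sum(degrees) / n  # regularization term: the mean degree
    scale = [1.0 / math.sqrt(d + tau) for d in degrees]
    return [[scale[i] * A[i][j] * scale[j] for j in range(n)] for i in range(n)]


def unit_rows(U: Matrix) -> Matrix:
    """Rescale every row to Euclidean length 1."""
    result = []
    for row in U:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm for v in row])
    return result


# Hypothetical 3-player similarity matrix.
S = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
L = normalized_affinity(S)
rows = unit_rows([[3.0, 4.0]])
```

Adding the mean degree to every player’s degree keeps players with very few similar peers from dominating the normalized matrix, which is the usual motivation for this kind of regularization.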

In fact, the cluster to which Tsitsipas belongs (his own professional career spans five years so far) includes players who have had at least 15 years of experience. This is crucial for the prediction method, precisely because our prediction technique is based on pre-existing career paths of similar players.

Prediction method

Considering that the research question is “what world ranking positions Stefanos Tsitsipas might occupy in the future” and given that he has been a professional player for the last five years, we created a prediction model over a ten-year horizon by applying the statistical technique of the sample mean.

We predict the player’s performance over the next ten years, based on the average performance of similar players, who, however, have longer career spans.

In other words, we use the actual ranking data of the players Tsitsipas is most comparable with, based on their performance after ten and 15 years of their careers, in order to create estimates for Tsitsipas’ performance over the next decade.
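
In its simplest form, the sample-mean prediction can be sketched as follows. The peer names and ranks are invented for illustration and are not taken from the dataset; the function name is also hypothetical.

```python
from statistics import mean
from typing import Dict, List


def predict_ranks(peers: Dict[str, List[int]], seasons_played: int,
                  horizon: int) -> List[float]:
    """Predict the rank in each future career season as the mean rank that
    similar players held in that same season of their own careers."""
    forecast = []
    for season in range(seasons_played, seasons_played + horizon):
        ranks = [history[season] for history in peers.values()
                 if season < len(history)]
        forecast.append(mean(ranks))
    return forecast


# Invented year-end ranks, indexed by career season (0 = first season).
peers = {
    "peer_a": [300, 120, 60, 30, 20, 10, 8],
    "peer_b": [400, 200, 90, 50, 30, 20, 12],
    "peer_c": [250, 100, 40, 25, 10, 6, 4],
}
forecast = predict_ranks(peers, seasons_played=5, horizon=2)
```

For a player who has completed five seasons, the forecast for his sixth and seventh seasons is the average of what the three peers achieved in their own sixth and seventh seasons, here ranks 12 and 8.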

To keep the sample mean (that is, our prediction) from being affected by outliers, and even though the Tsitsipas cluster consists of only ten members (himself and nine more players), we look for sub-groups of similar athletes within the cluster to which Stefanos Tsitsipas was assigned. By computing the Euclidean distance of each player from the centroid of the cluster (as calculated by the k-means algorithm itself), we find:

  • sub-group of players a, which is located farther from the centroid of the cluster – these are players who tend to rank slightly better compared to the average positions held by the players of this category during each season. Stefanos Tsitsipas also appears in sub-group a.  
  • sub-group of players b, which is close to the centroid of the cluster – that is, players whose world ranking positions are close to the average positions held by all players of the category in a given season
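
The split into sub-groups can be illustrated with a short Python sketch. The article does not state which distance threshold separates the two sub-groups, so the median distance used here is an assumption, as are the function names and the toy coordinates.

```python
import math
from statistics import median
from typing import Dict, List, Sequence, Tuple


def centroid(points: List[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of the cluster members."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def split_by_distance(members: Dict[str, Sequence[float]]) -> Tuple[List[str], List[str]]:
    """Return (far, near): members farther from the centroid than the
    median distance, and the remaining members."""
    center = centroid(list(members.values()))
    distance = {name: math.dist(point, center) for name, point in members.items()}
    cut = median(distance.values())  # assumption: the median is the threshold
    far = sorted(name for name, d in distance.items() if d > cut)
    near = sorted(name for name, d in distance.items() if d <= cut)
    return far, near


# Hypothetical coordinates of four cluster members.
members = {
    "p1": (0.0, 0.0),
    "p2": (0.2, 0.0),
    "p3": (0.0, 0.2),
    "p4": (4.0, 4.0),
}
far, near = split_by_distance(members)
```

Here `far` plays the role of sub-group a and `near` that of sub-group b. Note that a single distant member (p4) pulls the centroid towards itself, which changes who counts as “far”; this is one reason why the choice of threshold matters.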

In this context, players belonging to the Tsitsipas cluster are selected from these two different sub-groups for test applications of the prediction model: by using data on the first five years of their career, we make “pseudo predictions” for their career path over the next years, which we compare to the actual data on their world ranking. In this way, the margin of error of our predictions emerges.
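
A test of this kind can be sketched in Python as follows: one player is held out, his later seasons are “predicted” from the other players, and the predictions are compared with what actually happened. The data are invented, and the mean absolute error used here is only one possible way to summarize the margin of error; the article does not say which measure was used.

```python
from statistics import mean
from typing import Dict, List, Tuple


def backtest(peers: Dict[str, List[int]], target: str,
             known_seasons: int) -> Tuple[List[float], float]:
    """Pseudo-predict the target's seasons after `known_seasons` from the
    other players, and return the predictions with their mean absolute error."""
    others = [history for name, history in peers.items() if name != target]
    actual = peers[target][known_seasons:]
    predicted = [
        mean(history[season] for history in others)
        for season in range(known_seasons, known_seasons + len(actual))
    ]
    error = mean(abs(p - a) for p, a in zip(predicted, actual))
    return predicted, error


# Invented year-end ranks, indexed by career season (0 = first season).
peers = {
    "peer_a": [300, 120, 60, 30, 20, 10, 8],
    "peer_b": [400, 200, 90, 50, 30, 20, 12],
    "peer_c": [250, 100, 40, 25, 10, 6, 4],
}
predicted, error = backtest(peers, target="peer_b", known_seasons=5)
```

In this toy case the pseudo predictions for the held-out player are ranks 8 and 6, while he actually finished 20th and 12th, an average miss of 9 positions.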

  • <iframe src="https://lab.imedd.org/iframes/TSITSIPAS/tsitsipas_cluster_true_ranks_en/index.html" height="250" width="100%" allow="fullscreen"></iframe>
The evolution of ranking for four selected players “similar” to Tsitsipas over the first five seasons of their career.

More specifically, for the purposes of testing the model, the following are selected:

  • Alexander Zverev, who, despite having only turned professional seven years ago, belongs to sub-group a (like Stefanos Tsitsipas). He also has a very similar trajectory to Tsitsipas and is one of his major rivals today.
  • Stan Wawrinka, David Ferrer and Tomas Berdych, who, on the one hand, have more than 15 years of professional experience and, on the other, belong to sub-group b.

For Zverev (sub-group a), the test is applied using a sample mean computed only from the positions held by the cluster’s four top-ranked players in a given season. For Wawrinka, Ferrer and Berdych (sub-group b), the “pseudo predictions” are made by calculating the sample mean over all of the cluster’s players in each season.

As shown in the graph below, the actual ranking of players is usually within our model’s prediction intervals.

  • <iframe src="https://lab.imedd.org/iframes/TSITSIPAS/tsitsipas_cluster_predictions_en/index.html" height="600" width="100%" allow="fullscreen"></iframe>
Testing our prediction model on four selected players from Tsitsipas’ cluster

For the predictions of Stefanos Tsitsipas’ world ranking over the next decade, we chose the prediction method that was applied in the very similar case of Zverev: when calculating the sample mean, players ranked below the cluster median are excluded, so that the model takes into account only the positions held by the top-performing players in each season.
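
The difference between the two ways of averaging can be shown with a small Python sketch. The ranks are invented, and keeping a player who sits exactly on the median is an assumption on our part.

```python
from statistics import mean, median
from typing import List


def top_half_mean(ranks: List[int]) -> float:
    """Mean rank of the players ranked at or above the cluster median
    (a lower number means a better rank)."""
    cut = median(ranks)
    return mean(rank for rank in ranks if rank <= cut)


# Invented ranks of five cluster members in one season.
season_ranks = [5, 10, 15, 30, 60]
optimistic = top_half_mean(season_ranks)
plain = mean(season_ranks)
```

With these numbers the plain sample mean predicts rank 24, while the mean over the top performers predicts rank 10, which shows how strongly the choice of method shifts the forecast towards the top of the rankings.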

Taking a closer look at Zverev’s case and the extent to which the chosen method of calculating the predictions affects the estimated trajectory of Tsitsipas, it is worth noting the following:

Although Zverev finished his sixth professional year ranked No. 4, the model had predicted that he would finish 11th (mean prediction) in that season, which is the first season of prediction. In the following seasons, by contrast, the prediction model is “aligned” with Zverev’s actual achievements. The “failure” of the first prediction can be explained by the fact that Zverev’s trajectory becomes similar to that of the other players in the group only after the third season of his career. The rapid ascent in the world rankings that characterized both Zverev and Tsitsipas in the first years of their careers is not observed among the otherwise similar athletes whose data “feeds” the model and affects the estimated value for the first prediction season.

It should be noted that a multivariate modelling strategy would be useful for making even more realistic predictions, which is why we suggest that this strategy be applied in future studies: in tennis, an athlete’s trajectory depends significantly on his opponents’ progress. In the present model, the working hypothesis is that a player (p) will follow a trajectory similar to that of his peers (n), but it is not assumed that his opponents will perform similarly to the opponents of his peers (n) considered here.

Data visualization

Charts that visualize statistical data analysis were created using the Python programming language and the statistical programming language R.

More specifically, the Plotly Python Open Source Graphing Library was used to create interactive charts, while static graphs were created using the ggplot2 package in R.

  • <iframe src="https://lab.imedd.org/iframes/TSITSIPAS/main_radar_plot_en/index.html" height="500" width="100%" allow="fullscreen"></iframe>
Radar plot representing Tsitsipas’ cluster and similarity between players. The more similar the shapes are, the more similar the players are.

We especially chose to visualize the Tsitsipas cluster and the similarities between the cluster’s players using a radar plot: radar plots are two dimensional charts suitable for the comparative visualization of multiple (quantitative) variables. Especially when more than one case study is visualized, it is the shape of the radar plot that serves the purpose of the visual comparison. In this context, the more similar the players’ shapes are, the more similar the players themselves are to each other. However, radar plots can become too complex and illegible when they include too many overlapping shapes – as in our case, where we needed to visualize data of at least ten athletes (Tsitsipas cluster). For this reason, the interactive design was chosen, in order for users to be able to deselect players or select specific players for comparison.

Chart: prize vs seasons
Hexbin plot showing prize money versus the number of seasons played

To visualize the distribution of the 1,602 players in the dataset based on, for example, their winnings in relation to seasons played, the “hexbin plot” was selected. The hexbin plot is essentially a different approach to the scatterplot, which is usually chosen to visualize the relationship between two quantitative variables. When there are so many data points that overlapping markers hide part of the information, the “hexbin plot” is appropriate, as the data is grouped into the cells of a hexagonal grid: in this case, the darker the blue color of each hexbin, the more players are concentrated at that point.


This is the result of a collaboration between iMEdD Lab and the AUEB Sports Analytics Group research team, which aims to promote robust quantitative analysis in sports, on both an academic and a professional level. The team works in the field of “Sports Analytics”, which includes the creation of statistical models and the production of predictions regarding sports results, draws on sports economics, and uses such tools as performance analysis, and visualization and measurement of competitive balance.




Translation: Anatoli Stavroulopoulou

Creative Commons Non Commercial International license logo