Open Data

Performance and characteristics of 1,602 tennis players

Data on 1,602 tennis players regarding their demographic and technical characteristics, performance and prize money earnings over the last 20 years.

The following dataset was created for the story “What statistics can tell us about the career of Stefanos Tsitsipas”. The aim was to analyze the data with simple machine learning and statistical techniques in order to predict Stefanos Tsitsipas’ world ranking over the next decade.

Primary data was collected in March 2020 from the Ultimate Tennis Statistics website, which is licensed under a Creative Commons license (CC BY-NC-SA 4.0) and is based on open source software available on GitHub.

More specifically, we collected annual data for the period from 2000 to March 2020 from the Association of Tennis Professionals (ATP) world rankings, as published on the Ultimate Tennis Statistics website. We then collected the profile data published on the same website for the 3,912 tennis players included in these rankings.

For example, as Stefanos Tsitsipas’ stats profile on the website shows, each athlete’s “profile data” includes, among other things, the player’s age, the year he turned professional, the seasons he has played, his backhand, his favorite surface, the prize money he has earned, the titles he has won, the best rank he has achieved, his current rank and his Elo rank.

The original dataset, created by combining the above elements, can be described as a matrix with n rows and p columns: each row is one athlete and each column is one variable (characteristic). After computing the proportion of missing values in each row and each column, we found that in several cases missing values accounted for more than 50% of a row’s or column’s contents. We deleted those rows and columns, since they would contribute negligible information.
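The cleaning rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the team's actual code: the function names, the toy rows and the `carpet_pct` column are hypothetical, missing values are represented as `None`, and both proportions are measured on the original matrix before anything is deleted (the text does not say in which order rows and columns were removed).

```python
from typing import Dict, Iterable, List, Optional

Row = Dict[str, Optional[float]]


def missing_fraction(values: Iterable[Optional[float]]) -> float:
    """Share of entries that are missing (None)."""
    items = list(values)
    if not items:
        return 0.0
    return sum(v is None for v in items) / len(items)


def drop_sparse(rows: List[Row], threshold: float = 0.5) -> List[Row]:
    """Drop every row and column whose missing share exceeds the threshold."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    # Assumption: both shares are measured on the original matrix.
    kept_columns = [
        c for c in columns
        if missing_fraction(r[c] for r in rows) <= threshold
    ]
    return [
        {c: r[c] for c in kept_columns}
        for r in rows
        if missing_fraction(r.values()) <= threshold
    ]


# Toy input, not taken from the dataset; "carpet_pct" is a made-up column.
toy_rows: List[Row] = [
    {"prize_money": 100.0, "best_rank": 12, "carpet_pct": None, "age": 30},
    {"prize_money": 50.0, "best_rank": 37, "carpet_pct": None, "age": None},
    {"prize_money": None, "best_rank": None, "carpet_pct": None, "age": 25},
    {"prize_money": 10.0, "best_rank": 80, "carpet_pct": 0.4, "age": 28},
]

cleaned = drop_sparse(toy_rows)
print(len(cleaned), sorted(cleaned[0]))
```

In the toy input, the third row and the `carpet_pct` column are each 75% empty, so both are removed. The second row is exactly 50% empty and is kept, because the rule only removes rows and columns where more than 50% is missing.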

The result of the data cleaning process is the dataset listed below, on which our analysis of the clustering of players was based.

[Data table: one row per player, with the columns explained below. The rows shown cover the following players: Feliciano Lopez, Nicolas Mahut, Tommy Robredo, Paolo Lorenzi, Ivo Karlovic, Roger Federer, Guillermo Garcia Lopez, Jo Wilfried Tsonga, Fernando Verdasco, Andreas Seppi, Philipp Kohlschreiber, Teymuraz Gabashvili, Rafael Nadal, Jurgen Melzer, Dustin Brown, Stan Wawrinka, Richard Gasquet, David Ferrer, Go Soeda, Carlos Berlocq, Marcel Granollers, Gilles Muller, Dudi Sela, Daniel Gimeno Traver, Julien Benneteau, Novak Djokovic, Yen Hsun Lu, Viktor Troicki, Florian Mayer, Rogerio Dutra Silva, Janko Tipsarevic, Ruben Ramirez Hidalgo, Mischa Zverev, Mikhail Youzhny, Gael Monfils, Andy Murray, Marco Chiudinelli, Marcos Baghdatis, Gilles Simon, Sergiy Stakhovsky, Simone Bolelli, Stephane Robert, Radek Stepanek, Jaroslav Pospisil, Lukasz Kubot, Jan Mertl, Daniel Munoz De La Nava, Frank Dancevic, Lamine Ouahab, Giovanni Lapentti, Fabio Fognini, Flavio Cipolla, Filippo Volandri, Santiago Giraldo, Adrian Menendez Maceiras, Jan Hernych, Maximo Gonzalez, Michal Przysiezny, Albert Montanes, Nicolas Almagro, Konstantin Kravchuk, Lleyton Hewitt, Tomas Berdych, Igor Sijsling, Teodor Dacian Craciun, Denis Istomin, Steve Darcis, Kevin Anderson, Dmitry Tursunov, Michael Berrer, Tobias Kamke, Paul Henri Mathieu, Frederico Gil, Malek Jaziri, Ernests Gulbis, Sam Querrey, Blaz Kavcic, Toshihide Matsui, Juan Monaco, Robin Haase, Michael Russell, Jimmy Wang, Lukas Lacko, Tommy Haas, Pablo Andujar, Adrian Mannarino, Matthias Bachinger, Lukas Rosol, Jeremy Chardy, Matteo Viola, Marc Fornell Mestres, Pablo Cuevas, Potito Starace, Victor Hanescu, Ivo Klec, Leonardo Mayer, Carlos Salamanca, Donald Young, Alejandro Falla.]

Dataset columns explained

  • names: player’s name
  • prize_money: total prize money earnings ($)
  • best_rank: best position held in the world rankings
  • overall_surface_pct: overall success rate (all courts included)
  • hard_surface_pct: success rate on hard courts
  • clay_surface_pct: success rate on clay courts
  • grass_surface_pct: success rate on grass courts
  • ranksDiff1: first rapid ascent in the world rankings (difference in places held between first and second season)
  • ranksDiff2: second rapid ascent in the world rankings (difference in places held between third and second season)
  • ranksDiff3: third rapid ascent in the world rankings (difference in places held between fourth and third season)
  • best_rank_std: best position in the world rankings adjusted (according to years of professional activity)
  • age_turned_pro: age the player turned professional
  • age: player’s age
  • plays¹: whether the player is left-handed or right-handed
  • backhand²: one-handed vs two-handed backhand players
  • favorite_surface³: player’s favorite court surface
  • hard_titles_std: titles on hard courts, adjusted according to years of experience
  • clay_titles_std: titles on clay courts, adjusted according to years of experience
  • titles_std: titles the player has won in total, adjusted according to years of professional activity
  • prize_money_std: prize money the player has received, adjusted according to years of experience
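As a rough illustration of the derived columns above, the sketch below computes the three `ranksDiff` values from a player's ranking position in his first four seasons, and a per-year adjustment for the `_std` columns. Several things here are assumptions rather than the team's documented method: the helper names are hypothetical, a positive difference is taken to mean an ascent (the earlier position minus the later one), and "adjusted according to years" is read as a simple division by the number of years, which the text does not confirm. The numbers are toy values, not taken from the dataset.

```python
from typing import List, Sequence


def ranks_diffs(season_ranks: Sequence[int]) -> List[int]:
    """Places gained between consecutive seasons, for seasons 1 to 4.

    Assumption: a positive value means the player moved up the rankings.
    """
    if len(season_ranks) < 4:
        raise ValueError("need ranking positions for at least four seasons")
    return [season_ranks[i] - season_ranks[i + 1] for i in range(3)]


def per_year(total: float, years_active: float) -> float:
    """Assumed form of the '_std' adjustment: total divided by years."""
    if years_active <= 0:
        raise ValueError("years_active must be positive")
    return total / years_active


# Toy values for a hypothetical player.
toy_ranks = [600, 300, 150, 100]      # ranking position in seasons 1 to 4
toy_diffs = ranks_diffs(toy_ranks)    # ranksDiff1, ranksDiff2, ranksDiff3
toy_titles_std = per_year(10, 5)      # 10 titles over 5 years
print(toy_diffs, toy_titles_std)
```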

¹ Decoding: 1=Unknown, 2=Left-handed, 3=Right-Handed

² Decoding: 1=Unknown, 2=One-Handed, 3=Two-Handed

³ Decoding: 1=Unknown, 2=All-Rounder, 3=Carpet, 4=Clay, 5=Fast (H, G, Cp), 6=Fast (H, G), 7=Fastest (G, Cp), 8=Firm (H, Cp), 9=Grass, 10=Hard, 11=Non-Carpet, 12=Non-Grass, 13=Non-Hard, 14=None, 15=Slow (H, Cl), 16=Soft (Cl, G)
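The three decoding tables above can be written as lookup dictionaries, so that the coded columns can be turned back into labels. This is a minimal sketch: the codes and labels are copied from the footnotes, while the column names `plays`, `backhand` and `favorite_surface` assume that the trailing digits in the column list are footnote markers. The `decode_row` helper and the sample row are hypothetical, and mapping unrecognized codes to "Unknown" is a choice made here.

```python
from typing import Any, Dict

PLAYS = {1: "Unknown", 2: "Left-handed", 3: "Right-Handed"}
BACKHAND = {1: "Unknown", 2: "One-Handed", 3: "Two-Handed"}
FAVORITE_SURFACE = {
    1: "Unknown", 2: "All-Rounder", 3: "Carpet", 4: "Clay",
    5: "Fast (H, G, Cp)", 6: "Fast (H, G)", 7: "Fastest (G, Cp)",
    8: "Firm (H, Cp)", 9: "Grass", 10: "Hard", 11: "Non-Carpet",
    12: "Non-Grass", 13: "Non-Hard", 14: "None", 15: "Slow (H, Cl)",
    16: "Soft (Cl, G)",
}

# Column name -> code table (column names are an assumption, see above).
CODEBOOK: Dict[str, Dict[int, str]] = {
    "plays": PLAYS,
    "backhand": BACKHAND,
    "favorite_surface": FAVORITE_SURFACE,
}


def decode_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with coded columns replaced by labels."""
    decoded = dict(row)
    for column, codes in CODEBOOK.items():
        value = decoded.get(column)
        if value is None:
            continue  # leave missing values untouched
        # int() also accepts codes stored as floats such as 3.0;
        # unrecognized codes fall back to "Unknown" (a choice made here).
        decoded[column] = codes.get(int(value), "Unknown")
    return decoded


# Hypothetical row, not taken from the dataset.
sample = {"names": "Example Player", "plays": 3.0, "backhand": 2,
          "favorite_surface": 4}
decoded_sample = decode_row(sample)
print(decoded_sample)
```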

This dataset is the result of a collaboration between iMEdD Lab and the AUEB Sports Analytics Group research team, which aims to promote robust quantitative analysis in sports at both an academic and a professional level. The team works in the field of “Sports Analytics”, which includes building statistical models and producing predictions of sports results, draws on sports economics, and uses tools such as performance analysis, visualization and the measurement of competitive balance.

Translation: Anatoli Stavroulopoulou

License: Creative Commons Non Commercial International