Premier League U23 Squad of the Season predictions

I guess we are all sad to be stuck at home without any sport to watch, but I am back with another Machine Learning application! Today we are going to England and its famous Premier League!

I have tried to pick the U23 players who could be part of the Team of the Year. The Premier League is not the most popular league for youngsters, but trust me, there are some really talented boys.

As there is no official Youth Team of the Year, I have built my own strategy: create a model to predict the PFA Team of the Year (the official one) and then use it to find the youngsters with the highest probability of being in that team. The only constraints are to be 23 years old or younger and to have played at least 1000 minutes this season.
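The two constraints can be written as a small filter. This is only a sketch: the `Player` class, its field names and the four players are made up for illustration and are not taken from the real dataset.

```python
from dataclasses import dataclass

MAX_AGE = 23        # "23 years old or younger"
MIN_MINUTES = 1000  # "at least 1000 minutes this season"


@dataclass
class Player:
    # Hypothetical record layout, not the schema of the real data.
    name: str
    age: int
    minutes: int


def is_eligible(player: Player) -> bool:
    """True if the player meets both the age and the playing-time constraint."""
    return player.age <= MAX_AGE and player.minutes >= MIN_MINUTES


# Made-up players, only to exercise the filter.
players = [
    Player("Player A", age=21, minutes=2400),
    Player("Player B", age=23, minutes=1000),  # exactly on both limits: kept
    Player("Player C", age=24, minutes=3000),  # too old
    Player("Player D", age=19, minutes=450),   # not enough minutes
]

eligible = [p.name for p in players if is_eligible(p)]
print(eligible)
```

Both limits are inclusive here, which matches "less (or equal) than 23" and "at least 1000 minutes".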

The Data

The data comes from several websites, which gives a wide range of statistics for each player.

  • Shooting stats: xG, shots on target etc.
  • Passing stats: assists, pass %, key passes etc.
  • Defense stats: interceptions, clearances etc.
  • Goalkeeper stats: saves, clean sheets etc.
  • Target: nominated or not in the PFA TotY

In the end we have about a hundred features for each player (fewer for goalkeepers, of course).

We observe the same problem I explained in my last article: each season there are far more non-nominated players than nominated ones. Therefore we need to use an over-sampling technique. I decided to choose a technique other than SMOTE, because one of its weaknesses is that it does not take into account neighbours from the other class, so it may create synthetic samples that are not homogeneous.

I went for Adaptive Synthetic Sampling (ADASYN), where more synthetic samples are generated for minority class samples that are harder to learn. In other words, to simplify, it generates more or fewer samples depending on the neighbours of a selected point, with K, the number of neighbours, as a hyperparameter. This means it will create more samples for isolated samples of the minority class.

Below is an example with two cases, each with 3 neighbours.

As we can see, in case 1 the selected sample S has no neighbours from its own class (r = 1), so ADASYN will generate more samples around it. In case 2 it has 1 neighbour from its own class, so it is easier to learn.
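To make the ratio r concrete, here is a minimal pure-Python sketch of the weighting step only, not a full ADASYN implementation. It assumes r is the share of the K nearest neighbours that belong to the other class, as in the two cases above. The toy points, the function names and the total of 7 synthetic samples are arbitrary choices for illustration; in ADASYN itself the total is derived from the class imbalance.

```python
import math


def adasyn_ratios(X, y, k=3, minority=1):
    """Return {index: r} for every minority sample, where r is the share of
    its k nearest neighbours that belong to the other class."""
    ratios = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # Distances to every other sample, closest first.
        others = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in others[:k]]
        ratios[i] = sum(y[j] != minority for j in neighbours) / k
    return ratios


def samples_to_generate(ratios, total):
    """Share `total` synthetic samples between minority points in proportion to r."""
    norm = sum(ratios.values())
    if norm == 0:
        return {i: 0 for i in ratios}
    return {i: round(total * r / norm) for i, r in ratios.items()}


# Toy 2D data: label 1 = nominated (minority), label 0 = not nominated.
X = [
    (0.0, 0.0),                            # case 1: surrounded by the other class
    (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),   # other-class points around it
    (5.0, 5.0), (5.1, 5.0),                # case 2: two minority points side by side
    (5.0, 5.1), (4.9, 5.0),                # other-class points nearby
]
y = [1, 0, 0, 0, 1, 1, 0, 0]

ratios = adasyn_ratios(X, y, k=3)
counts = samples_to_generate(ratios, total=7)
print(ratios)
print(counts)
```

The isolated point gets r = 1 and therefore the largest share of the new samples, while the two points that have a same-class neighbour get r = 2/3 each and a smaller share.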

The Model

The model consists of one Random Forest classifier for each role in the squad: Goalkeeper, Defenders, Midfielders and Forwards.

After the model is trained and tested with a 75/25 split, I predict using the statistics of the 2019-2020 season (before the COVID-19 stoppage). The selected players are simply the ones with the highest probability.
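The final selection step can be sketched as follows. The role codes, the `pick_squad` helper and the probabilities are hypothetical, and the default squad shape (1 goalkeeper, 4 defenders, 4 midfielders, 2 forwards) is an assumption rather than something the model decides.

```python
from collections import defaultdict

# Assumed squad shape: 1 goalkeeper, 4 defenders, 4 midfielders, 2 forwards.
SLOTS = {"GK": 1, "DEF": 4, "MID": 4, "FWD": 2}


def pick_squad(predictions, slots=SLOTS):
    """predictions: iterable of (name, role, probability) tuples, where the
    probability would come from the classifier trained for that role.
    Returns {role: [names]} with the highest probabilities first."""
    by_role = defaultdict(list)
    for name, role, proba in predictions:
        by_role[role].append((proba, name))

    squad = {}
    for role, n_slots in slots.items():
        ranked = sorted(by_role[role], reverse=True)  # highest probability first
        squad[role] = [name for _, name in ranked[:n_slots]]
    return squad


# Made-up probabilities and a smaller squad shape, only to show the ranking.
predictions = [
    ("Keeper A", "GK", 0.12),
    ("Keeper B", "GK", 0.30),
    ("Defender A", "DEF", 0.55),
    ("Defender B", "DEF", 0.10),
    ("Defender C", "DEF", 0.40),
    ("Forward A", "FWD", 0.80),
]

squad = pick_squad(predictions, slots={"GK": 1, "DEF": 2, "FWD": 1})
print(squad)
```

Ranking inside each role, rather than across all players, keeps the result a usable eleven even if one role tends to get higher probabilities than the others.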

Here is the U23 Squad of the Season!


Market values according to Transfermarkt

Aaron Ramsdale (Bournemouth): 14 million

Trent Alexander-Arnold (Liverpool): 110 million

Joe Gomez (Liverpool): 42 million

Caglar Söyüncü (Leicester): 40 million

Ben Chilwell (Leicester): 50 million

Ruben Neves (Wolves): 50 million

Mason Mount (Chelsea): 45 million

Youri Tielemans (Leicester): 55 million

James Maddison (Leicester): 60 million

Gabriel Jesus (Manchester City): 70 million

Marcus Rashford (Manchester United): 80 million