I guess we are all sad to be stuck at home without any sports to watch, but I am back with another Machine Learning application! Today we are going to England and its famous Premier League!
I have tried to pick the U23 players who could be part of the Team of the Year. The Premier League is not the most popular league for youngsters, but trust me, there are some really talented boys.
As there is no official Youth Team of the Year, I have built my own strategy: create a model to predict the PFA Team of the Year (the official one), then use it to find the youngsters with the highest probability of being in that team. The only constraints are to be 23 years old or younger and to have played at least 1000 minutes this season.
The data comes from several websites in order to have a wide range of statistics for each player:
- Shooting stats: xG, shots on target, etc.
- Passing stats: assists, pass %, key passes, etc.
- Defense stats: interceptions, clearances, etc.
- Goalkeeper stats: saves, clean sheets, etc.
- Target: nominated or not for the PFA TotY
In the end we have about a hundred features for each player (fewer for goalkeepers, of course).
We observe the same problem I already explained in my last article (http://dunkthedata.com/nba-mvp-roy-all-rookie-and-all-nba-teams-predictions/): each season there are far more non-nominated players than nominated ones. Therefore we need to use an over-sampling technique. I have decided to choose a technique other than SMOTE, because one of its weaknesses is that it does not take into account neighbours from the other class, so the synthetic samples it creates may not be homogeneous.
I went for Adaptive Synthetic Sampling (ADASYN), where more synthetic samples are generated for the minority class samples that are harder to learn. In other words, to simplify, it generates more or fewer samples depending on the neighbours of a selected point, with K, the number of neighbours, as a hyperparameter. This means it creates more samples for minority class samples that are isolated among the majority class.
Below is an example with two cases, using 3 neighbours.
As we can see, in case 1 the selected sample S has no neighbours from its own class (r = 1), so ADASYN will generate more samples around it. In case 2 it has 1 neighbour from its own class (r = 2/3), so it is easier to learn and fewer samples are generated.
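To make the ratio r more concrete, here is a minimal sketch of how it can be computed and turned into a number of synthetic samples per point. It is not the code used for this article: the function names (difficulty_ratio, samples_per_point), the toy 2D points and the total of 7 samples are all made up for illustration, and the interpolation step that actually creates the new samples is left out.

```python
import math

def difficulty_ratio(sample, minority, majority, k=3):
    """Fraction of the k nearest neighbours of `sample` that belong to the majority class."""
    # Tag every other point with 0 (minority) or 1 (majority) next to its distance
    others = [(math.dist(sample, p), 0) for p in minority if p != sample]
    others += [(math.dist(sample, p), 1) for p in majority]
    others.sort(key=lambda pair: pair[0])
    return sum(label for _, label in others[:k]) / k

def samples_per_point(minority, majority, total, k=3):
    """Share `total` synthetic samples between minority points, proportionally to r."""
    ratios = [difficulty_ratio(p, minority, majority, k) for p in minority]
    norm = sum(ratios)
    if norm == 0:
        # Every minority point is surrounded by its own class: nothing to generate
        return [0] * len(minority)
    return [round(total * r / norm) for r in ratios]

# Toy data (hypothetical): (0, 0) plays the role of case 1, (10, 0) of case 2
minority = [(0, 0), (10, 0), (10, 1)]
majority = [(0, 1), (1, 0), (0, -1), (11, 0), (10, -1)]

r_case1 = difficulty_ratio((0, 0), minority, majority)   # all 3 neighbours are majority
r_case2 = difficulty_ratio((10, 0), minority, majority)  # 1 of the 3 neighbours is minority
allocation = samples_per_point(minority, majority, total=7)
print(r_case1, r_case2, allocation)
```

The isolated point gets the largest share of the synthetic samples, which is exactly the behaviour described above. In practice a library implementation would normally be used instead of hand-written code like this.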
The model consists of one Random Forest classifier for each role in the squad: Goalkeeper, Defenders, Midfielders and Forwards.
Once the model is trained and tested with a 75/25 split, I predict using the statistics of the 2019-2020 season (before the COVID-19 stoppage). The selected players are simply the ones with the highest probability.
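For readers who want to see the shape of that pipeline, here is a rough sketch for a single role using scikit-learn. Everything in it is an assumption made for illustration: the data is random, the helper names (fit_role_model, pick_squad), the player names, the number of trees and the number of slots are hypothetical, and the over-sampling step is skipped. The same steps would be repeated for each role.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def fit_role_model(X, y, seed=0):
    """Train one Random Forest on a 75/25 split; return it with its held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)

def pick_squad(model, X_current, names, ages, minutes, n_slots):
    """Keep the eligible players with the highest predicted nomination probability."""
    proba = model.predict_proba(X_current)[:, 1]  # column 1 = probability of class 1 (nominated)
    eligible = (ages <= 23) & (minutes >= 1000)   # the two constraints from the article
    ranked = sorted(
        (i for i in range(len(names)) if eligible[i]),
        key=lambda i: proba[i],
        reverse=True,
    )
    return [(names[i], float(proba[i])) for i in ranked[:n_slots]]

# Random stand-in for past seasons: 200 players, 5 features, 1 = nominated
X_past = rng.normal(size=(200, 5))
y_past = (X_past[:, 0] + 0.5 * X_past[:, 1] > 1.0).astype(int)
model, accuracy = fit_role_model(X_past, y_past)

# Random stand-in for the current season of one role
X_now = rng.normal(size=(6, 5))
names = ["A", "B", "C", "D", "E", "F"]
ages = np.array([21, 23, 24, 19, 22, 30])
minutes = np.array([1500, 900, 2000, 1200, 1000, 2500])
squad = pick_squad(model, X_now, names, ages, minutes, n_slots=2)
print(accuracy, squad)
```

Filtering on age and minutes only at prediction time lets the model learn from every past player while still restricting the final ranking to the youngsters.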
Here is the U23 Squad of the Season!
Market values according to Transfermarkt:
Aaron Ramsdale (Bournemouth): 14 million
Trent Alexander-Arnold (Liverpool): 110 million
Joe Gomez (Liverpool): 42 million
Caglar Söyüncü (Leicester): 40 million
Ben Chilwell (Leicester): 50 million
Ruben Neves (Wolves): 50 million
Mason Mount (Chelsea): 45 million
Youri Tielemans (Leicester): 55 million
James Maddison (Leicester): 60 million
Gabriel Jesus (Manchester City): 70 million
Marcus Rashford (Manchester United): 80 million