I took advantage of the All-Star break to carry out a new statistical study. Most of the regular season has been played, so we can predict the end-of-season awards: Most Valuable Player (MVP), Rookie of the Year (ROY), All-Rookie Teams and All-NBA Teams. For the first two awards, Giannis Antetokounmpo (Milwaukee Bucks) and Ja Morant (Memphis Grizzlies) are the announced favorites, but some players are chasing them. This article is divided into the following parts:
- Data and evaluation
- MVP and ROY predictions
- All-Rookie Teams prediction
- All-NBA Teams prediction
All three prediction parts fundamentally use the same approach: building a model based on statistics to predict, for each player, the probability of winning the award.
Data & evaluation Metrics
All the data comes from stats.nba.com and basketball-reference.com. For each player since the 1996-1997 season (the oldest season available), I collected the following per-game stats:
- Basic stats: games played, minutes played, win rate, rebounds, assists …
- Shooting stats: FG%, FT%, eFG%, TS% …
- Defense stats: steal%, def rebound% …
- Advanced stats: win shares, offensive and defensive ratings, value over replacement player …
- Target: awarded or not
In the end, we have a total of 112 features, many of which are correlated with each other.
The problem is that each season far more players go unrewarded than rewarded, which creates an imbalanced data set. For example, if we assume 400 players per season, 100 of them rookies, then 1 in 400 wins the MVP, 1 in 100 wins the ROY, 10 in 100 make the All-Rookie Teams and 15 in 400 make the All-NBA Teams. In percentage terms that is about 0.25%, 1%, 10% and 4%, which is very few. Therefore, I used an over-sampling technique that creates new "fake" samples to balance the classes. I chose SMOTE (Synthetic Minority Over-sampling Technique). For each player of the minority class (the awarded players in our case), it looks at the nearest neighbors within that class, then creates a new sample at a random point on the segment [origin, neighbor]. Here is an explanatory diagram:
To put it simply, this technique creates fake awarded players based on real awarded players.
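The interpolation step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used for this study: the function name `smote`, the parameters `n_new`, `k` and `seed`, and the two-feature sample data are all assumptions made for the example. It also draws the origin at random for each new sample instead of looping over every minority player.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples from a list of minority-class feature vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        origin = minority[i]
        # Rank the other minority samples by Euclidean distance to the origin.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(origin, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment [origin, neighbour].
        t = rng.random()
        synthetic.append([o + t * (n - o) for o, n in zip(origin, neighbour)])
    return synthetic


# Hypothetical awarded players described by two made-up features.
awarded = [[27.0, 9.5], [30.1, 8.8], [25.4, 11.2], [28.7, 10.1]]
fake_awarded = smote(awarded, n_new=6, k=2)
```

Because every synthetic point is a mix of two real awarded players, it always stays inside the range of values already seen for that class.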
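Then, we need to evaluate our model's predictions. I chose Accuracy, because a good overall prediction rate matters, and above all Recall, because the model must not miss the players who actually win: recall is the share of truly awarded players that the model predicts as awarded. (The related guarantee, that a player predicted as awarded really is awarded, is measured by precision.)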
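To make these metrics concrete, here is a small worked example based on the All-NBA figures quoted earlier (15 awarded players out of 400). The two sets of predictions are invented for illustration: one "model" never predicts an award, the other finds 12 of the 15 winners at the cost of 6 false alarms.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means awarded."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0


# 15 awarded players followed by 385 non-awarded players.
y_true = [1] * 15 + [0] * 385

# Hypothetical model 1: never predicts an award.
always_no = [0] * 400
# Hypothetical model 2: finds 12 of the 15 winners, with 6 false alarms.
model = [1] * 12 + [0] * 3 + [1] * 6 + [0] * 379

print(accuracy(y_true, always_no), recall(y_true, always_no))  # 0.9625 0.0
print(accuracy(y_true, model), recall(y_true, model))          # 0.9775 0.8
```

The first model reaches 96% accuracy without finding a single winner, which is why accuracy alone is not enough on such an imbalanced data set.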
I used the same model for all the predictions: a fully connected neural network (NN) with 5 layers, using ReLU activation functions for the first 4 and a sigmoid for the last one to output a probability. To avoid overfitting, I used L1 regularization, which adds the 1-norm of the weights, scaled by a coefficient, as a penalty to the cost function. This L1 penalty shrinks the weights of unused features to 0.
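As a rough sketch of this architecture, the dependency-free code below builds 5 fully connected layers (ReLU for the first 4, sigmoid for the last), runs one forward pass and adds an L1 penalty to a binary cross-entropy loss. Only the 112 inputs and the layer count come from the description above. The hidden layer widths, the penalty coefficient `lam`, the random weights and the random player are assumptions for illustration, and no training is performed.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    # Numerically stable form for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def init_layer(n_in, n_out, rng):
    """Random weights (one row per output unit) and zero biases."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, activation):
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def predict_proba(network, features):
    out = features
    for index, (weights, biases) in enumerate(network):
        last = index == len(network) - 1
        out = dense(out, weights, biases, sigmoid if last else relu)
    return out[0]


def l1_penalty(network, lam):
    """lam times the sum of absolute values of all weights (biases excluded)."""
    return lam * sum(abs(w) for weights, _ in network for row in weights for w in row)


def binary_cross_entropy(y, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


rng = random.Random(0)
# 112 input features; the hidden widths 64-32-16-8 are assumed, not from the study.
LAYER_SIZES = [112, 64, 32, 16, 8, 1]
network = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]

player = [rng.random() for _ in range(112)]  # made-up feature vector
probability = predict_proba(network, player)
data_loss = binary_cross_entropy(1, probability)
total_loss = data_loss + l1_penalty(network, lam=1e-4)
```

During training, the gradient of the L1 term pushes every weight toward zero by the same amount, which is what lets the weights of uninformative features vanish.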
I fitted my model on every season from 1996 to 2018 and tested it on the 2018-2019 season. The data set was split into training and validation sets with a 75/25 ratio. I then applied SMOTE to the training set only, so the validation set contains only real players. I trained my models for 100 epochs (1 epoch means the model has seen the entire training set once). For example, the following graph shows the loss for the All-NBA Teams model.
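Early stopping could also be used: training is halted once the validation loss stops improving, instead of always running the full 100 epochs.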
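A common way to implement early stopping is a patience counter on the validation loss. The sketch below is one possible version. The function name, the `patience` and `min_delta` parameters and the loss values are all hypothetical and are not taken from the training runs in this article.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience never ran out: training went through every epoch.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.60, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In practice the weights saved at `best_epoch` are the ones kept for prediction.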
MVP and ROY
Last year, I already made this prediction with a few features, and the predictions were correct: Giannis and Doncic were the 2018-2019 MVP and ROY respectively. This year I used many more features (112), and I thought it would be interesting to add a feature that models popularity. I therefore collected the bookmakers' odds from the beginning of the season.
Applying the model defined previously, we obtain for each player a probability of winning the award. I display only the players with the highest probabilities.
Giannis Antetokounmpo should win it back to back with a 69% probability (note that according to the bookmakers it is 76.5%), followed by Luka Doncic, who is having a huge season, and James Harden, a recurring candidate.
Ja Morant (2nd pick of the draft) should succeed Luka Doncic as Rookie of the Year with a probability of 46% according to my model (while the bookmakers give him 79%!), followed by Kendrick Nunn and Zion Williamson. In my opinion, Zion will finish second by the end of the season given his recent level.
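One possible way to turn odds into a numeric popularity feature is to convert them into implied probabilities. The sketch below assumes decimal odds and normalizes away the bookmaker's margin. The odds format, the function name and the three example players are assumptions, since the exact encoding used for this feature is not detailed here.

```python
def implied_probabilities(decimal_odds):
    """Map {player: decimal odds} to probabilities that sum to 1."""
    # A decimal odd of 4.0 corresponds to a raw implied probability of 1/4.
    raw = {name: 1.0 / odds for name, odds in decimal_odds.items()}
    # Raw probabilities usually sum to more than 1 (the bookmaker's margin),
    # so each one is divided by the total.
    total = sum(raw.values())
    return {name: p / total for name, p in raw.items()}


# Hypothetical pre-season odds for three unnamed candidates.
odds = {"Player A": 1.5, "Player B": 4.0, "Player C": 6.0}
popularity = implied_probabilities(odds)
```

Players without a quoted odd would need a default value, for example 0.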
All-Rookie Teams
Using the same features and the same model as before, we get for each player a probability of making the All-Rookie Teams. For each position, the player with the highest probability takes the spot.
From left to right (G-G-F-F-C):
1st Team: Ja Morant (Memphis Grizzlies), Kendrick Nunn (Miami Heat), Rui Hachimura (Washington Wizards), PJ Washington (Charlotte Hornets) and Zion Williamson (New Orleans Pelicans)
2nd Team: Tyler Herro (Miami Heat), RJ Barrett (New York Knicks), De'Andre Hunter (Atlanta Hawks), Eric Paschall (Golden State Warriors) and Brandon Clarke (Memphis Grizzlies)
Missed the cut: Matisse Thybulle (Philadelphia 76ers), Terence Davis (Toronto Raptors), Coby White (Chicago Bulls)
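The position-based selection rule can be sketched as follows: teams are filled one after another, and within each team the two guard, two forward and one center spots go to the remaining players with the highest probability. The player labels and probabilities below are placeholders, not model outputs, and the function name `select_teams` is hypothetical.

```python
# Spots per team, in G-G-F-F-C order.
SLOTS = {"G": 2, "F": 2, "C": 1}


def select_teams(players, n_teams):
    """players: list of (name, position, probability) tuples.

    Returns (teams, missed), where teams is a list of name lists and
    missed holds the unselected names sorted by decreasing probability.
    """
    remaining = sorted(players, key=lambda p: p[2], reverse=True)
    teams = []
    for _ in range(n_teams):
        team = []
        for position, count in SLOTS.items():
            # Best remaining players at this position take the spots.
            picks = [p for p in remaining if p[1] == position][:count]
            team.extend(picks)
            remaining = [p for p in remaining if p not in picks]
        teams.append([name for name, _, _ in team])
    return teams, [name for name, _, _ in remaining]


# Placeholder candidates with made-up probabilities.
candidates = [
    ("G1", "G", 0.95), ("G2", "G", 0.90), ("G3", "G", 0.60),
    ("G4", "G", 0.55), ("G5", "G", 0.30),
    ("F1", "F", 0.85), ("F2", "F", 0.80), ("F3", "F", 0.50), ("F4", "F", 0.45),
    ("C1", "C", 0.70), ("C2", "C", 0.40), ("C3", "C", 0.20),
]
teams, missed = select_teams(candidates, n_teams=2)
print(teams[0])  # ['G1', 'G2', 'F1', 'F2', 'C1']
print(teams[1])  # ['G3', 'G4', 'F3', 'F4', 'C2']
print(missed)    # ['G5', 'C3']
```

The same function works for the All-NBA Teams by calling it with `n_teams=3`.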
All-NBA Teams
We proceed in the same way as for the All-Rookie Teams.
1st Team: James Harden (Houston Rockets), Luka Doncic (Dallas Mavericks), LeBron James (Los Angeles Lakers), Giannis Antetokounmpo (Milwaukee Bucks) and Anthony Davis (Los Angeles Lakers)
2nd Team: Damian Lillard (Portland Trail Blazers), Russell Westbrook (Houston Rockets), Kawhi Leonard (Los Angeles Clippers), Pascal Siakam (Toronto Raptors) and Nikola Jokic (Denver Nuggets)
3rd Team: Ben Simmons (Philadelphia 76ers), Devin Booker (Phoenix Suns), Jimmy Butler (Miami Heat), Jayson Tatum (Boston Celtics) and Rudy Gobert (Utah Jazz)
Missed the cut: Joel Embiid (Philadelphia 76ers), Khris Middleton (Milwaukee Bucks), Kemba Walker (Boston Celtics)
It was really close between Lillard and Doncic and between Gobert and Embiid.