This is a reproducible tutorial on identifying player and team types in the NBA. Standard player positions are extremely outdated and often carry little meaning when describing what a player will actually do when he/she steps on the court. The motivation behind this project is to identify new clusters of NBA players which better align with what our eyes see each night on the court. Teammates LeBron James and Anthony Davis are both technically forwards but are used in completely different ways. The NBA is transitioning towards a much more free-flowing style of play where centers are shooting 3-pointers with frequency and point guards are 6'7". Teams will have to start using more descriptive methods for identifying the types of players they wish to build their lineup around.

Multiple popular clustering algorithms will be described and implemented using physical characteristics of NBA players obtained from the NBA Combine, as well as on-court tendencies of players on both sides of the ball. This tutorial requires minimal background in statistics and R. It was developed for anyone with an interest in either learning more about play styles in the NBA or sports analytics in general. Hopefully this can serve as a base for anyone who wants to learn some new things in R or get ideas about how to get creative with sports data. There will be some math for those interested but I'll try to keep it as light as possible.

I'll be going over K-Means clustering, Principal Component Analysis, Model Based Clustering, and Networks/Graphs. I'll also do a fair amount of plotting and try to comment as many of the plotting functions as possible. Almost all of them allow for some level of customization which is always helpful to practice. Visualization is everything in a field where you have to spend a large amount of time generating buy-in to your statistical analysis.

The Data

The data used throughout this tutorial comes from the NBA's API and Basketball Reference. nbastatR is a package developed by Alex Bresler which scrapes from both of these sites. Documentation for the package and its functions can be found in the package's GitHub repository (abresler/nbastatR).

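# install and load libraries
# devtools::install_github("abresler/nbastatR")
library(nbastatR)  # draft_combines() and the Basketball Reference scrapers
# if you don't have any of these, install using install.packages('package_name') 
library(tidyverse) # dplyr, ggplot2, tibble
library(extrafont) # fonts used in the plot theme
# font_import(prompt=FALSE)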

This function pulls data from Basketball Reference. The data frame will contain individual player stats on a per minute basis from each of the last ten seasons, as well as the total number of minutes played. This is only the data I wanted to grab for this project but there are dozens more features you can pull from this function yourself.

Let's check out the distribution of total minutes our players played in each season.

# define personal theme for rules you want every graph to have (this saves time / space) 
theme_stern <- function() {
  theme(text = element_text(family='Tahoma', color="#232D4B"), # set font and color of all text
        axis.title = element_text(face='bold'), # make all axis titles bold
        plot.title = element_text(face='bold', hjust=0.5), # make all plot titles bold and centered
        legend.title = element_text(face='bold'), # make all legend titles bold
        plot.subtitle = element_text(face='italic', hjust=0.5)) # make all subtitles italic and centered
}

player_stats %>%
  ggplot(aes(x=minutesTotals)) + 
  geom_histogram(aes(y=after_stat(density)), position='identity', # generate histogram of data (density scale)
                 fill="#232D4B", alpha=0.9, bins=30) + 
  geom_density(alpha=0.2, fill="#F84C1E", color="#F84C1E") + # generate kernel density estimate
  labs(y = "Density", x = "Total Minutes Played", # assign axis titles
       title = "Distribution of Total Minutes Played in a Season") + # assign plot title
  scale_x_continuous(breaks=seq(-500, 3000, 500)) + # manually set x-axis ticks
  theme_minimal() + theme_stern() + # apply themes
  theme(axis.ticks.y = element_blank(), axis.text.y = element_blank(), # manually adjust theme
        axis.text.x = element_text(size=10)) 

There looks to be a drop-off, so let's only keep player seasons with more than \(1000\) minutes played, leaving us with roughly \(1500\) player seasons. This ensures player clusters will be representative of the types of players you see most frequently on an NBA court.

# subset data to players who played >1000 minutes in a season
player_stats <- player_stats %>%
  dplyr::filter(minutesTotals > 1000, idPlayerNBA!=1628977, idPlayerNBA!=1628998) # remove players whose ID repeats

# vector of unique player IDs
unique_playerIDs <- unique(player_stats$idPlayerNBA)

This function will pull data from the NBA Draft Combine. Again, there's way more than just what I'm pulling here so be sure to check it out yourself.

# pull player heights from the NBA Draft Combine
combine <- draft_combines(years=2000:2019,
                          nest_data=FALSE, return_message=FALSE) %>%
  # select only the columns we want 
  dplyr::select(idPlayer, heightWOShoesInches, weightLBS, wingspanInches, 
                verticalLeapMaxInches, timeLaneAgility,
                repsBenchPress135) %>%
  # limit to players in the on-court stats dataset
  dplyr::filter(idPlayer %in% unique_playerIDs) %>%
  # rename ID column for merging purposes later
  dplyr::rename(idPlayerNBA = idPlayer)
K-Means Clustering

K-Means is an unsupervised clustering algorithm. Unsupervised means that it operates without the input of a response variable. Unlike a regression model or any other prediction problem, K-Means is only concerned with grouping observations based on the values of the predictors. The groupings are determined using the distances between observations, attempting to minimize the distance within clusters and, as a result, keep different clusters far apart. The most common distance formula used is Euclidean distance:

\[\text{distance}(a, b) = \sqrt{\sum^m_{i=1} (a_i - b_i)^2}\]

where \(m\) is the number of predictors. This is the same straight-line distance formula you learned in Geometry class, but it works in any number of dimensions and can therefore take into account any number of predictors.

This also means that all predictors must be on the same scale. Otherwise, a predictor measured in large units (vertical leap in inches) would swamp one measured in small units (shot attempts per minute) simply because its raw differences are bigger. The function scale() standardizes quantitative predictors using the z-score formula:

\[x_{new} = \frac{(x_i - \bar{x})}{\sigma_x}\]

After scaling, every predictor has a mean of zero and a standard deviation of one, so distance along each dimension is measured in standard deviations rather than raw units. The shape of each predictor's distribution remains intact. In this example, a player who sits two standard deviations above average in 3PT% ends up just as far from the average player on that dimension as one who sits two standard deviations above average in 2PT%.

This tutorial will not cover them, but qualitative predictors can be used as well. It is suggested that qualitative predictors be normalized based on the moments of either the binomial or Dirichlet distributions.

The algorithm begins by taking in a pre-specified \(k\), which dictates the number of clusters it should create. It places the centers of each of these clusters in random locations across the \(m\) dimensions, one per predictor variable.
From here the algorithm repeats the same two steps:

  1. The distance from each observation to each center is calculated and observations are assigned to the cluster they are closest to.
  2. Cluster centers are recomputed as the mean value of each predictor variable for the observations assigned to the cluster.

These steps are repeated until no observation changes cluster membership in step (1) compared with the previous iteration.
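To make the two steps concrete, here is a minimal from-scratch sketch on made-up numbers. The function simple_kmeans() and the toy data are hypothetical and only for illustration; the tutorial itself uses R's built-in kmeans(). One assumption worth flagging: the sketch starts from \(k\) randomly chosen observations rather than arbitrary random locations.

```r
# Toy data: two made-up predictors for six hypothetical players
toy <- matrix(c(0.10, 0.12, 0.11, 0.45, 0.50, 0.48,   # e.g. 3PA per minute
                0.40, 0.38, 0.42, 0.05, 0.07, 0.06),  # e.g. ORB per minute
              ncol = 2)
X_toy <- scale(toy)  # z-score each column

simple_kmeans <- function(X, k, max_iter = 20) {
  # assumption: initial centers are k randomly chosen observations
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  cluster <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # step 1: squared Euclidean distance from every observation to every center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of the nearest center
    # stop once no observation changes cluster membership
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # step 2: move each center to the mean of the observations assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(X[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
fit <- simple_kmeans(X_toy, k = 2)
fit$cluster  # cluster label for each of the six toy players
```

The cluster labels themselves are arbitrary (which group is called 1 or 2 depends on the random start); only the grouping matters.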

The random nature of the beginning of the K-Means algorithm does mean that different clusters can be created using the same data and the same \(k\). More robust algorithms for choosing the initial cluster centers have been developed in recent years. I won't get into those here, but the most popular of these is K-Means++.
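A quick way to see this sensitivity for yourself is to run kmeans() from a single random start under several seeds and compare the total within-cluster sum of squares. The sketch below uses made-up, structureless data (not the NBA data), so the exact numbers mean nothing. The nstart argument is the built-in mitigation: it runs several random starts and keeps the best one, which is why the tutorial's own call sets it above one.

```r
set.seed(42)
toy <- scale(matrix(rnorm(300 * 4), ncol = 4))  # made-up data, no real clusters

# one random start per seed: results typically differ from seed to seed
seeds <- 1:10
single_start <- sapply(seeds, function(s) {
  set.seed(s)
  kmeans(toy, centers = 6, nstart = 1, iter.max = 50)$tot.withinss
})
round(single_start, 1)

# many random starts: kmeans() keeps the solution with the lowest total within SS
set.seed(1)
multi <- kmeans(toy, centers = 6, nstart = 25, iter.max = 50)
multi$tot.withinss
```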

Since all predictors are scaled in the same way, the distance calculation makes all predictors equally important. Therefore, the selection of predictor variables makes all the difference when running K-Means. For this project, a mix of physical traits and on-court actions was used. Physical traits are extremely important in basketball. They help determine who you are able to defend, how you move around the court, and how physical you can be on offense.

For shooting, a decision must be made to either use makes or attempts. I chose to use per minute player tendencies rather than results because they provide a better idea of what a player's game is. What is a player doing while they are on the court? Made shots will always be noisier than attempts due to the luck involved in putting a ball into a hoop from \(10\)-\(30\) (or \(40\) if you're Damian Lillard) feet away. Attempts also partially encompass talent due to the natural selection bias which occurs. Coach isn't going to let you take \(25\)+ shots a game if he/she doesn't think you're going to hit some of them. This decision should lead to groupings of players that make a lot of natural sense to frequent viewers of the game. Players in the same cluster will ideally be those with similar playing styles and similar roles within their respective offenses and defenses.

Physical Traits:

  • Vertical - maximum vertical leap (in.)
  • Agility - lane agility test score (sec.)
  • Strength - number of 135lb bench press reps

On-Court Tendencies:

  • 3PA - 3-point shots attempted per minute
  • 2PA - 2-point shots attempted per minute
  • FTA - free throws attempted per minute (How often is this player drawing fouls while shooting?)
  • ORB - offensive rebounds per minute
  • DRB - defensive rebounds per minute
  • AST - assists per minute
  • STL - steals per minute
  • BLK - blocks per minute
  • TOV - turnovers per minute
  • PTS - total points scored per minute
  • 3PT% - percentage of 3-point shots made
  • 2PT% - percentage of 2-point shots made
  • FT% - percentage of free throws made

# attach the Combine measurements to the on-court stats
player_stats <- player_stats %>%
  # join player on-court stats with NBA combine metrics
  inner_join(combine, by='idPlayerNBA') %>%
  # remove players with NA values for Combine metrics 
  dplyr::filter(!is.na(verticalLeapMaxInches), !is.na(timeLaneAgility),
                !is.na(repsBenchPress135))

X <- player_stats %>% 
  # select predictor variables defined above
  dplyr::select(pctFG3, pctFG2, pctFT, 
                fg3aPerMinute, fg2aPerMinute, ftaPerMinute, 
                orbPerMinute, drbPerMinute, astPerMinute, 
                stlPerMinute, blkPerMinute, tovPerMinute, 
                ptsPerMinute,  verticalLeapMaxInches, 
                timeLaneAgility, repsBenchPress135
                ) %>%
  # scale data (z-score every column)
  scale()
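If you want to see what scale() is doing to a matrix like X, here is a small sketch on made-up numbers. The two columns borrow names from the real data, but the values are invented. It checks the z-score formula by hand and shows how the Euclidean distance between two players changes once both columns are on the same scale.

```r
# made-up values for four hypothetical players
toy <- cbind(fg3aPerMinute = c(0.05, 0.10, 0.20, 0.25),
             verticalLeapMaxInches = c(28, 32, 36, 40))
toy_scaled <- scale(toy)

colMeans(toy_scaled)      # effectively zero for every column
apply(toy_scaled, 2, sd)  # one for every column

# the z-score formula applied by hand to the first column
manual <- (toy[, 1] - mean(toy[, 1])) / sd(toy[, 1])
all.equal(unname(toy_scaled[, 1]), manual)

# Euclidean distance between players 1 and 2, before and after scaling
d_raw    <- sqrt(sum((toy[1, ] - toy[2, ])^2))                # dominated by inches
d_scaled <- sqrt(sum((toy_scaled[1, ] - toy_scaled[2, ])^2))  # both columns contribute
c(raw = d_raw, scaled = d_scaled)
```

Before scaling, the distance is almost entirely the four-inch gap in vertical leap; the difference in shot attempts barely registers. After scaling, both predictors have a say.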

Since the K-Means algorithm only fits the number of clusters it is given, some testing must be done to find the optimal \(k\). The optimal \(k\) does the best job of ensuring that observations within the same cluster are as similar as possible while observations in different clusters are as different as possible. Here I'll fit a K-Means algorithm for \(1\) to \(20\) clusters and save the total within-cluster sum of squares from each model. This measures how tightly the observations in each cluster sit around their cluster center, so smaller values mean more similar observations within clusters.

set.seed(222) # set seed to ensure reproducibility b/c k-means relies on random states for initialization 
MAX_K <- 20 # max number of clusters
sse <- c() # vector to hold SSE of each model

for (k in 1:MAX_K) {
  algo_k <- kmeans(X, centers=k, nstart=22, iter.max=20) # k-means algorithm
  sse <- c(sse, algo_k$tot.withinss) # get SSE
}
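If tot.withinss feels like a black box, it can be rebuilt by hand: take each observation, subtract the center of the cluster it was assigned to, square, and add everything up. A minimal sketch on made-up data (standing in for X) is below.

```r
set.seed(222)
toy <- scale(matrix(rnorm(200 * 3), ncol = 3))  # made-up data standing in for X
fit <- kmeans(toy, centers = 4, nstart = 10, iter.max = 20)

# look up the center of the cluster each observation was assigned to
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]

# sum of squared distances from each observation to its own cluster center
manual_sse <- sum((toy - assigned_centers)^2)

all.equal(manual_sse, fit$tot.withinss)
```

This is the quantity stored in sse for each \(k\) in the loop above.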

How does the total within cluster distance change as \(k\) ranges from \(1\) to \(20\)?

tibble(k = 1:MAX_K, SSE = sse) %>%
  ggplot(aes(x=k, y=SSE)) + 
  geom_point(color="#F84C1E") + geom_line(color="#232D4B") + # set color of point and lines
  labs(x = "K", y = "SSE", title = "Where does this level off?") + # set axis/plot titles
  scale_x_continuous(breaks=seq(1, MAX_K, 1)) + # define x-axis
  theme_minimal() + theme_stern() + # add themes
  theme(panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank()) # manually alter theme

Looks like this dips between \(9\) and \(11\) but it can be tough to find the exact point where the above chart levels off. Another thing you can look at is the difference between the total within cluster distance of \(k\) and \(k+1\). Where does this chart level off?

tibble(k = 1:MAX_K, SSE_difference = sse-lead(sse)) %>%
  dplyr::filter(k<MAX_K) %>%
  ggplot(aes(x=k, y=SSE_difference)) + 
  geom_point(color="#F84C1E") + geom_line(color="#232D4B") + # set color of point and lines
  labs(x = "K", y = "SSE Rolling Difference", title = "A Clearer Picture") + # set axis/plot titles
  scale_x_continuous(breaks=seq(1, MAX_K, 1)) + # define x-axis
  theme_minimal() + theme_stern() + # add themes
  theme(panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank()) # manually alter theme
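A note on the sse-lead(sse) step: lead() shifts the vector forward by one position, so each entry is the drop in SSE from \(k\) to \(k+1\). The last \(k\) has nothing after it and comes out as NA, which is why the plot filters to k<MAX_K. Here is a base-R sketch of the same calculation on invented SSE values (not the ones from the NBA data).

```r
sse_toy <- c(100, 60, 40, 32, 29, 28)  # made-up SSE values for k = 1..6

# base-R stand-in for sse - lead(sse): shift the vector forward and pad with NA
drop_to_next <- sse_toy - c(sse_toy[-1], NA)
drop_to_next    # 40 20 8 3 1 NA

# the same differences without the trailing NA
-diff(sse_toy)  # 40 20 8 3 1
```

In this toy example the drops shrink quickly after \(k=3\), which is the kind of leveling off to look for in the real chart.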