NCAA Basketball Analysis Part 3 - Data Preprocessing and Feature Engineering

Mar 21, 2019 8 min read Data Science, R, Tidyverse

Introduction

This is post three of a multi-part analysis of college basketball game outcomes. Part one shows how the data was aquired and loaded, and part two is a brief exploratory data analysis exploring a few teams for the 2005 season. Links to those can be found below.

Post 1 Post 2

This post will focus on some data munging and feature engineering. The data set is currently not set up very well for modeling. In future posts, I would like to model the score differential with a variety of regression techniques, as well as using a few classification algorithms to classify wins vs losses. The do that, I need the data to be in a much more structured format.

I will show how I calculated each teams three game rolling averages and cumulative season averages going into that game. I will query the data using the same method as in posts 1 and 2.

SELECT 
    *
FROM
    games

Recall, the data is currently structured as box scores and individual game statistics. I want the input to be the teams season and 3 game rolling statistics going into that game. To do this, I will rely heavily on dplyr.

ds <- 
  ds %>% 
  mutate(Dayzero = mdy(Dayzero)) 

# Extracting vector of seasons in ds
seasons <- ds$Season %>% unique() %>% as.character()

# Extracting vector of teams
teams <- 
  c(as.character(ds$winner), as.character(ds$Loser)) %>% 
  unique() %>% 
  sort()

The code above grabs a list of teams and seasons that I will now iterate through. Now, I know everyone says for loops should be avoided in R if possible, but this seemed like an exception to me. The idea is that I will iterate through each season in the data set, and nested inside each season iteration, I will iterate through each team in the data set.

# Initialize the storage data frame
data <- data.frame()

# Looping through each season ----
for (i in 1:length(seasons)) {
  
  # Extract season i of the vector
  temp_season <- as.character(seasons[i])
  
  # Loop through each team ----
  for (j in 1:length(teams)) {
    
    # Extract team j of the vector
    temp_team <- as.character(teams[j])
  
    # Creating a temporary data frame for this i,j iteration
    temp_df <-
      ds %>% 
      filter(
        Season == temp_season,
        (winner == temp_team | Loser == temp_team)
      ) %>% 
      # Current game munging
      mutate(
        opponent = as.character(
          case_when(
            winner == temp_team ~ Loser,
            TRUE ~ winner
          )
        ),
        outcome = case_when(
          winner == temp_team ~ 1,
          TRUE ~ 0
        ),
        tourney = case_when(
          type == 'post' ~ 1,
          TRUE ~ 0
        ),
        home = case_when(
          winner == temp_team & Wloc == 'H' ~ 1,
          winner != temp_team & Wloc == 'A' ~ 1,
          TRUE ~ 0
        ),
        away = case_when(
          winner == temp_team & Wloc == 'A' ~ 1,
          winner != temp_team & Wloc == 'H' ~ 1,
          TRUE ~ 0
        ),
        neutral = case_when(
          Wloc == 'N' ~ 1,
          TRUE ~ 0
        ),
        score = case_when(
          winner == temp_team ~ Wscore,
          TRUE ~ Lscore
        ),
        opp_score = case_when(
          winner == temp_team ~ Lscore,
          TRUE ~ Wscore
        ),
        fga = case_when(
          winner == temp_team ~ Wfga,
          TRUE ~ Lfga
        ),
        opp_fga = case_when(
          winner == temp_team ~ Lfga,
          TRUE ~ Wfga
        ),
        fgm = case_when(
          winner == temp_team ~ Wfgm,
          TRUE ~ Lfgm
        ),
        opp_fgm = case_when(
          winner == temp_team ~ Lfgm,
          TRUE ~ Wfgm
        ),
        fga3 = case_when(
          winner == temp_team ~ Wfga3,
          TRUE ~ Lfga3
        ),
        opp_fga3 = case_when(
          winner == temp_team ~ Lfga3,
          TRUE ~ Wfga3
        ),
        fgm3 = case_when(
          winner == temp_team ~ Wfgm3,
          TRUE ~ Lfgm3
        ),
        opp_fgm3 = case_when(
          winner == temp_team ~ Lfgm3,
          TRUE ~ Wfgm3
        ),
        fta = case_when(
          winner == temp_team ~ Wfta,
          TRUE ~ Lfta
        ),
        ftm = case_when(
          winner == temp_team ~ Wftm,
          TRUE ~ Lftm
        ),
        or = case_when(
          winner == temp_team ~ Wor,
           TRUE ~ Lor
        ),
        dr = case_when(
          winner == temp_team ~ Wdr,
          TRUE ~ Ldr
        ),
        opp_or = case_when(
          winner == temp_team ~ Lor,
          TRUE ~ Wor
        ),
        opp_dr = case_when(
          winner == temp_team ~ Ldr,
          TRUE ~ Wdr
        ),
        ast = case_when(
          winner == temp_team ~ Wast,
          TRUE ~ Last
        ),
        to = case_when(
          winner == temp_team ~ Wto,
          TRUE ~ Lto
        ),
        blk = case_when(
          winner == temp_team ~ Wblk,
          TRUE ~ Lblk
        ),
        pf = case_when(
          winner ==  temp_team ~ Wpf,
          TRUE ~ Lpf
        ),
        game_date = as.character(Dayzero + days(Daynum)),
        scorediff = score - opp_score,
        fgperc = fgm / fga,
        opp_fgperc = opp_fgm / opp_fga,
        fg3perc = fgm3 / fga3,
        opp_fg3perc = opp_fgm3 / opp_fga3,
        ftperc = ftm / fta,
        ordiff = or - opp_or,
        drdiff = dr - opp_dr,
        astto = ast / to,
        # Lagging stats to the next game
        scorediff_lag = lag(
          x = scorediff,
          n = 1,
          default = NA
        ),
        score_lag = lag(
            x = score,
            n = 1,
            default = NA
        ),
        fgperc_lag = lag( 
            x = fgperc,
            n = 1,
            default = NA
        ),
        opp_fgperc_lag = lag(
          x = opp_fgperc,
          n = 1,
          default = NA
        ),
        fg3perc_lag = lag(
            x = fg3perc,
            n = 1,
            default = NA
          ),
        opp_fg3perc_lag = lag(
          x = opp_fg3perc,
          n = 1,
          default = NA
        ),
        ftperc_lag = lag(
            x = ftperc,
            n = 1,
            default = NA
          ),
        ordiff_lag = lag(
            x = ordiff,
            n = 1,
            default = NA
          ),
        drdiff_lag = lag(
            x = drdiff,
            n = 1,
            default = NA
          ),
        astto_lag = lag(
            x = astto,
            n = 1,
            default = NA
          ),
        # 3 game rolling averages
        scorediff_3game = rollmean(
          x = scorediff_lag,
          k = 3, 
          fill = NA,
          align = 'right'
        ),
        score_3game = rollmean(
          x = score_lag,
          k = 3,
          fill = NA,
          align = 'right'
        ),
        fgperc_3game = rollmean(
          x = fgperc_lag,
          k = 3,
          fill = NA,
          align = 'right'
        ),
        opp_fgperc_3game = rollmean(
          x = opp_fgperc_lag,
          k = 3,
          fill = NA,
          align = 'right'
        ),
        fg3perc_3game = rollmean(
          x = fg3perc_lag,
          k = 3,
          fill = NA,
          align = 'right'
        ),
        opp_fg3perc_3game = rollmean(
          x = opp_fg3perc_lag,
          k = 3,
          fill = NA,
          align = 'right'
        ),
        ftperc_3game = rollmean(
          x = ftperc_lag,
          k = 3,
          fill = NA,
          align = 'right'
        ),
        ordiff_3game = rollmean(
          x = ordiff_lag,
          k = 3,
          fill = NA,
          align = 'right'
        ),
        drdiff_3game = rollmean(
          x = drdiff_lag,
          k = 3,
          fill = NA,
          align = 'right'
        ),
        astto_3game = rollmean(
          x = astto_lag,
          k = 3,
          fill = NA,
          align = 'right'
        )
      ) %>% 
      rowid_to_column('id') %>% 
      # Removing game 1 because no input stats
      filter(id > 1) %>% 
      # Calculating cumulative season averages
      mutate(
        scorediff_season_avg = cummean(scorediff_lag),
        score_season_avg = cummean(scorediff_lag),
        fgperc_season_avg = cummean(fgperc_lag),
        opp_fgperc_season_avg = cummean(opp_fgperc_lag),
        fg3perc_season_avg = cummean(fg3perc_lag),
        opp_fg3perc_season_avg = cummean(opp_fg3perc_lag),
        ftperc_season_avg = cummean(ftperc_lag),
        ordiff_season_avg = cummean(ordiff_lag),
        drdiff_season_avg = cummean(drdiff_lag),
        astto_season_avg = cummean(astto_lag)
      ) %>% 
      # Removing games 1 and 2 bc incomplete 3 game rolling stats
      filter(id > 3) %>% 
      select(
        opponent:neutral,
        game_date,
        scorediff,
        scorediff_3game,
        score_3game,
        fgperc_3game,
        opp_fgperc_3game,
        fg3perc_3game,
        opp_fg3perc_3game,
        ftperc_3game,
        ordiff_3game,
        drdiff_3game,
        astto_3game,
        scorediff_season_avg,
        score_season_avg,
        fgperc_season_avg,
        opp_fgperc_season_avg,
        fg3perc_season_avg,
        opp_fg3perc_season_avg,
        ftperc_season_avg,
        ordiff_season_avg,
        drdiff_season_avg,
        astto_season_avg
      )
  
    # If that team didn't have data for that season, jump to next iteration
    
    if (nrow(temp_df) == 0) {
      rm(temp_team, temp_df)
      next
    }
    
    # append existing data store
    data <- bind_rows(
      data,
      data.frame(
        team = temp_team,
        season = temp_season,
        games = temp_df
      )
    )
    
    rm(temp_team, temp_df)
  }
}

That beast of a pipeline above gets me close to what I want to be able to do, but now each game is split into two rows in that data set. I will want to join those back together to create something clean for modeling. I also want to be careful not to introduce any bias into the data set.

The plan is to look at the original dataset, extracting just the winning team, losing team, and the game date. Then, I can reference the data frame of statistics that was just calculated. I want the input data to be in the terms of team 1 and team 2. My first thought was to simply rename winning team and losing team to team 1 and team 2 respectively. After thinking about it a bit more, team 1 would always be the winner in that scenario. You don’t need a deep learning model to figure that one out. To circumvent this issue, I randomly split the data set in half. One half oriented the winning team to team 1 and the other set the losing team to team 1.

data <- data %>% 
  mutate(games.game_date = ymd(games.game_date))

# Creating True/False vector
split <- sample(
  c(TRUE, FALSE), 
  size = nrow(ds), 
  replace = TRUE, 
  prob = c(0.5, 0.5)
)

# Pull out group 1 and modify
g1 <- ds[split,] %>% 
  mutate(game_date = Dayzero + days(Daynum)) %>% 
  select(winner, Loser, game_date) %>% 
  rename(team1 = winner, team2 = Loser) 

# Pull out group 2 and modify
g2 <- ds[!split,] %>% 
  mutate(game_date = Dayzero + days(Daynum)) %>% 
  select(winner, Loser, game_date) %>% 
  rename(team1 = Loser, team2 = winner) 

# Binding groups back together
games <- bind_rows(g1, g2) %>% rowid_to_column('id')

# Joining supplementary info back on
clean <- games %>%
  inner_join(
    y = data,
    by = c('team1' = 'team', 'game_date' = 'games.game_date')
  ) %>% 
  inner_join(
    y = data,
    by = c('team2' = 'team', 'game_date' = 'games.game_date'),
    suffix = c('_t1', '_t2')
  ) %>% 
  select(
    -games.opponent_t1,
    -games.opponent_t2,
    -games.outcome_t2,
    -games.tourney_t2,
    -games.home_t2,
    -games.away_t2,
    -games.neutral_t2,
    -games.scorediff_t2
  )

# Writing to our sqlite
dbWriteTable(
  conn = db_connection,
  name = 'clean_games',
  value = clean,
  overwrite = TRUE
)

Now we have the data ready to go! In the posts to come, we are going to look at a few clustering methods, some PCA, and then on to modeling!

Tidyverse DBI

NCAA Basketball Analysis Part 3 - Data Preprocessing and Feature Engineering

Introduction

Kip Brown

Data Scientist

Related