NCAA Basketball Analysis Part 5 - A Simple Neural Network with Keras

Introduction

This is post five of a multi-part analysis of college basketball game outcomes. I will probably mention things in this post that I covered in prior posts, so here are the links if you need to catch up.

One of the end products of this series of posts is going to be a handful of models that predict the outcome of a given game. I am going to start by attempting to classify wins and losses, and will progress to estimating the point differential. This post is geared specifically towards building a multi-layer perceptron for classification using keras. Neural networks involve a lot of detail and can be difficult to learn. I am not going to go into exhaustive detail for every step, but I will provide links to a few of the places where I found inspiration as well as great explanations.

Query and Preprocessing the Data

The data pull is the same as in my other posts. If you don’t have the data set, refer to the links to the previous posts above to see how I did it.

base_dir <- '~/ncaa_data'
db_connection <- dbConnect(
      drv = RSQLite::SQLite(), 
      dbname = file.path(base_dir, 'database.sqlite')
)
# Pull the cleaned games table into a data frame
ds <- dbGetQuery(
      conn = db_connection,
      statement = 'SELECT * FROM clean_games'
)

As of right now, we have a handful of records that are incomplete, and two records that showed an infinite assist-to-turnover ratio. While that is obviously not possible, it only accounts for 6 records out of roughly 60 thousand, so I just filtered them out. I may go back and try to troubleshoot it later…

ds <- ds %>%
  filter(complete.cases(.)) %>% 
  filter(
    is.finite(games.astto_season_avg_t1),
    is.finite(games.astto_season_avg_t2)
  ) %>% 
  mutate_at(
    .vars = vars(
      id,
      team1,
      team2,
      game_date,
      season_t1,
      games.outcome_t1,
      games.tourney_t1,
      games.home_t1,
      games.away_t1,
      games.neutral_t1
    ),
    .funs = ~as.factor(.)
  ) %>%
  select(
    -id,
    -games.neutral_t1,
    -game_date,
    -games.tourney_t1,
    -team1,
    -team2,
    -season_t2
  ) 

ds %>% 
  head(n = 5) %>% 
  kable('html') %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover")
  ) %>% 
  scroll_box(width = "100%")
(Table output: the first five rows of ds — season_t1, games.outcome_t1, the home/away indicators, games.scorediff_t1, and the rolling 3-game and season-average stats (score differential, score, shooting percentages, rebound differentials, assist-to-turnover ratio) for both team 1 and team 2.)

As you can see above, we now have our few categorical variables one hot encoded, along with the numeric features we calculated in the previous posts. For a deep learning model, I typically center and scale my data. At first glance, I was going to apply this process to the dataset as a whole. But the more I thought about it, the less that approach made sense to me. This is mostly speculation, and I don't have hard facts to back it up, but I would be willing to bet money that the game has changed a bit over the 14-season span of games we are looking at. To account for this, I wanted to center and scale the data within each season. To do this, I used a for loop and the recipes package. Small sidebar: I love the recipes package. So far I have only used a handful of its steps, but there are so many preprocessing steps you can apply. Even better, it is all contained in a single object, so I can move that recipe around and bake it wherever I need!

# Extracting list of seasons
seasons <- ds %>% 
  select(season_t1) %>% 
  distinct() %>% 
  magrittr::extract2(1)

# Initializing storage for preprocessed data
store <- tibble()

# Looping through each season in seasons
for (season in seasons) {
  
  # Extract season from ds
  tmp <- ds %>% filter(season_t1 == season)

  # Center and scale all numeric variables
  rec <- tmp %>% 
    recipe() %>% 
    step_center(all_numeric(), -games.scorediff_t1) %>% 
    step_scale(all_numeric(), -games.scorediff_t1) %>% 
    prep(.)

  # Append that to the store object
  store <- bind_rows(store, rec %>% bake(tmp))
  
  # Rinse and repeat until all seasons have been preprocessed
}

Now the store object is the same size as the ds object we were previously looking at, except store is centered and scaled within each season. I don't want the season variable to be used for prediction, so I am going to drop it here. I also want my factors back to numerics. I do this here so I can make sure the data looks correct before converting to a matrix. Finally, I want to shuffle my data. The data set in its current state is ordered by date, and to avoid any ordering effects I use slice and sample.

# Remove season, and hacky way back to numerics from factors
store <- store %>% 
  select(-season_t1) %>% 
  mutate_if(is.factor, as.character) %>% 
  mutate_if(is.character, as.numeric) 

# Shuffling the data set
store <- slice(store, sample(1:n()))

Now that I have the data set preprocessed, I want to split it into a training and testing set. Since I will be building another classification model, I want to store the training and testing sets to ensure we are comparing apples to apples when comparing the models. I also want to ensure that I have a very similar proportion of team 1 wins in the training and testing data, which is taken care of with the strata parameter in the initial_split function.

# Splitting with rsample
split <- initial_split(store, prop = 3/4, strata = 'games.outcome_t1')
training <- training(split)
testing <- testing(split)

# Writing these to the database
dbWriteTable(
      conn = db_connection,
      name = 'training',
      value = training,
      overwrite = TRUE
)
dbWriteTable(
      conn = db_connection,
      name = 'testing',
      value = testing,
      overwrite = TRUE
)

# Remember to disconnect from db...
dbDisconnect(db_connection)

Modeling

Okay, now we are ready to model this data. I am using keras in this script to create a simple multi-layer perceptron. I have used a few different frameworks in both R and Python, and I really like using keras in R; I especially like the tidyverse-style piping to build the network. I'm not going to get into an explanation of how neural networks work, but I will share some of my favorite resources down at the bottom. One thing I really like about keras is how easy it is to tweak the model architecture and hyperparameters. To begin, I will build a simple network to demonstrate a few of the hyperparameters I like to control. First, I want to get our data set into a form that keras understands. Keras works on tensors, and I am not going to get into the details here. For this example, we need to separate our dataset into an input tensor and a vector of targets. Our input tensor will have our predictor variables as columns and observations as rows, and will need to be of type matrix. Our training dataframe contains our predictor variables as well as two potential targets: games.outcome_t1 and games.scorediff_t1.

x_train <- training %>% 
  select(-games.outcome_t1, -games.scorediff_t1) %>% 
  as.matrix()

y_train <- training %>% 
  select(games.outcome_t1) %>% 
  magrittr::extract2(1)

x_test <- testing %>% 
  select(-games.outcome_t1, -games.scorediff_t1) %>% 
  as.matrix()

y_test <- testing %>% 
  select(games.outcome_t1) %>% 
  magrittr::extract2(1)

Now we start building by first initializing the model object.

# Initializing model object
model <- keras_model_sequential()

Building a NN in keras feels a bit different than a lot of the models I have built using more traditional machine learning packages. Instead of feeding parameters into a single function, we build the structure of the model by piping the model object into functions that add layers to it. Here, we are going to build a densely-connected multilayer perceptron. To do this, we will add densely connected layers using the layer_dense function.

There are a few parameters that I like to use / tweak when building with the layer_dense function, and they are shown below.

  • object - this is required, and it is the model object initialized above.
  • units - this is required, and this is the number of hidden units in the layer.
  • activation - the activation function for this layer; the layer is linear (no activation) if nothing is provided.
  • kernel_regularizer - this allows us to add L1, L2, or L1 and L2 regularization to the layer.
  • input_shape - only required on the first layer, and is the dimensionality of an observation of your input data.

Object is obvious: it is the model object, which we will pipe into the function. Units is a hyper-parameter that takes some searching; I will try 16 hidden units here. I won't give over-simplified explanations of the activation functions or regularization, but I will share some of my favorite articles about them below. One of the toughest concepts for me to understand at first was the input shape, and how the shape changes from layer to layer. The first layer requires input_shape as a parameter, whereas keras is smart enough to take the output shape of the previous layer for the remaining densely connected layers. In our example, the input shape is the number of columns in x_train, because a single row of the input matrix maps to a single element of the target vector, y_train. Below is an example of adding a layer to a model.

# pass model object into layer_dense to add a densely connected layer
model %>% 
  layer_dense(
    # number of hidden units
    units = 16,
    # activation function
    activation = 'relu',
    # adding l2 regularization
    kernel_regularizer = regularizer_l2(l = 0.01),
    # providing the input shape
    input_shape = ncol(x_train)
  )

# Printing a summary of the model
summary(model)
- ___________________________________________________________________________
- Layer (type)                     Output Shape                  Param #     
- ===========================================================================
- dense_1 (Dense)                  (None, 16)                    688         
- ===========================================================================
- Total params: 688
- Trainable params: 688
- Non-trainable params: 0
- ___________________________________________________________________________

As you can see, there is 1 layer with 688 parameters. Our input shape was 1 x 42, and we told the layer to have 16 hidden units. Each of the 42 input variables has a weight, or parameter, for each of the 16 hidden units, which takes us to 672 parameters. The remaining 16 are bias terms, one for each hidden unit.
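In other words, a dense layer's parameter count is just the number of inputs times the number of units, plus one bias per unit. A quick sanity check in R:

# Dense layer parameters = inputs * units + one bias per unit
ncol(x_train) * 16 + 16   # 42 * 16 + 16 = 688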

Now I want to add a dropout layer to the model. This will help us avoid overfitting the model. The only parameter I use for the dropout layer is the rate, and occasionally the seed.

model %>% layer_dropout(rate = 0.1)

summary(model)
- ___________________________________________________________________________
- Layer (type)                     Output Shape                  Param #     
- ===========================================================================
- dense_1 (Dense)                  (None, 16)                    688         
- ___________________________________________________________________________
- dropout_1 (Dropout)              (None, 16)                    0           
- ===========================================================================
- Total params: 688
- Trainable params: 688
- Non-trainable params: 0
- ___________________________________________________________________________

As you can see, adding a dropout layer doesn’t change the output shape of the model, and doesn’t have any trainable parameters.

Now for the output layer. In this use case, we have a single output: the outcome of the game. That tells us how many units we need in the output layer; since we are predicting one binary outcome, we need just 1 unit. We are also going to switch up our activation function. For binary classification, we use the sigmoid function, which returns a value between 0 and 1 representing the probability of team 1 winning the game.

model %>% 
  # adding a hidden layer
  layer_dense(
    units = 16,
    activation = 'relu',
    kernel_regularizer = regularizer_l2(l = 0.01)
  ) %>% 
  layer_dropout(rate = 0.1) %>% 
  # output layer
  layer_dense(
      # single output
      units = 1,
      # sigmoid activation for classification
      activation = 'sigmoid'
  )

summary(model)
- ___________________________________________________________________________
- Layer (type)                     Output Shape                  Param #     
- ===========================================================================
- dense_1 (Dense)                  (None, 16)                    688         
- ___________________________________________________________________________
- dropout_1 (Dropout)              (None, 16)                    0           
- ___________________________________________________________________________
- dense_2 (Dense)                  (None, 16)                    272         
- ___________________________________________________________________________
- dropout_2 (Dropout)              (None, 16)                    0           
- ___________________________________________________________________________
- dense_3 (Dense)                  (None, 1)                     17          
- ===========================================================================
- Total params: 977
- Trainable params: 977
- Non-trainable params: 0
- ___________________________________________________________________________

As you can see, dense_2 has 272 parameters. Recall that the output shape of dense_1 was 16, because of our 16 hidden units. These 16 outputs feed into the 16 units of dense_2, so 16 outputs * 16 hidden units + 16 bias parameters brings us to 272 trainable parameters in dense_2. The output layer then has 16 outputs from the previous layer times 1 unit, plus 1 bias term, for 17 trainable parameters, giving the network 977 trainable parameters in total. And that's a very simple model architecture! Now it's time to define the optimizer and the loss function, and compile the model.

There are a few decisions that have to be made when fitting the model object: one is the number of epochs to run, and the other is the batch size. An epoch is one full pass of the training data through the model. The batch size is how many observations are passed through the model before the model updates its weights. There are a few methodologies for setting these values; I'll put some good links below.
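For a rough sense of scale, the number of weight updates per epoch is simply the number of training rows divided by the batch size:

# Approximate number of weight updates per epoch with a batch size of 100
ceiling(nrow(x_train) / 100)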

model %>% 
  compile(
      # using adam optimizer
      optimizer = 'adam',
      # binary crossentropy for classification
      loss = 'binary_crossentropy',
      metrics = 'accuracy'
  )
history <- fit(
  object = model,
  x = x_train,
  y = y_train,
  batch_size = 100,
  epochs = 20,
  verbose = 0
)

Now that we have trained the model, we can visualize the accuracy and the loss by epoch. Keras can do this with the plot function, but I like using ggplot, so I plotted it that way.

p1 <- tibble(
    accuracy = history$metrics$acc,
    loss = history$metrics$loss
) %>% 
    rowid_to_column('Epoch') %>% 
    ggplot(aes(x = Epoch, y = accuracy)) +
    geom_point(size = 2, alpha = 0.75, color = '#08F7FE') +
    geom_smooth(method = 'loess', se = FALSE, color = '#F5D300') +
    labs(
        x = 'Epoch',
        y = 'Accuracy',
        title = 'Training Metrics'
    ) +
    ggdark::dark_theme_minimal()

p2 <- tibble(
    accuracy = history$metrics$acc,
    loss = history$metrics$loss
) %>% 
    rowid_to_column('Epoch') %>% 
    ggplot(aes(x = Epoch, y = loss)) +
    geom_point(size = 2, alpha = 0.75, color = '#08F7FE') +
    geom_smooth(method = 'loess', se = FALSE, color = '#F5D300') +
    labs(
        x = 'Epoch',
        y = 'Loss'
    ) +
    ggdark::dark_theme_minimal() 

cowplot::plot_grid(p1, p2, nrow = 2)

We can also evaluate the performance of the model using yardstick. Yardstick has a handful of functions for calculating a variety of model metrics. For classification, a few common metrics are accuracy, precision, and recall. Typically, precision and recall are better ways for me to evaluate a model, because there is usually a considerable cost to mis-classifying observations one way versus the other. This use case is a bit of a special case: both outcomes really represent the same thing (a win for team 1 is a loss for team 2), so both types of error should carry the same weight. For that reason, I am going to focus on accuracy as the metric I am trying to maximize.

pred_probs <- predict_proba(object = model, x = x_test) %>% 
    as.vector() %>% 
    round(2)
 
estimate <- as.factor(ifelse(pred_probs > 0.5, 1, 0))
  
tibble(y_test, estimate) %>% 
    yardstick::accuracy(truth = as.factor(y_test), estimate = estimate) %>% 
    kable('html') %>% 
    kable_styling(bootstrap_options = c("striped", "hover"))
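If I did care more about one type of error than the other, the same predictions could be fed to yardstick's other classification metrics. A minimal sketch, using the same tibble pattern as above:

# Sketch: precision and recall on the same test-set predictions
tibble(truth = as.factor(y_test), estimate = estimate) %>% 
    yardstick::precision(truth = truth, estimate = estimate)

tibble(truth = as.factor(y_test), estimate = estimate) %>% 
    yardstick::recall(truth = truth, estimate = estimate)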

Alright! Now I want to get into the nitty-gritty of how I search for optimal models. One of the challenges we face is knowing what to set the hyper-parameters to. Most of the time, it's a bit of trial and error. One way to approach this is a grid search: we create a parameter space over all of the parameters we are looking to optimize. In this example, we will search over a few of the hyper-parameters that we used to build the perceptron above. Base R has a handy function, expand.grid, that makes this super easy. I am going to search for the optimal number of hidden units, the drop rate in the dropout layer, the learning rate of the adam optimizer, the L2 regularization factor, and the number of epochs.

hyper_grid <- expand.grid(
  units = c(16, 32, 64),
  drop_rate = c(0.1, 0.25, 0.5),
  adam_lr = c(0.0001, 0.00005),
  l2 = c(0.005, 0.01),
  epochs = c(25)
)

hyper_grid %>% 
    kable('html') %>% 
    kable_styling(bootstrap_options = c("striped", "hover")) %>% 
    add_header_above(c("Hyper Grid Scenarios" = 5)) %>% 
    scroll_box(height = '400px') 
Hyper Grid Scenarios
units drop_rate adam_lr l2 epochs
16 0.10 1e-04 0.005 25
32 0.10 1e-04 0.005 25
64 0.10 1e-04 0.005 25
16 0.25 1e-04 0.005 25
32 0.25 1e-04 0.005 25
64 0.25 1e-04 0.005 25
16 0.50 1e-04 0.005 25
32 0.50 1e-04 0.005 25
64 0.50 1e-04 0.005 25
16 0.10 5e-05 0.005 25
32 0.10 5e-05 0.005 25
64 0.10 5e-05 0.005 25
16 0.25 5e-05 0.005 25
32 0.25 5e-05 0.005 25
64 0.25 5e-05 0.005 25
16 0.50 5e-05 0.005 25
32 0.50 5e-05 0.005 25
64 0.50 5e-05 0.005 25
16 0.10 1e-04 0.010 25
32 0.10 1e-04 0.010 25
64 0.10 1e-04 0.010 25
16 0.25 1e-04 0.010 25
32 0.25 1e-04 0.010 25
64 0.25 1e-04 0.010 25
16 0.50 1e-04 0.010 25
32 0.50 1e-04 0.010 25
64 0.50 1e-04 0.010 25
16 0.10 5e-05 0.010 25
32 0.10 5e-05 0.010 25
64 0.10 5e-05 0.010 25
16 0.25 5e-05 0.010 25
32 0.25 5e-05 0.010 25
64 0.25 5e-05 0.010 25
16 0.50 5e-05 0.010 25
32 0.50 5e-05 0.010 25
64 0.50 5e-05 0.010 25

The hyper_grid dataframe has 36 rows, or scenarios to test. What I want to do now is define a loop that fits a model for each of the scenarios defined above. I also want to test out a few more hidden layers. Before I do that, I want to define a few things. First, I am going to create a function that defines a three-hidden-layer model and a function that fits and evaluates the model. I will also create a tibble to store the results. First, let's look at what the model build function looks like. Putting this into a function is not necessary, but I prefer it because it makes the loop code a lot cleaner and easier to understand.

model_build <- function(hyper_grid_i, input_shape) {
  model <- keras_model_sequential()
  
  model %>% 
    layer_dense(
      units = hyper_grid_i$units,
      activation = 'relu',
      kernel_initializer = 'normal',
      kernel_regularizer = regularizer_l2(l = hyper_grid_i$l2),
      input_shape = input_shape
    ) %>% 
    layer_dropout(hyper_grid_i$drop_rate) %>% 
    layer_dense(
      units = hyper_grid_i$units,
      activation = 'relu',
      kernel_initializer = 'normal',
      kernel_regularizer = regularizer_l2(l = hyper_grid_i$l2)
    ) %>% 
    layer_dropout(hyper_grid_i$drop_rate) %>% 
    layer_dense(
      units = hyper_grid_i$units,
      activation = 'relu',
      kernel_initializer = 'normal',
      kernel_regularizer = regularizer_l2(l = hyper_grid_i$l2)
    ) %>% 
    layer_dropout(hyper_grid_i$drop_rate) %>% 
    layer_dense(
      units = 1,
      activation = 'sigmoid',
      kernel_initializer = 'normal',
      kernel_regularizer = NULL
    ) %>% 
    compile(
      optimizer = optimizer_adam(lr = hyper_grid_i$adam_lr),
      loss = 'binary_crossentropy',
      metrics = 'accuracy'
    )
  
  return(model)
  
}

Now that we have a function that returns an untrained model object, we are going to make a function that trains the model, predicts wins or losses for the test set, and calculates the accuracy.

fitNeval <- function(model, x_train, y_train, x_test, y_test, epochs_i) {
  fit(
      object = model,
      x = x_train,
      y = y_train,
      batch_size = 100,
      epochs = epochs_i,
      verbose = 0
  )
  
  pred_probs <- predict_proba(object = model, x = x_test) %>% 
      as.vector() %>% 
      round(2)
  
  estimate <- as.factor(ifelse(pred_probs > 0.5, 1, 0))
  
  
  if (length(unique(estimate)) == 2) {
    accuracy <- tibble(y_test, estimate) %>% 
    yardstick::accuracy(truth = as.factor(y_test), estimate = estimate) %>% 
    magrittr::extract2(3)
  } else {
    accuracy <- 0
  }
  
  return(accuracy)
}

Now to put this all together, we use these functions to build and evaluate a model for each combination of hyper-parameter settings.

results <- tibble(i = integer(), accuracy = double())  

for (i in 1:nrow(hyper_grid)) {
  
  tmp_model <- model_build(hyper_grid[i,], ncol(x_train))
  
    tmp_accuracy <- fitNeval(
        tmp_model, 
        x_train = x_train, 
        x_test = x_test, 
        y_train = y_train, 
        y_test = y_test,
        epochs_i = hyper_grid$epochs[i]
    )
  
  
    results <- bind_rows(results, tibble(i = i, accuracy = tmp_accuracy))
}

Now we can search the results and look for patterns in the hyper-parameters, but I’m just going to take the top accuracy here. I can then take those parameters and fit our final model.

# Extracting the best-performing hyper-parameter combination
top_i <- results %>% arrange(desc(accuracy)) %>% slice(1) %>% pull(i)
top_params <- hyper_grid[top_i, ]

# Build model with optimal hyper-parameters
model <- model_build(top_params, input_shape = ncol(x_train))

# Fit the model 
history <- fit(
  object = model,
  x = x_train,
  y = y_train,
  batch_size = 100,
  epochs = 25,
  verbose = 0
)

testing <- testing %>% 
  mutate(
    T1_win_prob = predict_proba(object = model, x = x_test),
    T1_win_class = ifelse(T1_win_prob >= 0.5, 1, 0)
  ) %>% 
  select(
    T1_win_prob,
    T1_win_class,
    games.outcome_t1,
    everything()
  )

testing %>% 
    yardstick::accuracy(truth = as.factor(games.outcome_t1), estimate = as.factor(T1_win_class))

When all was said and done, we achieved around 72% accuracy. Part of this series of posts goes along with a shiny application, and I want to use this model object in the app. Luckily, keras makes this really easy!

save_model_hdf5(model, '~/ncaa_data/keras_classification.h5')
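Loading the model back in another session (for example, inside the shiny app) is just as easy. A quick sketch, assuming the same file path:

# Sketch: reload the saved model wherever it is needed
model <- load_model_hdf5('~/ncaa_data/keras_classification.h5')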

And that's it! I can now take this trained model object and use it wherever I need.

Conclusion

After modeling this data, we found about 72% prediction accuracy. I am relatively pleased with this attempt. Predicting the outcome of college basketball games is a pretty complex task, and upsets happen all the time. Think about the 2018 Virginia team that was the overall number 1 seed in the tournament and ended up losing to a 16 seed, something that had never happened before! And it seems that almost every year there is at least one 12-5 upset. I do think that with a bit more feature engineering we could get better results. One thing the data does not factor in is the strength of schedule for the teams, or how they perform against 'good' teams.

Resources

First and foremost, Deep Learning with R by François Chollet and J. J. Allaire was WILDLY helpful in learning keras; I highly recommend the read if you are looking to get into keras as an R user. As for additional resources, I found these posts to be very helpful as well.

As a last thought, I had some crazy difficulties getting the Rmd to render the html file at the end of this. When developing, I ran all the code inline and everything worked great, but when it tries to knit, no dice. If you have any advice on this, please reach out; it would be greatly appreciated.

Kip Brown
Data Scientist