NCAA Basketball Analysis Part 7 - Regression and Classification with XGBoost

Introduction

So up to this point, we pulled down some college basketball data, feature engineered the heck out of it, and used keras to predict wins/losses as well as the point differential. In this post, I will show the implementation of the xgboost algorithm for both a regression and a classification task, the same tasks that we tackled in the last two posts. Here are links to the other posts in this series if you need a reference.

Similar to neural networks, xgboost is a bit mathematically complex. Fortunately, the creators of xgboost have some really good documentation and examples out there, which can be found here.

Data Preparation

I am going to be using the same training and testing set that we used in the keras examples, which we have stored in our database.

# Packages used throughout this post
library(DBI)
library(RSQLite)
library(dplyr)
library(magrittr)
library(xgboost)
library(yardstick)
library(knitr)
library(kableExtra)

base_dir <- '~/ncaa_data'

# Connect to Database
db_connection <- dbConnect(
  drv = RSQLite::SQLite(), 
  dbname = file.path(base_dir, 'database.sqlite')
)

# Query Training and Testing Set
training <- dbGetQuery(
  conn = db_connection,
  statement = 'SELECT * FROM training'
)
testing <- dbGetQuery(
  conn = db_connection,
  statement = 'SELECT * FROM testing'
)

# Disconnect From the Database
dbDisconnect(db_connection)

Similar to keras, xgboost takes the input data and the labels separately. Here is what we did in the last two posts.

# Input Data
x_train <- training %>% 
  select(-games.outcome_t1, -games.scorediff_t1) %>% 
  as.matrix()

x_test <- testing %>% 
  select(-games.outcome_t1, -games.scorediff_t1) %>% 
  as.matrix()

# Score Differential Labels
y_train_scorediff <- training %>% 
  select(games.scorediff_t1) %>% 
  magrittr::extract2(1)

y_test_scorediff <- testing %>% 
  select(games.scorediff_t1) %>% 
  magrittr::extract2(1)

# Game Outcome Labels
y_train_outcome <- training %>% 
  select(games.outcome_t1) %>% 
  magrittr::extract2(1)

y_test_outcome <- testing %>% 
  select(games.outcome_t1) %>% 
  magrittr::extract2(1)

There is one more step we need in order to pass xgboost the data. The xgb.DMatrix function wants x passed as a matrix and the targets y as a vector. We want to create these objects for both the training and testing sets.

train_scorediff <- xgb.DMatrix(
  as.matrix(x_train), 
  label = y_train_scorediff
)

test_scorediff <- xgb.DMatrix(
  as.matrix(x_test),
  label = y_test_scorediff
)

train_outcome <- xgb.DMatrix(
  as.matrix(x_train),
  label = y_train_outcome
)

test_outcome <- xgb.DMatrix(
  as.matrix(x_test),
  label = y_test_outcome
)
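
If you want to double check these objects, xgboost has dim and getinfo methods for an xgb.DMatrix, so an optional quick peek like this should confirm that the dimensions and labels came through as expected:

dim(train_outcome)                     # number of rows and columns passed to xgboost
head(getinfo(train_outcome, "label"))  # first few outcome labels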

Building an XGBoost Model

Similar to the keras posts, I am going to do a grid search in an attempt to optimize the hyperparameters. To do this, we need to have a good idea of what parameters we can feed to the model. There are quite a few things you can tweak with xgboost. The documentation provided is awesome, so I suggest reading through it if you want to start using xgboost. I will give a brief description of the parameters I will be using.

  • booster - This indicates which booster you want to use. For this example, I am going to use gbtree, but there are a few others you can use.
  • objective - This is where you indicate whether we are doing a regression task or a classification task.
  • eta - This is the learning rate. It is very similar to the learning rate we set for the adam optimizer in keras, and it is bounded between 0 and 1.
  • gamma - This controls the minimum loss reduction required to create another partition. It has a range of 0 to infinity.
  • max_depth - This limits the maximum depth of the tree. It has a range of 0 to infinity, but the deeper the tree, the higher the potential for overfitting.

# Setting Parameter List for models
class_params <- expand.grid(
    booster = "gbtree", 
    objective = "binary:logistic", 
    eta = c(0.1, 0.3), 
    gamma = c(0, 1, 5), 
    max_depth = c(5, 10, 15) 
)

reg_params <- expand.grid(
    booster = "gbtree", 
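    # note: newer xgboost releases deprecate "reg:linear" in favor of "reg:squarederror"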
    objective = "reg:linear", 
    eta = c(0.1, 0.3), 
    gamma = c(0, 1, 5), 
    max_depth = c(5, 10, 15) 
)
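
Each grid works out to 2 x 3 x 3 = 18 parameter combinations per task, which nrow will confirm:

nrow(class_params)
nrow(reg_params)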

One of the super nice things about the xgboost package is the cross validation function it comes with. This will help me estimate the number of boosting rounds we should allow the model to run. I want to run the cross validation for every row of the grid search and save the results.

# Initializing Results Storage
results_class <- tibble(
  i = integer(), 
  iter = double(), 
  test_error_mean = double()
)

# Looping through each parameter combination
for (i in 1:nrow(class_params)) {
  set.seed(1224)
  
  # Fit Model
  tmp_cv_booking <- xgb.cv(
      params = list(
        booster = class_params$booster[i],
        objective = class_params$objective[i],
        eta = class_params$eta[i],
        gamma = class_params$gamma[i],
        max_depth = class_params$max_depth[i]
      ),
      data = train_outcome,
      nrounds = 100,
      nfold = 8,
      showsd = FALSE,
      verbose = FALSE,
      stratified = TRUE,
      print_every_n = 10,
      early_stopping_rounds = 15,
      maximize = FALSE
  )
  
  # Best Stopping Point
  tmp_best_iter <- tmp_cv_booking$best_iteration
  
  # Results at that stopping point (columns 1 and 4 are iter and test_error_mean)
  results_i <- tmp_cv_booking$evaluation_log[tmp_best_iter, c(1, 4)]
  results_i$i <- i
  
  # Binding results to result storage
  results_class <- bind_rows(results_class, results_i)
  rm(tmp_cv_booking)
}

# Extracting Best Row from Grid
top_class <- results_class %>% 
  arrange(test_error_mean) %>% 
  dplyr::slice(1) 
best_class_params <- class_params[top_class$i,]
best_class_params$stop <- top_class$iter

The code above searches the grid of parameters, extracts the best parameters in terms of error, and stores the optimal stopping point for that configuration. I want to do the same thing for regression.

results_reg <- tibble(
  i = integer(), 
  iter = double(), 
  test_rmse_mean = double()
)

for (i in 1:nrow(reg_params)) {
  set.seed(1224)
  tmp_cv_booking <- xgb.cv(
      params = list(
        booster = reg_params$booster[i],
        objective = reg_params$objective[i],
        eta = reg_params$eta[i],
        gamma = reg_params$gamma[i],
        max_depth = reg_params$max_depth[i]
      ),
      data = train_scorediff,
      nrounds = 100,
      nfold = 8,
      showsd = FALSE,
      verbose = FALSE,
      stratified = TRUE,
      print_every_n = 10,
      early_stopping_rounds = 15,
      maximize = FALSE
  )
  
  # Best stopping point and the results at that iteration (columns 1 and 4 are iter and test_rmse_mean)
  tmp_best_iter <- tmp_cv_booking$best_iteration
  results_i <- tmp_cv_booking$evaluation_log[tmp_best_iter, c(1, 4)]
  results_i$i <- i
  
  results_reg <- bind_rows(results_reg, results_i)
  rm(tmp_cv_booking)
}

# Extracting Best Row from Grid
top_reg <- results_reg %>% 
  arrange(test_rmse_mean) %>% 
  dplyr::slice(1) 
best_reg_params <- reg_params[top_reg$i,]
best_reg_params$stop <- top_reg$iter

Now we want to fit the models with the parameters found above, using the full training set.

xgb_class <- xgb.train(
  params = list(
    booster = as.character(best_class_params$booster),
    objective = best_class_params$objective,
    eta = best_class_params$eta,
    gamma = best_class_params$gamma,
    max_depth = best_class_params$max_depth
  ),
  data = train_outcome,
  nrounds = best_class_params$stop,
  maximize = FALSE,
  verbose = FALSE
)

xgb_reg <- xgb.train(
  params = list(
    booster = as.character(best_reg_params$booster),
    objective = best_reg_params$objective,
    eta = best_reg_params$eta,
    gamma = best_reg_params$gamma,
    max_depth = best_reg_params$max_depth
  ),
  data = train_scorediff,
  nrounds = best_reg_params$stop,
  maximize = FALSE,
  verbose = FALSE
)
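
Since I will want these models again for the shiny app mentioned at the end of this post, it is worth noting that you can save a trained booster to disk with xgb.save and read it back with xgb.load. The file names here are just examples:

xgb.save(xgb_class, file.path(base_dir, 'xgb_class.model'))
xgb.save(xgb_reg, file.path(base_dir, 'xgb_reg.model'))

# Later: xgb_class <- xgb.load(file.path(base_dir, 'xgb_class.model'))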

Model Evaluation

Now that we have trained models, I want to evaluate them the same way I evaluated the keras models, using yardstick. For the regression model, I looked at mae again, and for the classification model, accuracy. First, let's add the predicted values to the testing data set.

testing <- testing %>% 
  bind_cols(
    tibble(
      pred_prob_win = predict(xgb_class, test_outcome),
      pred_scorediff = predict(xgb_reg, test_scorediff)
    )
  ) %>% 
  mutate(
    pred_win = if_else(pred_prob_win >= 0.5, 1, 0)
  ) %>% 
  select(
    games.outcome_t1,
    pred_win,
    pred_prob_win,
    games.scorediff_t1,
    pred_scorediff,
    everything()
  )
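
Before computing metrics, it never hurts to eyeball a few rows to make sure the predictions line up the way we expect:

testing %>% 
  select(games.outcome_t1, pred_win, pred_prob_win, games.scorediff_t1, pred_scorediff) %>% 
  head()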

Okay, now we can obtain some metrics.

mae <- testing %>% 
  mae(truth = games.scorediff_t1, estimate = pred_scorediff)

accuracy <- testing %>% 
  mutate(
    games.outcome_t1 = as.factor(games.outcome_t1),
    pred_win = as.factor(pred_win)
  ) %>% 
  accuracy(truth = games.outcome_t1, estimate = pred_win)

bind_rows(mae, accuracy) %>% 
  kable('html') %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover")
  )
.metric    .estimator   .estimate
mae        standard      8.920011
accuracy   binary        0.715711

These results are very similar to those obtained with the neural networks. I do think that even the best model architectures would struggle to do much better than these numbers with the data set provided to them. I do believe that more feature engineering could improve these numbers.
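
Since the classification model also produces predicted win probabilities, one more metric worth a quick look is ROC AUC. yardstick can compute it from the probability column; note that I set the factor levels so that a win (1) is treated as the event:

testing %>% 
  mutate(games.outcome_t1 = factor(games.outcome_t1, levels = c(1, 0))) %>% 
  roc_auc(truth = games.outcome_t1, pred_prob_win)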

Cool Feature of XGBoost

One of the things I really like about using xgboost is the variable importance plot you can obtain once you have trained a model. XGBoost makes this very easy.

feature_names <- training %>% 
  select(-games.outcome_t1, -games.scorediff_t1) %>% 
  colnames()

xgb.ggplot.importance(
  xgb.importance(
    feature_names = feature_names,
    model = xgb_reg
  )
)

This plot gives us some insight into what information was important for the model to be able to make accurate predictions. Even cooler, if you use the xgb.ggplot.importance function, it will cluster your variables based on importance. Not surprisingly, the average score differentials for the two teams were the most important information for the model to have, and they form the top cluster of variables. Next was knowing whether it was a home or an away game, also not surprising. It was interesting to see that assist-to-turnover ratio, offensive rebound differential, and 3 point shooting percentage emerged in a cluster above the other variables.

Another cool feature is the tree plotting function. This one is a bit large, and I am only showing the first tree in the ensemble. You can use the trees parameter to show more.

xgb.plot.tree(
  feature_names = feature_names,
  model = xgb_reg,
  trees = 0
)

[Tree plot output: the first tree in the ensemble. The root split is on games.away_t1, and nearly all of the deeper splits are on games.scorediff_season_avg_t1 and games.scorediff_season_avg_t2.]
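
xgboost also has xgb.plot.multi.trees, which projects all of the trees in the ensemble onto a single summary diagram.
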
xgb.plot.multi.trees(
  model = xgb_reg,
  feature_names = feature_names
)

[Multi-tree plot output: the ensemble collapsed onto one diagram, with games.away_t1 / games.home_t1 dominating the root and the season-average score differential features appearing throughout the rest of the tree.]

Conclusion

At the end of this series of posts, I don’t think I quite trust these models enough to start betting on college basketball games. This was a lot of fun to work on, and definitely challenging. My takeaway from this project, and from my professional experience for that matter, is that the feature engineering side of data science is much more difficult and time consuming than the modeling side of things. As for using xgboost and keras for classification and regression tasks, I think I like xgboost a bit better, simply because it was a bit easier for me to use. From a performance perspective, we obviously saw very similar results.

To wrap up the work on this series of posts, I will be updating the shiny app that will have the clustering example baked in, as well as a tab to obtain predictions from each of the four models built in this series. As of right now, it will just be using the testing set that we used in this series, but if I come back to it in the future, I would like to have the ability to predict game outcomes on an ongoing basis. We’ll see if that ever happens…

Kip Brown
Data Scientist
