NCAA Basketball Analysis Part 1 - Downloading and Preparing the data for Analysis

Introduction

Well hello there world. This is my first blog post. Yay.

I am going to be starting a series of posts revolving around a Kaggle competition a couble of years back that used NCAA basketball regular season and tournament data.

The data used in this analysis comes from the kaggle competition ’ March Machine Learning Mania 2016’. This tutorial will walk through a few things to get us ready for an analysis.

  • Using bash code chunks
  • Using the Kaggle API to download data
  • Using the DBI package to connect to a SQLite database
  • Using SQL code chunks

Disclaimer: I am working on a Linux machine, specifically Elementary OS version 0.4.1 Loki. Small sidebar. I am new to Linux. I just built a computer and wanted a free OS, so I decided to try out a Linux OS, and I have loved using Elementary. Here is the link to there site if you are interestedd https://elementary.io/. Okay, back to business. I am also new to using the terminal, so don’t crush me if I am doing something inefficiently. Also, I am not going to get into the weeds discussing how to configure all of the dependencies for the Kaggle API, I am just going to assume you can figure it out on your own!

Downloading the Data set

As mentioned above, the data comes from a Kaggle competition. Here is the link https://www.kaggle.com/c/march-machine-learning-mania-2016/data. Before downloading the data, lets make a directory. We could do this using R, but this is a fun opportunity to utilize R Markdowns bash code chunks.

mkdir ~/ncaa_data

Sidebar, I hate naming things. Anyways, lets move to that new directory we created, and use the Kaggle API to download some data.

cd ~/ncaa_data

# Downloading Data
kaggle competitions download -c march-machine-learning-mania-2016

# Unzipping File
unzip march-machine-learning-mania-2016-v2.zip

Now the data is in a sub-directory of the ncaa_data directory that we created. There are csv files, but we are going to connect to the sqlite database provided instead. The csv files are just copies of what is in the sqlite database.

cd ~/ncaa_data/march-machine-learning-mania-2016-v2/

# Copying the database to our directory
cp database.sqlite ~/ncaa_data

# Deleting everything else
cd ~/ncaa_data
rm -rf march-machine-learning-mania-2016-v2.zip
rm -rf march-machine-learning-mania-2016-v2
rm -rf __MACOSX/

Now that we have the database ready to go, lets take a peek inside!

Exploring the Database

Now the the data is on our machine, we can dig into the database.

base_dir <- '~/ncaa_data'
db_connection <- dbConnect(
      drv = RSQLite::SQLite(), 
      dbname = file.path(base_dir, 'database.sqlite')
)

dbListTables(conn = db_connection) %>% 
      as_tibble() %>% 
      rename(Tables = value) %>% 
      kable('html') %>% 
      kable_styling(
            bootstrap_options = c("striped", "hover")
      )
Tables
RegularSeasonCompactResults
RegularSeasonDetailedResults
SampleSubmission
Seasons
Teams
TourneyCompactResults
TourneyDetailedResults
TourneySeeds
TourneySlots

Here we can see the 9 tables that come with the database. A deeper mdescription of the data can be found at https://www.kaggle.com/c/march-machine-learning-mania-2016/data. We aren’t going to need all of these tables. Lets look at a sample from a few of the tables to decide which ones I want.

Using R markdown SQL Chunks

All of this can be done using the DBI package, I just think the SQL code chunks are a super cool feature, so I like to use them whenever I get a chance. They are pretty straight forward. All you have to do is provide the connection object we defined above, and then write a SQL query.

First, I want to understand the difference between RegularSeasonCompactResults and RegularSeasonDetailedResults.

SELECT
      *
FROM
      RegularSeasonCompactResults
LIMIT 
      10
Table 1: Displaying records 1 - 10
Season Daynum Wteam Wscore Lteam Lscore Wloc Numot
1985 20 1228 81 1328 64 N 0
1985 25 1106 77 1354 70 H 0
1985 25 1112 63 1223 56 H 0
1985 25 1165 70 1432 54 H 0
1985 25 1192 86 1447 74 H 0
1985 25 1218 79 1337 78 H 0
1985 25 1228 64 1226 44 N 0
1985 25 1242 58 1268 56 N 0
1985 25 1260 98 1133 80 H 0
1985 25 1305 97 1424 89 H 0

You can also save the the output of a query, and use it in an R code chunk.

SELECT
      *
FROM
      RegularSeasonDetailedResults
LIMIT 
      10
sample %>% 
      kable('html') %>% 
      kable_styling(
            bootstrap_options = c("striped", "hover")
            ) %>% 
      scroll_box(width = "100%")
Season Daynum Wteam Wscore Lteam Lscore Wloc Numot Wfgm Wfga Wfgm3 Wfga3 Wftm Wfta Wor Wdr Wast Wto Wstl Wblk Wpf Lfgm Lfga Lfgm3 Lfga3 Lftm Lfta Lor Ldr Last Lto Lstl Lblk Lpf
2003 10 1104 68 1328 62 N 0 27 58 3 14 11 18 14 24 13 23 7 1 22 22 53 2 10 16 22 10 22 8 18 9 2 20
2003 10 1272 70 1393 63 N 0 26 62 8 20 10 19 15 28 16 13 4 4 18 24 67 6 24 9 20 20 25 7 12 8 6 16
2003 11 1266 73 1437 61 N 0 24 58 8 18 17 29 17 26 15 10 5 2 25 22 73 3 26 14 23 31 22 9 12 2 5 23
2003 11 1296 56 1457 50 N 0 18 38 3 9 17 31 6 19 11 12 14 2 18 18 49 6 22 8 15 17 20 9 19 4 3 23
2003 11 1400 77 1208 71 N 0 30 61 6 14 11 13 17 22 12 14 4 4 20 24 62 6 16 17 27 21 15 12 10 7 1 14
2003 11 1458 81 1186 55 H 0 26 57 6 12 23 27 12 24 12 9 9 3 18 20 46 3 11 12 17 6 22 8 19 4 3 25
2003 12 1161 80 1236 62 H 0 23 55 2 8 32 39 13 18 14 17 11 1 25 19 41 4 15 20 28 9 21 11 30 10 4 28
2003 12 1186 75 1457 61 N 0 28 62 4 14 15 21 13 35 19 19 7 2 21 20 59 4 17 17 23 8 25 10 15 14 8 18
2003 12 1194 71 1156 66 N 0 28 58 5 11 10 18 9 22 9 17 9 2 23 24 52 6 18 12 27 13 26 13 25 8 2 18
2003 12 1458 84 1296 56 H 0 32 67 5 17 15 19 14 22 11 6 12 0 13 23 52 3 14 7 12 9 23 10 18 1 3 18

Obviously the RegularSeasonCompactResults has WAY less information, so I am going to ditch that table, as well as a few others. Before doing that, I went ahead and joined the season and team details from two of the other tables.

SELECT
      wt.Team_Name AS winner,
      lt.Team_Name AS Loser,
      rs.Season,
      S.Dayzero,
      rs.Daynum,
      rs.Wscore,
      rs.Lscore,
      rs.Wloc,
      rs.Numot,
      rs.Wblk,
      rs.Wfga,
      rs.Wfga3,
      rs.Wfgm,
      rs.Wfgm3,
      rs.Wfta,
      rs.Wftm,
      rs.Wdr,
      rs.Wor,
      rs.Wast,
      rs.Wto,
      rs.Wpf,
      rs.Lblk,
      rs.Ldr,
      rs.Lor,
      rs.Lfga,
      rs.Lfga3,
      rs.Lfgm,
      rs.Lfgm3,
      rs.Lfta,
      rs.Lftm,
      rs.Last,
      rs.Lto,
      rs.Lpf
FROM
      RegularSeasonDetailedResults rs
      LEFT JOIN Seasons s ON s.Season = rs.Season
      LEFT JOIN Teams wt ON wt.Team_Id = rs.Wteam
      LEFT JOIN Teams lt ON lt.Team_Id = rs.Lteam

Now for post season…

SELECT
      wt.Team_Name AS winner,
      lt.Team_Name AS Loser,
      ps.Season,
      S.Dayzero,
      ps.Daynum,
      ps.Wscore,
      ps.Lscore,
      ps.Wloc,
      ps.Numot,
      ps.Wblk,
      ps.Wfga,
      ps.Wfga3,
      ps.Wfgm,
      ps.Wfgm3,
      ps.Wfta,
      ps.Wftm,
      ps.Wdr,
      ps.Wor,
      ps.Wast,
      ps.Wto,
      ps.Wpf,
      ps.Lblk,
      ps.Ldr,
      ps.Lor,
      ps.Lfga,
      ps.Lfga3,
      ps.Lfgm,
      ps.Lfgm3,
      ps.Lfta,
      ps.Lftm,
      ps.Last,
      ps.Lto,
      ps.Lpf
FROM
      TourneyDetailedResults ps
      LEFT JOIN Seasons s ON s.Season = ps.Season
      LEFT JOIN Teams wt ON wt.Team_Id = ps.Wteam
      LEFT JOIN Teams lt ON lt.Team_Id = ps.Lteam

Now that I have the data queried into the format that I want, I can put it together into a single data set.

# Labelling Source Location
regular_season$type = 'regular'
post_season$type = 'post'

# Binding data into single dataset
games <- bind_rows(regular_season, post_season)

I am a bit of a neat freak, so now I am going to write my new table and delete all of the redundent tables with a few functions from the DBI package.

dbWriteTable(
      conn = db_connection,
      name = 'games',
      value = games
)

dbRemoveTable(
      conn = db_connection,
      name = 'RegularSeasonCompactResults'
)
dbRemoveTable(
      conn = db_connection,
      name = 'RegularSeasonDetailedResults'
)
dbRemoveTable(
      conn = db_connection,
      name = 'TourneyDetailedResults'
)
dbRemoveTable(
      conn = db_connection,
      name = 'TourneyCompactResults'
)
dbRemoveTable(
      conn = db_connection,
      name = 'SampleSubmission'
)
dbDisconnect(db_connection)

Reminder: Remember to disconnect from the database. I forget pretty much every time…

Simple as that! We have a directory that holds the data that we downloaded, and cleaded up the database a little bit. Next step, some exploratory data analysis. Coming soon! Thanks for reading.

Extra Help

I refer to the tutorials and references below a lot when I do this type of work. If you are stuck on anything I did, I would refer to some of these down here, they were super helpful for me when I was learning.

SQL and Bash Code Chunks DBI Functions Kaggle API

Avatar
Kip Brown
Data Scientist