Introduction
Well hello there world. This is my first blog post. Yay.
I am going to be starting a series of posts revolving around a Kaggle competition from a couple of years back that used NCAA basketball regular season and tournament data.
The data used in this analysis comes from the Kaggle competition ‘March Machine Learning Mania 2016’. This tutorial will walk through a few things to get us ready for an analysis:
- Using bash code chunks
- Using the Kaggle API to download data
- Using the DBI package to connect to a SQLite database
- Using SQL code chunks
Disclaimer: I am working on a Linux machine, specifically Elementary OS version 0.4.1 Loki. Small sidebar: I am new to Linux. I just built a computer and wanted a free OS, so I decided to try out a Linux OS, and I have loved using Elementary. Here is the link to their site if you are interested: https://elementary.io/. Okay, back to business. I am also new to using the terminal, so don’t crush me if I am doing something inefficiently. Also, I am not going to get into the weeds discussing how to configure all of the dependencies for the Kaggle API; I am just going to assume you can figure it out on your own!
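That said, the one setup step most people trip on is the API token, so here is a quick sketch (assuming you have created and downloaded a `kaggle.json` token from your Kaggle account page; the download path is just wherever your browser saves files):

```shell
# The Kaggle CLI expects an API token at ~/.kaggle/kaggle.json
mkdir -p ~/.kaggle
# copy in the token you downloaded from your Kaggle account page, e.g.:
# cp ~/Downloads/kaggle.json ~/.kaggle/
chmod 700 ~/.kaggle
```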
Downloading the Data set
As mentioned above, the data comes from a Kaggle competition. Here is the link: https://www.kaggle.com/c/march-machine-learning-mania-2016/data. Before downloading the data, let’s make a directory. We could do this using R, but this is a fun opportunity to utilize R Markdown’s bash code chunks.
mkdir ~/ncaa_data
Sidebar: I hate naming things. Anyways, let’s move to that new directory we created, and use the Kaggle API to download some data.
cd ~/ncaa_data
# Downloading Data
kaggle competitions download -c march-machine-learning-mania-2016
# Unzipping File
unzip march-machine-learning-mania-2016-v2.zip
Now the data is in a sub-directory of the ncaa_data directory that we created. There are CSV files, but we are going to connect to the provided SQLite database instead; the CSV files are just copies of what is in the database.
cd ~/ncaa_data/march-machine-learning-mania-2016-v2/
# Copying the database to our directory
cp database.sqlite ~/ncaa_data
# Deleting everything else
cd ~/ncaa_data
rm -rf march-machine-learning-mania-2016-v2.zip
rm -rf march-machine-learning-mania-2016-v2
rm -rf __MACOSX/
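One gotcha worth flagging: knitr runs each bash chunk in its own shell, so a `cd` in one chunk does not carry over to the next — which is why the chunks above each start with their own `cd`. A tiny illustration (the `/tmp/ncaa_demo` path is just a throwaway example):

```shell
# Each bash chunk starts in a fresh shell, so chain dependent commands
# inside a single chunk instead of relying on a `cd` from a previous one
mkdir -p /tmp/ncaa_demo && cd /tmp/ncaa_demo && touch demo.txt
# demo.txt ends up in /tmp/ncaa_demo, not wherever an earlier chunk left off
```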
Now that we have the database ready to go, let’s take a peek inside!
Exploring the Database
Now that the data is on our machine, we can dig into the database.
library(DBI)
library(RSQLite)
library(dplyr)
library(knitr)
library(kableExtra)

base_dir <- '~/ncaa_data'

# Connect to the SQLite database we copied over
db_connection <- dbConnect(
  drv = RSQLite::SQLite(),
  dbname = file.path(base_dir, 'database.sqlite')
)

# List the tables and display them as a styled HTML table
dbListTables(conn = db_connection) %>%
  as_tibble() %>%
  rename(Tables = value) %>%
  kable('html') %>%
  kable_styling(
    bootstrap_options = c("striped", "hover")
  )
Tables |
---|
RegularSeasonCompactResults |
RegularSeasonDetailedResults |
SampleSubmission |
Seasons |
Teams |
TourneyCompactResults |
TourneyDetailedResults |
TourneySeeds |
TourneySlots |
Here we can see the 9 tables that come with the database. A deeper description of the data can be found at https://www.kaggle.com/c/march-machine-learning-mania-2016/data. We aren’t going to need all of these tables, so let’s look at a sample from a few of them to decide which ones I want.
Using R Markdown SQL Chunks
All of this can be done using the DBI package; I just think the SQL code chunks are a super cool feature, so I like to use them whenever I get the chance. They are pretty straightforward: all you have to do is provide the connection object we defined above, and then write a SQL query.
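For reference, a minimal SQL chunk just names the connection object in the chunk header (the query here is only an illustration):

````markdown
```{sql, connection=db_connection}
SELECT * FROM Teams LIMIT 5;
```
````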
First, I want to understand the difference between RegularSeasonCompactResults and RegularSeasonDetailedResults.
SELECT
*
FROM
RegularSeasonCompactResults
LIMIT
10
Season | Daynum | Wteam | Wscore | Lteam | Lscore | Wloc | Numot |
---|---|---|---|---|---|---|---|
1985 | 20 | 1228 | 81 | 1328 | 64 | N | 0 |
1985 | 25 | 1106 | 77 | 1354 | 70 | H | 0 |
1985 | 25 | 1112 | 63 | 1223 | 56 | H | 0 |
1985 | 25 | 1165 | 70 | 1432 | 54 | H | 0 |
1985 | 25 | 1192 | 86 | 1447 | 74 | H | 0 |
1985 | 25 | 1218 | 79 | 1337 | 78 | H | 0 |
1985 | 25 | 1228 | 64 | 1226 | 44 | N | 0 |
1985 | 25 | 1242 | 58 | 1268 | 56 | N | 0 |
1985 | 25 | 1260 | 98 | 1133 | 80 | H | 0 |
1985 | 25 | 1305 | 97 | 1424 | 89 | H | 0 |
You can also save the output of a query and use it in an R code chunk.
SELECT
*
FROM
RegularSeasonDetailedResults
LIMIT
10
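The query above lands in an R object thanks to the `output.var` chunk option, which is how the `sample` data frame used in the next chunk gets its name. The chunk header looks roughly like this:

````markdown
```{sql, connection=db_connection, output.var="sample"}
SELECT * FROM RegularSeasonDetailedResults LIMIT 10;
```
````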
sample %>%
kable('html') %>%
kable_styling(
bootstrap_options = c("striped", "hover")
) %>%
scroll_box(width = "100%")
Season | Daynum | Wteam | Wscore | Lteam | Lscore | Wloc | Numot | Wfgm | Wfga | Wfgm3 | Wfga3 | Wftm | Wfta | Wor | Wdr | Wast | Wto | Wstl | Wblk | Wpf | Lfgm | Lfga | Lfgm3 | Lfga3 | Lftm | Lfta | Lor | Ldr | Last | Lto | Lstl | Lblk | Lpf |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2003 | 10 | 1104 | 68 | 1328 | 62 | N | 0 | 27 | 58 | 3 | 14 | 11 | 18 | 14 | 24 | 13 | 23 | 7 | 1 | 22 | 22 | 53 | 2 | 10 | 16 | 22 | 10 | 22 | 8 | 18 | 9 | 2 | 20 |
2003 | 10 | 1272 | 70 | 1393 | 63 | N | 0 | 26 | 62 | 8 | 20 | 10 | 19 | 15 | 28 | 16 | 13 | 4 | 4 | 18 | 24 | 67 | 6 | 24 | 9 | 20 | 20 | 25 | 7 | 12 | 8 | 6 | 16 |
2003 | 11 | 1266 | 73 | 1437 | 61 | N | 0 | 24 | 58 | 8 | 18 | 17 | 29 | 17 | 26 | 15 | 10 | 5 | 2 | 25 | 22 | 73 | 3 | 26 | 14 | 23 | 31 | 22 | 9 | 12 | 2 | 5 | 23 |
2003 | 11 | 1296 | 56 | 1457 | 50 | N | 0 | 18 | 38 | 3 | 9 | 17 | 31 | 6 | 19 | 11 | 12 | 14 | 2 | 18 | 18 | 49 | 6 | 22 | 8 | 15 | 17 | 20 | 9 | 19 | 4 | 3 | 23 |
2003 | 11 | 1400 | 77 | 1208 | 71 | N | 0 | 30 | 61 | 6 | 14 | 11 | 13 | 17 | 22 | 12 | 14 | 4 | 4 | 20 | 24 | 62 | 6 | 16 | 17 | 27 | 21 | 15 | 12 | 10 | 7 | 1 | 14 |
2003 | 11 | 1458 | 81 | 1186 | 55 | H | 0 | 26 | 57 | 6 | 12 | 23 | 27 | 12 | 24 | 12 | 9 | 9 | 3 | 18 | 20 | 46 | 3 | 11 | 12 | 17 | 6 | 22 | 8 | 19 | 4 | 3 | 25 |
2003 | 12 | 1161 | 80 | 1236 | 62 | H | 0 | 23 | 55 | 2 | 8 | 32 | 39 | 13 | 18 | 14 | 17 | 11 | 1 | 25 | 19 | 41 | 4 | 15 | 20 | 28 | 9 | 21 | 11 | 30 | 10 | 4 | 28 |
2003 | 12 | 1186 | 75 | 1457 | 61 | N | 0 | 28 | 62 | 4 | 14 | 15 | 21 | 13 | 35 | 19 | 19 | 7 | 2 | 21 | 20 | 59 | 4 | 17 | 17 | 23 | 8 | 25 | 10 | 15 | 14 | 8 | 18 |
2003 | 12 | 1194 | 71 | 1156 | 66 | N | 0 | 28 | 58 | 5 | 11 | 10 | 18 | 9 | 22 | 9 | 17 | 9 | 2 | 23 | 24 | 52 | 6 | 18 | 12 | 27 | 13 | 26 | 13 | 25 | 8 | 2 | 18 |
2003 | 12 | 1458 | 84 | 1296 | 56 | H | 0 | 32 | 67 | 5 | 17 | 15 | 19 | 14 | 22 | 11 | 6 | 12 | 0 | 13 | 23 | 52 | 3 | 14 | 7 | 12 | 9 | 23 | 10 | 18 | 1 | 3 | 18 |
Obviously the RegularSeasonCompactResults table has WAY less information, so I am going to ditch it, as well as a few others. Before doing that, I went ahead and joined in the season and team details from two of the other tables.
SELECT
wt.Team_Name AS winner,
lt.Team_Name AS Loser,
rs.Season,
s.Dayzero,
rs.Daynum,
rs.Wscore,
rs.Lscore,
rs.Wloc,
rs.Numot,
rs.Wblk,
rs.Wfga,
rs.Wfga3,
rs.Wfgm,
rs.Wfgm3,
rs.Wfta,
rs.Wftm,
rs.Wdr,
rs.Wor,
rs.Wast,
rs.Wto,
rs.Wpf,
rs.Lblk,
rs.Ldr,
rs.Lor,
rs.Lfga,
rs.Lfga3,
rs.Lfgm,
rs.Lfgm3,
rs.Lfta,
rs.Lftm,
rs.Last,
rs.Lto,
rs.Lpf
FROM
RegularSeasonDetailedResults rs
LEFT JOIN Seasons s ON s.Season = rs.Season
LEFT JOIN Teams wt ON wt.Team_Id = rs.Wteam
LEFT JOIN Teams lt ON lt.Team_Id = rs.Lteam
Now for post season…
SELECT
wt.Team_Name AS winner,
lt.Team_Name AS Loser,
ps.Season,
s.Dayzero,
ps.Daynum,
ps.Wscore,
ps.Lscore,
ps.Wloc,
ps.Numot,
ps.Wblk,
ps.Wfga,
ps.Wfga3,
ps.Wfgm,
ps.Wfgm3,
ps.Wfta,
ps.Wftm,
ps.Wdr,
ps.Wor,
ps.Wast,
ps.Wto,
ps.Wpf,
ps.Lblk,
ps.Ldr,
ps.Lor,
ps.Lfga,
ps.Lfga3,
ps.Lfgm,
ps.Lfgm3,
ps.Lfta,
ps.Lftm,
ps.Last,
ps.Lto,
ps.Lpf
FROM
TourneyDetailedResults ps
LEFT JOIN Seasons s ON s.Season = ps.Season
LEFT JOIN Teams wt ON wt.Team_Id = ps.Wteam
LEFT JOIN Teams lt ON lt.Team_Id = ps.Lteam
Now that I have the data queried into the format that I want, I can put it together into a single data set.
# Labelling source location (regular_season and post_season hold the
# saved results of the two queries above)
regular_season$type <- 'regular'
post_season$type <- 'post'

# Binding data into a single data set
games <- bind_rows(regular_season, post_season)
I am a bit of a neat freak, so now I am going to write my new table and delete all of the redundant tables with a few functions from the DBI package.
dbWriteTable(
  conn = db_connection,
  name = 'games',
  value = games,
  overwrite = TRUE  # avoids an error if 'games' already exists on a re-knit
)
dbRemoveTable(
conn = db_connection,
name = 'RegularSeasonCompactResults'
)
dbRemoveTable(
conn = db_connection,
name = 'RegularSeasonDetailedResults'
)
dbRemoveTable(
conn = db_connection,
name = 'TourneyDetailedResults'
)
dbRemoveTable(
conn = db_connection,
name = 'TourneyCompactResults'
)
dbRemoveTable(
conn = db_connection,
name = 'SampleSubmission'
)
dbDisconnect(db_connection)
Reminder: Remember to disconnect from the database. I forget pretty much every time…
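If you forget as often as I do, one option is to wrap the connection in a small helper so the disconnect happens automatically. Here is a sketch (the `with_db` name is my own, not part of DBI):

```r
library(DBI)

# Open a connection, run a function on it, and always disconnect,
# even if that function throws an error
with_db <- function(db_path, f) {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)
  f(con)
}

# e.g. with_db(file.path(base_dir, 'database.sqlite'), dbListTables)
```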
Simple as that! We have a directory that holds the data we downloaded, and we cleaned up the database a little bit. Next step: some exploratory data analysis. Coming soon! Thanks for reading.
Extra Help
I refer to the tutorials and references below a lot when I do this type of work. If you are stuck on anything I did, check some of these out; they were super helpful for me when I was learning.