STAT 19000: Project 14 — Spring 2021

Motivation: We covered a lot this year! When dealing with data-driven projects, it is useful to explore the data and answer different questions to get a feel for it. There are always different ways to go about this. Proper preparation prevents poor performance. In this project, we are going to practice using some of the skills you’ve learned and review topics and languages in a generic way.

Context: This is the last project, and we will leave it up to you how to solve the problems presented.

Scope: Python, R, bash, UNIX, computers

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset

The following questions will use the datasets found in Scholar:

  • /class/datamine/data/disney

  • /class/datamine/data/movies_and_tv/imdb.db

  • /class/datamine/data/amazon/music.txt

  • /class/datamine/data/craigslist/vehicles.csv

  • /class/datamine/data/flights/2008.csv

Questions

Answer the questions below using the language of your choice (R, Python, bash, awk, etc.). Don’t feel limited to one language; you can use different languages to answer different questions. If you are feeling bold, you can also try answering the questions using all of the languages!

Question 1

What percentage of flights in 2008 had a delay due to the weather? Use the /class/datamine/data/flights/2008.csv dataset to answer this question.

Consider a flight to have a weather delay if WEATHER_DELAY is greater than 0.
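
As an illustration of the rule above, here is a minimal Python sketch that counts the rows with a positive weather delay. It runs on a small made-up sample rather than the real file. The "NA" handling is an assumption about how missing values are encoded, and the column name is taken from the text above, so check both against the actual header of 2008.csv.

```python
import csv
import io

# Hypothetical sample standing in for 2008.csv; the real file has many more columns.
SAMPLE = """WEATHER_DELAY
0
15
NA
3
0
"""

def percent_weather_delayed(lines, column="WEATHER_DELAY"):
    """Return the percentage of rows whose delay column is greater than 0."""
    total = 0
    delayed = 0
    for row in csv.DictReader(lines):
        total += 1
        value = row[column]
        # Assumption: missing values appear as "NA" or empty and count as no delay.
        if value not in ("", "NA") and float(value) > 0:
            delayed += 1
    return 100 * delayed / total if total else 0.0

print(percent_weather_delayed(io.StringIO(SAMPLE)))
```

To try the same idea on the real data, pass an open file handle for /class/datamine/data/flights/2008.csv instead of the sample. Note that this sketch divides by all flights, including those with a missing delay value; state which denominator you chose in your answer.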

Items to submit
  • The code used to solve the question.

  • The answer to the question.

Question 2

Which listed manufacturer has the most expensive previously owned car listed on Craigslist? Use the /class/datamine/data/craigslist/vehicles.csv dataset to answer this question. Only consider listings that have a listed price of less than $500,000 and where manufacturer information is available.
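
The two filters above (price below $500,000 and a non-missing manufacturer) can be sketched in Python as follows. The sample rows are invented, and the column names `manufacturer` and `price` are assumptions, so confirm them against the header of vehicles.csv before reusing the idea.

```python
import csv
import io

# Hypothetical rows; the column names "manufacturer" and "price" are assumed.
SAMPLE = """manufacturer,price
ford,12000
,450000
ferrari,250000
toyota,600000
honda,
"""

def priciest_manufacturer(lines, cap=500_000):
    """Return (price, manufacturer) for the most expensive listing under the cap."""
    best = None
    for row in csv.DictReader(lines):
        maker = row["manufacturer"].strip()
        price_text = row["price"].strip()
        # Skip listings with no manufacturer or no price.
        if not maker or not price_text:
            continue
        price = float(price_text)
        # Keep only listings priced strictly below the cap.
        if price >= cap:
            continue
        if best is None or price > best[0]:
            best = (price, maker)
    return best

print(priciest_manufacturer(io.StringIO(SAMPLE)))
```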

Items to submit
  • The code used to solve the question.

  • The answer to the question.

Question 3

What are the most common and least common types of title in the IMDb data? Use the /class/datamine/data/movies_and_tv/imdb.db dataset to answer this question.

Use the titles table.
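
If you would like to try SQL, one possible approach is a GROUP BY query that counts the rows for each type. The Python sketch below builds a tiny in-memory stand-in for the titles table so that it runs anywhere. The column name `type` and the sample rows are assumptions, not taken from imdb.db.

```python
import sqlite3

# Build a small in-memory table; the real titles table lives in imdb.db.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE titles (title_id TEXT, type TEXT)")  # "type" is an assumed column name
con.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [("t1", "movie"), ("t2", "movie"), ("t3", "short"),
     ("t4", "movie"), ("t5", "short"), ("t6", "tvSeries")],
)

# Count the titles of each type, most frequent first.
counts = con.execute(
    "SELECT type, COUNT(*) AS n FROM titles GROUP BY type ORDER BY n DESC"
).fetchall()
con.close()

most_common, least_common = counts[0], counts[-1]
print(most_common, least_common)
```

To run the query against the real data, connect to /class/datamine/data/movies_and_tv/imdb.db instead of ":memory:" and skip the CREATE and INSERT steps.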

Don’t know how to use SQL yet? To get this data into an R data.frame, for example:

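library(tidyverse)
# Connect to the SQLite database file
con <- DBI::dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
# tbl() returns a lazy reference; collect() pulls the rows into a local data frame
myDF <- tbl(con, "titles") %>% collect()

Items to submit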
  • The code used to solve the question.

  • The answer to the question.

Question 4

What percentage of music reviews contain the words 'hate' or 'hated', and what percentage contain the words 'love' or 'loved'? Use the /class/datamine/data/amazon/music.txt dataset to answer this question.

It may take a minute to run, depending on the tool you use.
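
One detail worth deciding up front is whether 'love' should match inside longer words such as 'gloves'. The Python sketch below assumes it should not, and uses word boundaries with case-insensitive matching. The `reviews` list is made up; how you split music.txt into individual reviews depends on the file's layout, which is not shown here.

```python
import re

# Hypothetical review texts standing in for the reviews in music.txt.
reviews = [
    "I love this album",
    "Hated the remix, loved the original",
    "Gloves not included",  # "love" inside another word should not match
    "Nothing special",
]

def percent_matching(texts, words):
    """Return the percentage of texts containing any of the words as whole words."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")\b",
        re.IGNORECASE,
    )
    hits = sum(1 for text in texts if pattern.search(text))
    return 100 * hits / len(texts) if texts else 0.0

print(percent_matching(reviews, ["hate", "hated"]))
print(percent_matching(reviews, ["love", "loved"]))
```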

Items to submit
  • The code used to solve the question.

  • The answer to the question.

Question 5

What is the best time to visit Disney? Use the data provided in /class/datamine/data/disney to answer the question.

First, you will need to determine what you will consider "time" and the criteria you will use. Some examples are listed below. Don’t feel limited by them! Be sure to explain your criteria, use the data to investigate, and determine the best time to visit! Write 1-2 sentences commenting on your findings.

  • As Splash Mountain is my favorite ride, my criterion is the smallest monthly average wait time for Splash Mountain between the years 2017 and 2019. I’m only considering these years as I expect them to be more representative. My definition of "best time" will be the "best months".

  • Consider "best times" the days of the week that have the smallest wait time on average for all rides, or for certain favorite rides.

  • Consider "best times" the season of the year where the park is open for longer hours.

  • Consider "best times" the weeks of the year with the smallest average daily high temperature.
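
To make the first example criterion concrete, here is a minimal Python sketch that averages wait times by month over a range of years and picks the month with the smallest average. The records, the timestamp format, and the idea of a single wait-time value per row are all assumptions for illustration; the files in /class/datamine/data/disney may be laid out differently.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (timestamp, wait in minutes) records for one ride.
records = [
    ("2018-01-05 10:00:00", 30),
    ("2018-01-20 14:00:00", 50),
    ("2018-07-04 12:00:00", 90),
    ("2019-07-10 12:00:00", 70),
    ("2016-09-01 11:00:00", 5),  # outside 2017 to 2019, so it is ignored
]

def best_month(rows, first_year=2017, last_year=2019):
    """Return (month with the smallest average wait, all monthly averages)."""
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of waits, number of rows]
    for stamp, wait in rows:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if first_year <= when.year <= last_year:
            totals[when.month][0] += wait
            totals[when.month][1] += 1
    averages = {month: s / n for month, (s, n) in totals.items()}
    return min(averages, key=averages.get), averages

month, averages = best_month(records)
print(month, averages)
```

The same grouping pattern works for the other criteria: swap `when.month` for the day of the week or the week of the year, and swap the wait time for whichever measurement your criterion uses.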

Items to submit
  • The code used to solve the question.

  • 1-2 sentences detailing the criteria you are going to use, their logic, and your definition of "best time".

  • The answer to the question.

  • 1-2 sentences commenting on your answer.

Question 6

Finally, use RMarkdown (and its formatting) to outline 3 things you learned this semester from The Data Mine. For each thing you learned, give a mini demonstration where you highlight with text and code the thing you learned, and why you think it is useful. If you did not learn anything this semester from The Data Mine, write about 3 things you want to learn. Provide examples that demonstrate what you want to learn and write about why it would be useful.

Make sure your answer to this question is formatted well and makes use of RMarkdown.

Items to submit
  • 3 clearly labeled things you learned.

  • 3 mini-demonstrations where you highlight with text and code the thing you learned, and why you think it is useful. OR

  • 3 clearly labeled things you want to learn.

  • 3 examples demonstrating what you want to learn, with accompanying text explaining why you think it would be useful.