STAT 29000: Project 10 — Spring 2022
Motivation: The use of a suite of packages referred to as the tidyverse is popular with many R users. It is apparent, just by looking at tidyverse R code, that it varies greatly in style from typical R code. It is useful to gain some familiarity with this collection of packages in case you run into a situation where these packages are needed. You may even find that you enjoy using them!
Context: We’ve covered a lot of ground so far this semester, almost entirely using Python. In this next series of projects we are going to switch back to R, with a strong focus on the tidyverse (including ggplot) and data wrangling tasks.
Scope: R, tidyverse, ggplot
Make sure to read about and use the template found here, and the important information about project submissions here.
The "tidyverse" consists of a variety of packages, including, but not limited to: ggplot2, dplyr, tidyr, readr, magrittr, purrr, tibble, stringr, and lubridate.
One of the underlying premises of the tidyverse is getting the data to be tidy. You can read a lot more about this in Hadley Wickham’s book, R for Data Science.
There is an excellent graphic here that illustrates a general workflow for data science projects:
- Import
- Tidy
- Iterate on, to gain understanding:
  - Transform
  - Visualize
  - Model
- Communicate
This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change.
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/beer/beers.csv
- /depot/datamine/data/beer/reviews_sample.csv
Questions
Question 1
The first step in our workflow is to read the data.
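As a minimal sketch of that step, the read_csv function from readr reads a CSV file into a tibble. The file and column names below are made up so that the example runs anywhere; they are not the project data.

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file so the sketch is self-contained.
# The column names here are hypothetical, not the columns of the project files.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,abv", "1,Toy Lager,4.5", "2,Toy Stout,6.1"), toy_path)

# read_csv() guesses the column types and returns a tibble.
toy_beers <- read_csv(toy_path, show_col_types = FALSE)

dim(toy_beers)    # number of rows, then number of columns
class(toy_beers)  # includes "tbl_df", the tibble class
```

The same call should work with a path to a real file in place of toy_path.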
Read the datasets beers.csv and reviews_sample.csv using the read_csv function from tidyverse into tibbles called beers and reviews, respectively.
"Tibbles" are essentially the tidyverse equivalent of data frames.
In projects 10 and 11, we want to analyze and compare different beers. Note that in reviews, each row corresponds to a review by a certain user on a certain date. As reviews likely vary between individuals, we may want to summarize our reviews tibble.
To do that, let’s start by deciding how we are going to summarize the reviews. Start by picking one of the variables (columns) from the reviews dataset to be our "beer goodness indicator". For example, maybe you believe that the taste is important in beverages (seems reasonable).
Now, determine a summary statistic that we will use to compare beers based on your beer goodness indicator variable. Examples include mean, median, sd (standard deviation), max, min, etc. Write 1-2 sentences describing why you chose that statistic for your variable(s). You can use anecdotal evidence (some reasoning about why you think that summary statistic would be appropriate or useful here), or look at the distribution using plots or summary statistics, to pick your preferred summary statistic for this case.
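To see how the choice of statistic can matter, here is a small sketch on made-up ratings (not values from the reviews data), where one unusually low rating affects the mean much more than the median.

```r
# Hypothetical ratings for a single beer; the 1.0 plays the role of an outlier.
toy_ratings <- c(4.0, 4.25, 4.5, 4.0, 1.0)

mean(toy_ratings)    # 3.55, pulled down by the outlier
median(toy_ratings)  # 4, much less affected by the outlier
sd(toy_ratings)      # spread of the ratings around the mean
```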
If you are making a plot, please be sure to use the ggplot package.
If you wanted to have some fun, you could decide to combine different variables into a single one. For instance, maybe you want to take into consideration both |
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences describing what your beer_goodness_indicator is (variable and summary statistic), and why.
Question 2
Now that we have decided how to compare beers, let’s create a new variable called beer_goodness_indicator in the reviews dataset. For each beer_id, summarize the reviews data to get a single beer_goodness_indicator based on your answer from question 1. Call this summarized dataset reviews_summary.
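The general group-then-summarize pattern might look something like the sketch below. The tibble and its column names (item_id, score) are made up for illustration, and the median is just one possible choice of statistic.

```r
library(dplyr)
library(tibble)

# Toy stand-in for a table of reviews; the column names are hypothetical.
toy_reviews <- tibble(
  item_id = c(1, 1, 2, 2, 2),
  score   = c(4.0, 5.0, 3.0, 3.5, 4.0)
)

# %>% passes the result on its left to the function on its right.
# group_by() marks the groups; summarize() collapses each group to one row.
toy_summary <- toy_reviews %>%
  group_by(item_id) %>%
  summarize(score_median = median(score))

toy_summary  # one row per item_id
```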
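You may be wondering what the heck the %>% is. It is the pipe operator from the magrittr package: it passes the value on its left as the first argument of the function on its right. It could be as simple as getting the head of a dataframe by piping the dataframe into head instead of calling head on it directly.
Why use pipes? The piped version is arguably easier to read, and it is easier to edit. You could easily want to add a column to the dataframe first. With pipes, that is one more step in the chain. In the non-piped version, each extra step wraps another function call around the existing code, so the code has to be read from the inside out.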
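The comparison between piped and non-piped code can be sketched as follows. The tibble toy_df and its columns are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data.
toy_df <- tibble(x = c(3, 1, 2), y = c("c", "a", "b"))

# Without pipes: the dataframe is passed as the first argument.
head(toy_df, 2)

# With pipes: the left-hand side becomes the first argument of head().
toy_df %>% head(2)

# Nested calls have to be read from the inside out...
nested <- head(arrange(mutate(toy_df, x2 = x * 2), x), 2)

# ...while the piped version reads from top to bottom.
piped <- toy_df %>%
  mutate(x2 = x * 2) %>%
  arrange(x) %>%
  head(2)

all.equal(nested, piped)  # both versions give the same result
```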
- Code used to solve this problem.
- Output from running the code.
- Head of reviews_summary dataset.
Question 3
Let’s combine our beers dataset with reviews_summary into a new dataset called beers_reviews that contains only beers that appear in both datasets. Use the appropriate join function from tidyverse (inner_join, left_join, right_join, or full_join) to solve this problem. Since you saw some examples using pipes (%>%) in the previous question, use pipes from here on out.
What are the dimensions of the resulting beers_reviews dataset? How many beers did not appear in both datasets?
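To see how the join functions differ in which rows they keep, here is a sketch on two made-up tibbles that share a hypothetical key column called item_id. The anti_join function, also from dplyr, is included as one possible way to find rows without a match.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables sharing the key column item_id.
toy_items  <- tibble(item_id = c(1, 2, 3), name = c("A", "B", "C"))
toy_scores <- tibble(item_id = c(2, 3, 4), score = c(3.5, 4.0, 4.5))

# inner_join keeps only keys present in both tables (2 and 3).
both <- toy_items %>% inner_join(toy_scores, by = "item_id")

# left_join keeps every row of the left table (1, 2, 3); score is NA for 1.
left <- toy_items %>% left_join(toy_scores, by = "item_id")

# full_join keeps every key from either table (1, 2, 3, 4).
full <- toy_items %>% full_join(toy_scores, by = "item_id")

# anti_join keeps rows of the left table with no match on the right (1).
unmatched <- toy_items %>% anti_join(toy_scores, by = "item_id")

dim(both)
dim(left)
dim(full)
dim(unmatched)
```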
- Code used to solve this problem.
- Output from running the code.
- Result of running dim(beers_reviews).
Question 4
Ok, now we have the dataset ready to analyze! For beers that are available during the entire year (see availability), is there a difference between retired and not retired beers in terms of beer_goodness_indicator?
- Start by subsetting the dataset using filter.
- Create some data-driven method to answer this question. You can make a plot or get summary statistics (average beer_goodness_indicator, a table comparing the number of beers with beer_goodness_indicator > 4 for each category, etc.). You can use multiple methods to answer this question! Have fun!
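The two steps above could be sketched as follows, on a made-up tibble. The column names (season, discontinued, score) and the value "All year" are assumptions for illustration only; check the actual values in your data before filtering.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; names and values are not taken from the project data.
toy <- tibble(
  season       = c("All year", "All year", "All year", "Winter", "All year"),
  discontinued = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  score        = c(3.0, 4.0, 4.5, 5.0, 3.5)
)

toy_comparison <- toy %>%
  filter(season == "All year") %>%   # keep only the rows of interest
  group_by(discontinued) %>%         # one group per category
  summarize(
    n          = n(),                # number of rows in the group
    mean_score = mean(score),        # average score in the group
    n_above_4  = sum(score > 4)      # how many scores exceed 4
  )

toy_comparison
```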
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences comparing retired and not retired beers in terms of beer_goodness_indicator, based on your chosen method(s). Did the results surprise you?
- 1-2 sentences explaining what data-driven method(s) you decided to use and why.
Question 5
Let’s compare different styles of beer based on our beer_goodness_indicator average. Create a Cleveland dotplot (using ggplot) comparing the average beer_goodness_indicator for each style in beers_reviews. Make sure to use tidyverse functions and ggplot to answer this question.
A Cleveland dotplot shows one point per category, with the categories on one axis and the value being compared on the other.
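As a rough sketch, a Cleveland dotplot can be built with geom_point after summarizing the data to one row per category. The tibble and column names below (category, score) are made up, and sorting the categories with reorder is an optional design choice, not a requirement.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical toy data with one row per observation.
toy_styles <- tibble(
  category = c("A", "A", "B", "B", "C"),
  score    = c(3.0, 4.0, 4.5, 4.0, 2.5)
)

# Summarize to one row per category.
toy_means <- toy_styles %>%
  group_by(category) %>%
  summarize(mean_score = mean(score))

# One point per category; reorder() sorts the categories by their mean.
p <- ggplot(toy_means, aes(x = mean_score, y = reorder(category, mean_score))) +
  geom_point() +
  labs(x = "Mean score", y = "Category")

p
```

Putting the categories on the y-axis tends to keep long labels readable when there are many categories.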
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.