STAT 19000: Project 3 — Spring 2022
Motivation: We’ve now been introduced to a variety of core Python data structures. Along the way we’ve touched on a bit of pandas, matplotlib, and have utilized some control flow features like for loops and if statements. We will continue to touch on pandas and matplotlib, but we will take a deeper dive in this project and learn more about control flow, all while digging into the data!
Context: We just finished a project where we were able to see the power of dictionaries and sets. In this project we will take a step back and make sure we are able to really grasp control flow (if/else statements, loops, etc.) in Python.
Scope: Python, dicts, lists, if/else statements, for loops, break, continue
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Questions
Question 1
Let’s begin this project by taking another look at question (4) from the previous project.
Although we were able to reduce the number of comparisons down a lot (from around 15000000 squared to 40000 squared) — it is still terrible and very very slow.
To see just how slow, let’s time it!
from block_timer.timer import Timer
import pandas as pd
# read in the intruder dataset and get the unique ids
df_intruder = pd.read_csv('/depot/datamine/data/noaa/2020_sampleB.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"])
intruder_ids = df_intruder["station_id"].dropna().tolist()
unique_intruder_ids = list(set(intruder_ids))
# read in the original dataset and get the unique ids
df_original = pd.read_csv('/depot/datamine/data/noaa/2020_sample.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"])
original_ids = df_original["station_id"].dropna().tolist()
unique_ids = list(set(original_ids))
with Timer():
# compare the two lists
for i in unique_intruder_ids:
if i not in unique_ids:
print(i)
Yikes! That’s really not very good!
So, what is the better way? To take advantage of the set
object! Specifically, read the section titled "Operating on a Set" here, and think of a better way to get this value! Test out the new method — how fast was it compared to the method above?
On Brown, mine was 958 times faster than the original method! Definitely a worthwhile trick to use! |
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Unlike in R, where traditional loops are rare and typically accomplished via one of the apply functions, in Python, loops are extremely common and important to understand. In Python, any iterator can be looped over. Some common iterators are: tuples, lists, dicts, sets, pandas Series, and pandas DataFrames.
Let’s get started by reading in our dataset and taking a look.
import pandas as pd
df = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt", sep=";")
Use the following code to extract the sales amount in dollars into a list.
sales_list = df['Sale (Dollars)'].dropna().tolist()
Write a loop that uses sales_list
and sums up the total sales, and prints the average sales amount.
Of course, pandas
provides a method to iterate over the Sale (Dollars)
Series as well! It would start as follows.
for idx, val in df['Sale (Dollars)'].dropna().iteritems():
# put code here for series loop
Use this method to calculate the average sales amount. Which is faster? Fill in the following skeleton code to find out.
from block_timer.timer import Timer
with Timer(title="List loop"):
# code for list loop
with Timer(title="Series loop"):
# code for series loop
-
Code used to solve this problem.
-
Output from running the code.
Question 3
You may have been surprised by the fact that iterating through the Series was slower than iterating through a list. Here is a good post explaining why it is so slow!
So why use pandas
? Well, it starts to be pretty great when you can take advantage of vectorization.
Let’s do a new exercise. Instead of calculating the average sales amount, let’s calculate the z-scores of the sales amounts. Just like before, do this using 2 methods. The first is to just use for loops, the len
function, and the sum
function. The second is to use pandas
. I’ve provided you with the pandas solution.
How do you calculate a z-score?
$\frac{x_i - \mu}{\sigma}$
Where
$\sigma = \sqrt{\sum_{i=0}^n{\frac{(x_i - \mu)^{2}}{n}}}$
$n$ is the number of elements in the list.
$x_i$ is the ith element in the list.
$\mu$ is the mean of the list.
$\sigma$ is the standard deviation of the list.
Give it a shot and fill in the code below. What do the results look like?
import pandas as pd
from block_timer.timer import Timer
# df = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt", sep=";")
sales_list = df['Sale (Dollars)'].dropna().tolist()
with Timer(title="Loops"):
# calculate the mean
mean = sum(sales_list)/len(sales_list)
# calculate the std deviation
# you can use **2 to square a value and
# **0.5 to square root a value
# calculate the list of z-scores
# print the first 5 z-scores
print(zscores[:5])
with Timer(title="Vectorization"):
print(((df['Sale (Dollars)'] - df['Sale (Dollars)'].mean())/df['Sale (Dollars)'].std()).iloc[0:5])
-
Code used to solve this problem.
-
Output from running the code.
Question 4
While it is nearly always best to try and vectorize your code when using pandas
, sometimes it isn’t possible to do perfectly, or it just isn’t worth the time to do it. For this question, we don’t care about vectorization.
We want to look at Volume Sold (Gallons)
by Store Number
. Start by building a dict called volume_dict
that maps Store Number
to Volume Sold (Gallons)
.
Since we only care about those two columns now, let’s remove the rest.
df = df.loc[:, ('Store Number', 'Volume Sold (Gallons)')]
You can loop through the DataFrame as follows.
for idx, row in df.iterrows():
# print(idx, row)
There, idx
contains the row index, and row
contains a Series object containing the row of data. You could then access either of the column using either row['Store Number']
or row['Volume Sold (Gallons)']
.
Build your volume_dict
.
Remember, you will need to instantiate each key in the dict to prevent `KeyError`s. Alternatively, you can use a defaultdict. A defaultdict is a dict that will automatically instantiate a new key to a particular value. You could for example do the following.
Then, by default, all keys will be instantiated to 0. |
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Great! Now you have your volume_dict
. Write a loop that loops through your volume_dict
and prints the Store Number
and Volume Sold (Gallons)
for each key. If the volume sold is less than 100000 use the continue
keyword to skip printing anything. If the volumn sold is greater than 149999, print "HIGH: " before the store number, if the volume sold is less than 150000 print "LOW: " before the store number.
The output should be the following.
LOW: 2190.0 HIGH: 4829.0 HIGH: 2633.0 HIGH: 2512.0 LOW: 3494.0 LOW: 2625.0 HIGH: 3420.0 LOW: 3952.0 HIGH: 3385.0 LOW: 3354.0 LOW: 3814.0
The |
In Python, there is if/elif/else. Elif stands for "else if". |
To iterate through a dictionary, you can use the |
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |