TDM 20100: Project 7 — 2022
Motivation: awk
is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like awk
.
Context: This is the third of three projects where we introduce awk
. awk
is a powerful tool that can be used to perform a variety of the tasks that we’ve previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.
Scope: awk, UNIX utilities
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Questions
Question 1
Take a look at the dataset. You may have noticed that the "Store Location" column (8th column) contains latitude and longitude coordinates. That is some rich data that could be fun and useful.
The data will look something like the following:
Store Location POINT (-91.716615 41.963516) POINT (-91.6537 41.987286) POINT (-91.52888 40.962331000000006) POINT (-93.596755 41.5464) POINT (-91.658105 42.010971) POINT (-91.494611 41.807199) POINT (-91.796988 43.307662) POINT (-91.358467 41.280183)
What this means is that you can’t just parse out the latitude and longitude coordinates and call it a day — you need to use awk
functions like gsub
and split
to extract the latitude and longitude coordinates.
Use awk
to print out the latitude and longitude for each line in the original dataset. Output should resemble the following.
lat;lon 1.23;4.56
Make sure to take care of rows that don’t have latitude and longitude coordinates — just skip them. So if your results look like this, you need to add logic to skip the "empty" rows: -91.716615 41.963516 -91.6537 41.987286 -91.52888 40.962331000000006 -93.596755 41.5464 -91.658105 42.010971 -91.494611 41.807199 -91.796988 43.307662 -91.358467 41.280183 To do this, just go ahead and wrap your print in an if statement similar to:
|
|
If we have a bunch of data formatted like the following: POINT (-91.716615 41.963516) If we first used
-91.716615 41.963516) Then, you could use
-91.716615 41.963516 At this point I’m sure you can see how to use |
Don’t forget any lingering space after the first comma! We don’t want that. |
To verify your
output
41.963516;-91.716615 41.987286;-91.6537 40.962331000000006;-91.52888 41.5464;-93.596755 42.010971;-91.658105 41.807199;-91.494611 43.307662;-91.796988 41.280183;-91.358467 |
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Use awk
to create a new dataset called sales_by_store.csv
. Include the lat
and lon
you figured out how to parse in the previous question. The final columns should be the following.
store_name;date;sold_usd;volume_sold;lat;lon
Please exclude all rows that do not have latitude and longitude values. Save volume sold as liters, not gallons.
You can output the results of the
The |
To verify your output, the results from piping the first 10 lines of our dataset to your
output
store_name;date;sold_usd;volume_sold;lat;lon CVS PHARMACY #8443 / CEDAR RAPIDS;08/16/2012;5.25;41.963516;-91.716615 SMOKIN' JOE'S #6 TOBACCO AND LIQUOR;09/10/2014;9;41.987286;-91.6537 HY-VEE FOOD STORE / MOUNT PLEASANT;04/10/2013;1.5;40.962331000000006;-91.52888 AFAL FOOD & LIQUOR / DES MOINES;08/30/2012;1.12;41.5464;-93.596755 HY-VEE FOOD STORE #5 / CEDAR RAPIDS;01/26/2015;3;42.010971;-91.658105 SAM'S MAINSTREET MARKET / SOLON;07/19/2012;12;41.807199;-91.494611 DECORAH MART;10/23/2013;9;43.307662;-91.796988 ECON-O-MART / COLUMBUS JUNCTION;05/02/2012;2.25;41.280183;-91.358467 |
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Believe it or not, awk
even supports geometric calculations like sin
and cos
. Write a bash script that, given a pair of latitude and pair of longitude, calculates the distance between the two points.
Okay, so how to get started? To calculate this, we can use the Haversine formula. The formula is:
$2*r*arcsin(\sqrt{sin^2(\frac{\phi_2 - \phi_1}{2}) + cos(\phi_1)*cos(\phi_2)*sin^2(\frac{\lambda_2 - \lambda_1}{2})})$
Where:
-
$r$ is the radius of the Earth in kilometers, we can use: 6367.4447 kilometers
-
$\phi_1$ and $\phi_2$ are the latitude coordinates of the two points
-
$\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points
In awk
, sin
is sin
, cos
is cos
, and sqrt
is sqrt
.
To get the arcsin
use the following awk
function:
function arcsin(x) { return atan2(x, sqrt(1-x*x)) }
To convert from degrees to radians, use the following awk
function:
function dtor(x) { return x*atan2(0, -1)/180 }
The following is how the script should work (with a real example you can test):
%%bash
./question3.sh 40.39978 -91.387531 40.739238 -95.02756
309.57
To include functions in your
|
We want you to create a bash script called The following is some skeleton code that you can use to get started.
|
You may need to give your script execute permissions like this.
|
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Find the latitude and longitude points for two interesting points on a map (it could be anywhere). Make a note of the locations and the latitude and longitude values for each point in a markdown cell.
Use your question.sh
script to determine the distance. How close is the distance to the distance you get from an online map app? Pretty close?
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |