TDM 10100: Project 4 — Fall 2023
Many data science tools, including R, have powerful ways to index data.
R typically has operations that are vectorized and there is little to no need to write loops.
if/else statements control which code runs, based on a logical condition.
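A minimal sketch of an if statement paired with an else branch. The variable names and values are hypothetical and chosen only for illustration. The last lines show ifelse, the vectorized counterpart, which is usually the better fit when the condition should be checked for every element of a vector.

```r
# Hypothetical single price, used only for illustration
price <- 1500

if (price >= 2000) {
  # runs only when the condition is TRUE
  label <- "at least $2000"
} else {
  # runs only when the condition is FALSE
  label <- "under $2000"
}
label

# ifelse() checks the condition for every element of a vector at once
prices <- c(500, 2000, 12000)
labels <- ifelse(prices >= 2000, "at least $2000", "under $2000")
labels
```

Note that if expects a single TRUE or FALSE, so it is not the right tool for a whole column of a data.frame.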
Context: As we continue to become more familiar with R, this project will help reinforce the many ways of indexing data in R.
Scope: R, data.frames, indexing.
Make sure to read about and use the template found here, and the important information about project submissions here.
Using the seminar-r kernel
Let's first see all of the files that are in the craigslist folder:
list.files("/anvil/projects/tdm/data/craigslist")
Remember: file.info reports the size of a file (in bytes):
file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size
After looking at several of the files, we will go ahead and read in the data frame on the vehicles:
myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE)
It is important that, each time we look at data, we start by becoming familiar with the contents of the data.
In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice.
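As a reminder of that practice, here is a minimal sketch on a small toy data frame. The name toyDF, its columns, and its values are made up for illustration; the same calls can be applied to myDF.

```r
# Toy stand-in for myDF; the column names and values are made up
toyDF <- data.frame(
  price    = c(4500, 1200, 30000, 9800),
  odometer = c(120000, 45000, 8000, 160000),
  state    = c("tx", "ca", "tx", "in")
)

head(toyDF, n = 2)   # first two rows
tail(toyDF, n = 2)   # last two rows
str(toyDF)           # column types and a preview of the values
dim(toyDF)           # number of rows, then number of columns
```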
This dataset has 25 columns, and we are unable to see all of them without adjusting the display width. We can do this with:
options(repr.matrix.max.cols=25, repr.matrix.max.rows=200)
We also remember (from the previous project) that we can set the output in R to look more natural this way:
options(jupyter.rich_display = FALSE)
To sort or order a single vector, you can use the sort and order functions.
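A short sketch of sorting a single vector, using made-up values. sort returns the values themselves in order, while order returns the positions that would put the vector in order, which is what makes it useful for rearranging the rows of a data.frame.

```r
prices <- c(4500, 1200, 30000, 9800)   # hypothetical values

sorted_prices <- sort(prices)                    # increasing order
desc_prices   <- sort(prices, decreasing = TRUE) # decreasing order

idx       <- order(prices)   # positions that would sort the vector
reordered <- prices[idx]     # gives the same result as sort(prices)
```

For a data.frame, the same idea would look like myDF[order(myDF$price), ], which rearranges whole rows by price.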
vectorization
Most of R’s functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the [] symbol for indexing.
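To make this concrete, here is a small sketch with a made-up vector of mileages. It shows one vectorized arithmetic operation, one vectorized comparison, and the most common forms of [] indexing.

```r
odometer <- c(120000, 45000, 8000, 160000)   # hypothetical mileages

# Vectorized: each operation is applied to every element, no loop needed
km      <- odometer * 1.609344      # miles to kilometers
is_high <- odometer >= 100000       # one TRUE/FALSE per element

# Indexing with []
second    <- odometer[2]            # a single position
first_two <- odometer[c(1, 2)]      # several positions
high_only <- odometer[is_high]      # logical vector keeps the TRUE elements
no_first  <- odometer[-1]           # a negative index drops that position
```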
Questions
Question 1 (1.5 pts)
- How many unique states are there in total? Which five of the states have the most occurrences?
- How many cars have a price that is greater than or equal to $2000?
- What is the average price of the vehicles in the dataset?
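The kinds of calls these questions point toward can be sketched on a toy data frame. The data below are made up, and this is only one possible approach, not the required one. The na.rm = TRUE arguments are an assumption that missing prices should be ignored.

```r
# Toy data; the values are made up for illustration
toyDF <- data.frame(
  state = c("tx", "ca", "tx", "in", "ca", "tx"),
  price = c(4500, 1200, 30000, 9800, NA, 2000)
)

# Number of distinct values in a column
n_states <- length(unique(toyDF$state))

# Counts per value, largest first, then keep (up to) the top five
state_counts <- sort(table(toyDF$state), decreasing = TRUE)
top_states   <- head(state_counts, n = 5)

# Summing a logical vector counts its TRUE values
n_at_least_2000 <- sum(toyDF$price >= 2000, na.rm = TRUE)

# Mean of a numeric column, ignoring missing values
avg_price <- mean(toyDF$price, na.rm = TRUE)
```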
Question 2 (1.5 pts)
- Create a new column mileage_category in your data.frame that categorizes the vehicle’s mileage into different buckets by using the cut function on the odometer column.
  - "Low": [0, 50000)
  - "Moderate": [50000, 100000)
  - "High": [100000, 150000)
  - "Very High": [150000, Inf)
- Create a new column called has_VIN that flags whether or not the listed vehicle has a VIN provided.
- Create a new column called description_length to categorize listings based on the length of their descriptions (in terms of the number of characters).
  - "Very Short": [0, 50)
  - "Short": [50, 100)
  - "Medium": [100, 200)
  - "Long": [200, 500)
  - "Very Long": [500, Inf)
You may count number of characters using the
Remember to consider empty values and or
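The bucketing described above can be sketched on toy data as follows. Everything in toyDF is made up. Using nchar to count characters, and treating both NA and the empty string as "no VIN", are assumptions made for this sketch rather than requirements stated by the project. Because myDF was read with stringsAsFactors = TRUE, text columns there are factors, so the sketch converts with as.character before counting characters.

```r
# Toy data; all values are made up for illustration
toyDF <- data.frame(
  odometer    = c(12000, 50000, 149999, 150000, NA),
  VIN         = c("1HGCM82633A004352", "", NA, "2T1BURHE0FC123456", ""),
  description = c("Runs great", "", NA, "One owner, clean title", "Call"),
  stringsAsFactors = FALSE
)

# right = FALSE closes each interval on the left, e.g. [0, 50000)
toyDF$mileage_category <- cut(
  toyDF$odometer,
  breaks = c(0, 50000, 100000, 150000, Inf),
  labels = c("Low", "Moderate", "High", "Very High"),
  right  = FALSE
)

# Assumption: a VIN counts as provided only if it is neither NA nor ""
vin <- as.character(toyDF$VIN)
toyDF$has_VIN <- !is.na(vin) & vin != ""

# nchar() counts characters; a missing description stays NA
desc_chars <- nchar(as.character(toyDF$description))
toyDF$description_length <- cut(
  desc_chars,
  breaks = c(0, 50, 100, 200, 500, Inf),
  labels = c("Very Short", "Short", "Medium", "Long", "Very Long"),
  right  = FALSE
)
```

With these settings, a missing odometer reading or a missing description ends up as NA in the new column rather than in any bucket.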
Question 3 (1.5 pts)
- Using the table function, and the new column mileage_category that you created in Question 2, find the number of cars in each of the different mileage categories.
- Using the table function, and the new column has_VIN that you created in Question 2, identify how many vehicles have a VIN and how many do not have a VIN.
- Using the table function, and the new column description_length that you created in Question 2, identify how many vehicles are in each of the categories of description length.
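A small sketch of how table behaves on a categorical vector, using a made-up factor. By default table drops NA values; the useNA argument makes them visible, which can help when checking how many rows fell outside every category.

```r
# Made-up categories, including one missing value
mileage_category <- factor(
  c("Low", "High", "Low", NA, "Very High"),
  levels = c("Low", "Moderate", "High", "Very High")
)

counts         <- table(mileage_category)                   # NA is dropped
counts_with_na <- table(mileage_category, useNA = "ifany")  # NA is counted
counts
counts_with_na
```

Levels with no observations (here "Moderate") still appear, with a count of 0.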
Question 4 (1.5 pts)
Preparing for Mapping
- Extract all of the data for Texas into a data.frame called myTexasDF
- Identify the most popular state from myDF, and extract all of the data from that state into a data.frame called popularStateDF
- Create a third data.frame called myFavoriteDF with the data from a state of your choice
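Extracting rows by state is an instance of logical indexing on a data.frame. The sketch below uses toy data, and the lowercase code "tx" is an assumption about how states are stored; check the actual values in myDF before relying on it.

```r
# Toy data; state codes and prices are made up
toyDF <- data.frame(
  state = c("tx", "ca", "tx", "in", "ca", "tx"),
  price = c(4500, 1200, 30000, 9800, 700, 2000)
)

# Keep the rows where the condition is TRUE, and keep all columns
texasRows <- toyDF[toyDF$state == "tx", ]

# The most frequent state is the name of the largest entry in the table
mostCommon  <- names(which.max(table(toyDF$state)))
popularRows <- toyDF[toyDF$state == mostCommon, ]
```

The empty slot after the comma in toyDF[rows, ] is what keeps every column.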
Question 5 (2 pts)
Mapping
- Using the R package leaflet, make 3 maps of the USA, namely, one map for the data in each of the data.frames from Question 4.
Submitting your Work
Well done, you’ve finished Project 4! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.
Project 4 Assignment Checklist
- Code used to solve questions 1 to 5
- All of your code and comments, and output from running the code, in a Jupyter Lab file:
  - firstname-lastname-project04.ipynb
- All of your code and comments in an R file:
  - firstname-lastname-project04.R
- Submit files through Gradescope
You must double check your submission. You will not receive full credit if your submission is incomplete.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.