Web Scraping

Web scraping is the automated retrieval of data from web pages. It is one of the foundational techniques of the modern web; for instance, Google began as a large-scale web scraping system (such systems are typically called crawlers) designed to scrape the entire web. Google has since grown beyond its crawler, but web scraping remains an important way for many organizations to retrieve data automatically.
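To make the crawler idea a little more concrete, here is a minimal sketch of the step every crawler repeats: reading a page and collecting the links it contains so they can be visited next. It uses only the Python standard library. The page content, the `LinkCollector` class, and the `extract_links` helper are made up for illustration; a real crawler would download pages over HTTP and would need to respect each site's terms and robots.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content; a real crawler would fetch this over HTTP.
SAMPLE_PAGE = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/news">News</a>
  <p>No link here.</p>
</body></html>
"""


def extract_links(html, base_url):
    """Return every link found in `html`, resolved against `base_url`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


links = extract_links(SAMPLE_PAGE, "https://example.com/index.html")
print(links)
```

A crawler would add each returned link to a queue of pages to visit, skipping any it has already seen.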

Common Applications

Common Industries

Many industries use web scraping to keep track of market changes, product prices, and similar data. For instance:

Retail: scraping product prices from a competitor's website to automatically keep track of their prices
Marketing/Advertising/Communication: companies like Syften monitor web communications using scraping systems and use that to help companies find customers talking about their products (or their competitors' products)
Social Media: one of the biggest targets for scraping, because public sentiment can be analyzed here (see Sentiment Analysis)
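As a rough illustration of the retail case above, the sketch below pulls product names and prices out of a small HTML snippet and reports which prices changed since an earlier scrape. Everything specific here is an assumption made for the example: the markup, the `name` and `price` class names, the `PriceParser` class, and the earlier prices. Real sites structure their pages differently, and real price strings (thousands separators, other currencies) need more careful parsing.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Pair each product name with its price in the sample markup."""

    def __init__(self):
        super().__init__()
        self.prices = {}
        self._field = None         # "name" or "price" while inside such a span
        self._current_name = None  # most recent product name seen

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "name":
            self._current_name = text
        elif self._current_name is not None:
            # Assumes simple prices such as "$19.99" (no thousands separators).
            self.prices[self._current_name] = float(text.lstrip("$"))
            self._current_name = None


def price_changes(previous, current):
    """Return {product: (old_price, new_price)} for prices that differ."""
    return {
        name: (previous[name], price)
        for name, price in current.items()
        if name in previous and previous[name] != price
    }


# Hypothetical competitor page; the class names are invented for this sketch.
SAMPLE_PAGE = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$19.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$5.50</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(SAMPLE_PAGE)
current_prices = parser.prices

# Hypothetical prices from an earlier scrape.
previous_prices = {"Widget": 21.99, "Gadget": 5.50}
changes = price_changes(previous_prices, current_prices)
print(changes)
```

In practice the scraped prices would be stored after every run so that each new scrape can be compared against the last one.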

Code Examples

All of the code examples are written in Python, unless otherwise noted.

TDM Seminar Scraping Examples

If you aren’t enrolled in the particular Seminar, then this is just for practice. Nonetheless, the Seminar courses are fairly in-depth and are great to learn from!

TDM20200 Spring (Projects 2-6)

TDM40200 Fall (Projects 10-13)

Containers

These code examples are Jupyter notebooks running in a container that comes with all the data, libraries, and code you’ll need to run them. Click here to learn why you should be using containers, along with how to do so.

# Pull the container (only needs to be run once)
docker pull ghcr.io/thedatamine/starter-guides:web-scraping-intro

# Run the container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:web-scraping-intro

Need help implementing any of this code? Feel free to reach out to datamine-help@purdue.edu and we can help!

Resources

All resources are chosen by Data Mine staff to be of decent quality, and most if not all content is free.

Videos

Books

Articles

Free Courses