Web Scraping
Web scraping is the automated retrieval of data from web pages. It is one of the foundational tools of the modern web: Google, for instance, began as a very large-scale web scraping system (such systems are typically called crawlers) designed to scrape the entire web. Google has since grown beyond its crawler, but web scraping remains an important way for many organizations, not just Google, to retrieve data automatically.
Common Applications
Common Industries
Many industries utilize web scraping to keep track of market changes, product prices, etc. For instance:
- Retail: scraping product prices from a competitor's website to keep track of them automatically
- Marketing/Advertising/Communication: companies like Syften monitor web communications using scraping systems and use the results to help companies find customers talking about their products (or their competitors' products)
- Social Media: one of the biggest targets for scraping, because public sentiment can be analyzed here (see Sentiment Analysis)
Code Examples
All of the code examples are written in Python unless otherwise noted.
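As a small illustration of the retail use case mentioned earlier (tracking product prices), the sketch below pulls product names and prices out of a page's HTML using only Python's standard library. The HTML snippet, the `name` and `price` class names, and the `PriceParser` and `scrape_prices` helpers are all hypothetical and invented for this example; a real site will use different markup, and a real scraper would first download the page over HTTP.

```python
from html.parser import HTMLParser

# Hypothetical page markup, inlined so the example runs without network access.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$19.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$5.00</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect (name, price) pairs from <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self._field = None   # which field the current <span> holds, if any
        self._name = None    # most recently seen product name
        self.products = []   # list of (name, price) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            css_class = dict(attrs).get("class")
            if css_class in ("name", "price"):
                self._field = css_class

    def handle_data(self, data):
        if self._field == "name":
            self._name = data.strip()
        elif self._field == "price":
            # Assumes prices look like "$19.99"; drop the currency sign before converting.
            price = float(data.strip().lstrip("$"))
            self.products.append((self._name, price))

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_prices(html):
    """Return a list of (name, price) tuples found in the given HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.products


if __name__ == "__main__":
    for name, price in scrape_prices(SAMPLE_HTML):
        print(f"{name}: {price:.2f}")
```

In practice, many scrapers use third-party libraries for downloading and parsing pages; the standard-library parser is used here only to keep the sketch self-contained.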
TDM Seminar Scraping Examples
If you aren’t enrolled in the particular Seminar, then this is just for practice. Nonetheless, the Seminar courses are fairly in depth and are great to learn from!
Containers
These are code examples in the form of Jupyter notebooks that run in a container, which comes with all the data, libraries, and code you’ll need to run them. Click here to learn why you should be using containers, along with how to do so.
# Pull the container image (only needs to be run once)
docker pull ghcr.io/thedatamine/starter-guides:web-scraping-intro
# Run the container, publishing container port 8888 on host port 8888
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:web-scraping-intro
Need help implementing any of this code? Feel free to reach out to datamine-help@purdue.edu and we can help!