EDA: Exploratory Data Analysis

Exploratory Data Analysis or EDA is one of the most important steps in the model building process. During EDA you learn about the data that you have available to you, develop some questions or goals with the data (if you don’t already have them), and figure out preprocessing steps. Some common EDA tasks include:

  • Does the data have missing values?

  • Does the data contain outliers?

  • How is the data distributed?

  • General data information: how many features, what is the shape, etc.

  • Take a peek at the first few rows of data

  • Does the data have any odd trends over time?

  • Are the variables related to each other? (Correlation or covariance)

  • Are there other features we could create that would benefit the model?

Basic Functions in EDA

Here we list some commonly used functions used for EDA. You can explore how they get used in the EDA notebook in the code examples below.

The following functions are for the pandas package, but most data manipulation packages (like numpy) will have similar functionality, sometimes with the same names even.
  • head()

  • describe()

  • shape

  • unique()

  • info()

Code Examples

All of the code examples are written in Python, unless otherwise noted.

Containers

These are code examples in the form of Jupyter notebooks running in a container that come with all the data, libraries, and code you’ll need to run it. Click here to learn why you should be using containers, along with how to do so.
Quickstart: Download Docker, then run the commands below in a terminal.
#pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:eda

#run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:eda

Need help implementing any of this code? Feel free to reach out to datamine-help@purdue.edu and we can help!