Reshaping pandas
Objects
Sometimes the data you receive is inconvenient — all of it can be useful, but you might have received it with too many redundant columns or a column that would be more useful as an index. To accomodate this, pandas
has several functions and practices that allow you to reshape your data.
pivot
The pivot
method is the main method for reshaping, taking arguments for index
, columns
, and values
, whose applications are very straightforward.
-
index
: the originalDataFrame
column whose values will become the pivotedDataFrame
indices. -
columns
: the originalDataFrame
column whose values will become the pivotedDataFrame
column names. -
values
: the originalDataFrame
column whose values will occupy the pivotedDataFrame
values.
Given the sample DataFrame
below, let’s use pivot
to set location
as our index values, week_of_season
as the new columns, and attendance
as our values:
import pandas as pd
list_1 = ['Wisconsin', 'IU', 'Rutgers', 'Michigan State', 'Ohio State']
list_2 = ['home', 'away', 'home', 'away', 'home']
list_3 = [85, 78, 90, 75, 74]
list_4 = [500, 1000, 430, 4800, 10000]
list_5 = [1, 2, 3, 4, 5]
myDF = pd.DataFrame(zip(list_1, list_2, list_3, list_4, list_5),
columns=['opponent', 'location', 'temp', 'attendance', 'week_of_season'])
print(myDF.pivot(index='location', columns='week_of_season', values='attendance'))
week_of_season 1 2 3 4 5 location away NaN 1000.0 NaN 4800.0 NaN home 500.0 NaN 430.0 NaN 10000.0
pivot
is easy to use and manipulate, as any column can take the place of index
, columns
, and values
depending on your application.
A notable feature is that values
can be omitted from the function call, resulting in a higher-order column name for each of the remaining columns, then your values specified in the columns
parameter. We’ll repeat the above example, now making week_of_season
our index and location
our column values.
football_pivot = myDF.pivot(index='week_of_season', columns='location')
print(football_pivot)
opponent temp attendance location away home away home away home week_of_season 1 NaN Wisconsin NaN 85.0 NaN 500.0 2 IU NaN 78.0 NaN 1000.0 NaN 3 NaN Rutgers NaN 90.0 NaN 430.0 4 Michigan State NaN 75.0 NaN 4800.0 NaN 5 NaN Ohio State NaN 74.0 NaN 10000.0
Cool! The benefit here is that you have a correctly-shaped DataFrame
that has all the values you (or your partners) might need in the future. Selecting one column uses standard indexing from this point:
print(football_pivot['opponent'])
location away home week_of_season 1 NaN Wisconsin 2 IU NaN 3 NaN Rutgers 4 Michigan State NaN 5 NaN Ohio State
melt
The features of melt
are split into identifier variables and measured variables:
-
The identifiers are not indices, but they are static parts of your data that you want to maintain, such as names or locations. These are usually duplicated for each measured variable you have.
-
Measured variables are all the non-identifier columns. They are melted into the column
variable
, which contains the names of each measured column, andvalue
, each column’s corresponding value for that identifier.
We’ll show an example before we get too wordy:
import pandas as pd
hail = pd.DataFrame(
{
"first": ["Purdue", "Dr"],
"last": ["Pete", "Ward"],
"topic": ["Cheerleading", "Progeny"],
"score": [100, 7],
}
)
print(hail)
first last topic score 0 Purdue Pete Cheerleading 100 1 Dr Ward Progeny 7
print(hail.melt(id_vars=['first', 'last']))
first last variable value 0 Purdue Pete topic Cheerleading 1 Dr Ward topic Progeny 2 Purdue Pete score 100 3 Dr Ward score 7
We can interpret melt
as an anti-pivot
— we’re minimizing the number of columns by putting more data in rows, resulting in repetition of our identifier variable(s) for each measured variable.
In addition to selecting your identifier variables, you can select your measured variables using value_vars=
, change "variable" to something else using var_name=
, and change "value" to something else using value_name=
.