Reshaping Data#
In this module we will cover different ways to reshape data to summarize large datasets into useful information. This page covers the different functions using a bite-sized example.
Food Table#
This simple table contains a list of 20 foods along with category, color, and price. We will use this table to demonstrate the different functions for reshaping data.
import pandas as pd
food_data = pd.read_csv('food_table.csv')
food_data
Food | Category | Price | Color | |
---|---|---|---|---|
0 | Apple | Fruit | 1.2 | Red |
1 | Banana | Fruit | 0.9 | Yellow |
2 | Carrot | Vegetable | 0.5 | Yellow |
3 | Tomato | Vegetable | 0.7 | Red |
4 | Chicken | Protein | 3.0 | Brown |
5 | Salmon | Protein | 4.5 | Red |
6 | Brown Rice | Grain | 2.0 | Brown |
7 | Broccoli | Vegetable | 1.1 | Green |
8 | Strawberry | Fruit | 2.0 | Red |
9 | Cheddar Cheese | Dairy | 2.5 | Yellow |
10 | Yogurt | Dairy | 1.0 | Yellow |
11 | Blueberry | Fruit | 2.2 | Red |
12 | Lentils | Protein | 1.6 | Brown |
13 | Spinach | Vegetable | 1.2 | Green |
14 | Tofu | Protein | 1.8 | Brown |
15 | Quinoa | Grain | 3.1 | Brown |
16 | Oatmeal | Grain | 2.5 | Brown |
17 | Egg | Protein | 0.2 | Yellow |
18 | Grapes | Fruit | 2.0 | Red |
19 | Potato | Vegetable | 0.7 | Brown |
Groupby#
First, we will cover groupby
. This function is used to group the data by a certain column and then apply a function to other columns.
The groupby function is useful when you want to split your data into groups based on some criteria and then perform calculations on each group separately.
For example, if you want to know:
How many foods of each color are there?
What is the average price of foods in each category?
What is the most expensive food in each category?
The general syntax is:
df.groupby('column_to_group_by')['column_to_analyze'].function()
Let’s look at some examples below.
Aggregate#
In aggregation, we summarize multiple values into a single value. This can be the count, mean, sum, median, or any provided function.
# Group by color and count the number of foods in each
food_data.groupby('Color')['Food'].count()
Color
Brown 7
Green 2
Red 6
Yellow 5
Name: Food, dtype: int64
# Group by category and count the number of foods in each
food_data.groupby('Category')['Food'].count()
Category
Dairy 2
Fruit 5
Grain 3
Protein 5
Vegetable 5
Name: Food, dtype: int64
# Group by category and find the mean price of foods in each category
food_data.groupby('Category')['Price'].mean()
Category
Dairy 1.750000
Fruit 1.660000
Grain 2.533333
Protein 2.220000
Vegetable 0.840000
Name: Price, dtype: float64
# You can also apply multiple functions at once
food_data.groupby('Category')['Price'].agg(['mean', 'count'])
mean | count | |
---|---|---|
Category | ||
Dairy | 1.750000 | 2 |
Fruit | 1.660000 | 5 |
Grain | 2.533333 | 3 |
Protein | 2.220000 | 5 |
Vegetable | 0.840000 | 5 |
Transform#
Sometimes, you need to have a new column that is a transformation of the existing column.
In our example, let’s put a rank for the most cheapest food in each category.
This new column needs to be the same length as the original column, so, the .agg method won’t work.
Instead we use the .transform
method.
food_data['Rank'] = food_data.groupby('Category')['Price'].transform('rank')
food_data.sort_values(['Category', 'Price'], ascending=True)
Food | Category | Price | Color | Rank | |
---|---|---|---|---|---|
10 | Yogurt | Dairy | 1.0 | Yellow | 1.0 |
9 | Cheddar Cheese | Dairy | 2.5 | Yellow | 2.0 |
1 | Banana | Fruit | 0.9 | Yellow | 1.0 |
0 | Apple | Fruit | 1.2 | Red | 2.0 |
8 | Strawberry | Fruit | 2.0 | Red | 3.5 |
18 | Grapes | Fruit | 2.0 | Red | 3.5 |
11 | Blueberry | Fruit | 2.2 | Red | 5.0 |
6 | Brown Rice | Grain | 2.0 | Brown | 1.0 |
16 | Oatmeal | Grain | 2.5 | Brown | 2.0 |
15 | Quinoa | Grain | 3.1 | Brown | 3.0 |
17 | Egg | Protein | 0.2 | Yellow | 1.0 |
12 | Lentils | Protein | 1.6 | Brown | 2.0 |
14 | Tofu | Protein | 1.8 | Brown | 3.0 |
4 | Chicken | Protein | 3.0 | Brown | 4.0 |
5 | Salmon | Protein | 4.5 | Red | 5.0 |
2 | Carrot | Vegetable | 0.5 | Yellow | 1.0 |
3 | Tomato | Vegetable | 0.7 | Red | 2.5 |
19 | Potato | Vegetable | 0.7 | Brown | 2.5 |
7 | Broccoli | Vegetable | 1.1 | Green | 4.0 |
13 | Spinach | Vegetable | 1.2 | Green | 5.0 |
Now, all of the Rank values are calculated within each category.
Pivot#
Pivoting is a way to transform the data from a long format to a wide format. In our example, the data is long because each food has its own row. To convert it to a wide format, we need to decide the new rows, columns, and what goes at their intersection. These new rows and columns should come from existing categorical columns in the dataset.
For example, we can create a table that shows:
Rows: Food categories
Columns: Food colors
Values: Count of foods in each category-color combination
This gives us a quick way to see patterns like:
Which colors are most common in each food category?
Are there any category-color combinations with no foods?
The pivot_table function in pandas makes this transformation easy. The basic syntax is:
pd.pivot_table(df, # DataFrame to pivot
index='col1', # Column to use for new rows
columns='col2', # Column to use for new columns
values='col3', # Values to put in cells
aggfunc='func') # How to aggregate multiple values
pd.pivot_table(food_data, # Dataframe to pivot
index='Category', # New rows
columns='Color', # New columns
values='Food', # Values to consider
aggfunc='count', # Function to aggregate multiple values into a single value
fill_value=0) # Fill missing values with
Color | Brown | Green | Red | Yellow |
---|---|---|---|---|
Category | ||||
Dairy | 0 | 0 | 0 | 2 |
Fruit | 0 | 0 | 4 | 1 |
Grain | 3 | 0 | 0 | 0 |
Protein | 3 | 0 | 1 | 1 |
Vegetable | 1 | 2 | 1 | 1 |
Or we can change the values to price and find the mean price of foods in each category and color.
pd.pivot_table(food_data, # Dataframe to pivot
index='Category', # New rows
columns='Color', # New columns
values='Price', # Values to consider
aggfunc='mean', # Function to aggregate multiple values into a single value
fill_value=None) # Leave missing values empty
Color | Brown | Green | Red | Yellow |
---|---|---|---|---|
Category | ||||
Dairy | NaN | NaN | NaN | 1.75 |
Fruit | NaN | NaN | 1.85 | 0.90 |
Grain | 2.533333 | NaN | NaN | NaN |
Protein | 2.133333 | NaN | 4.50 | 0.20 |
Vegetable | 0.700000 | 1.15 | 0.70 | 0.50 |
Conclusion#
In this section, we have covered the following functions:
groupby
: Used to group the data by a certain column and then apply a function to other columns.pivot_table
: Used to summarize the data by pivoting it from a long format to a wide format.