Reshaping Data

Reshaping Data#

In this module we will cover different ways to reshape data to summarize large datasets into useful information. This page covers the different functions using a bite-sized example.

Food Table#

This simple table contains a list of 20 foods along with category, color, and price. We will use this table to demonstrate the different functions for reshaping data.

import pandas as pd

food_data = pd.read_csv('food_table.csv')
food_data

	Food	Category	Price	Color
0	Apple	Fruit	1.2	Red
1	Banana	Fruit	0.9	Yellow
2	Carrot	Vegetable	0.5	Yellow
3	Tomato	Vegetable	0.7	Red
4	Chicken	Protein	3.0	Brown
5	Salmon	Protein	4.5	Red
6	Brown Rice	Grain	2.0	Brown
7	Broccoli	Vegetable	1.1	Green
8	Strawberry	Fruit	2.0	Red
9	Cheddar Cheese	Dairy	2.5	Yellow
10	Yogurt	Dairy	1.0	Yellow
11	Blueberry	Fruit	2.2	Red
12	Lentils	Protein	1.6	Brown
13	Spinach	Vegetable	1.2	Green
14	Tofu	Protein	1.8	Brown
15	Quinoa	Grain	3.1	Brown
16	Oatmeal	Grain	2.5	Brown
17	Egg	Protein	0.2	Yellow
18	Grapes	Fruit	2.0	Red
19	Potato	Vegetable	0.7	Brown

Groupby#

First, we will cover groupby. This function is used to group the data by a certain column and then apply a function to other columns. The groupby function is useful when you want to split your data into groups based on some criteria and then perform calculations on each group separately.

For example, if you want to know:

How many foods of each color are there?
What is the average price of foods in each category?
What is the most expensive food in each category?

The general syntax is:

df.groupby('column_to_group_by')['column_to_analyze'].function()

Let’s look at some examples below.

Aggregate#

In aggregation, we summarize multiple values into a single value. This can be the count, mean, sum, median, or any provided function.

# Group by color and count the number of foods in each

food_data.groupby('Color')['Food'].count()

Color
Brown     7
Green     2
Red       6
Yellow    5
Name: Food, dtype: int64

# Group by category and count the number of foods in each

food_data.groupby('Category')['Food'].count()

Category
Dairy        2
Fruit        5
Grain        3
Protein      5
Vegetable    5
Name: Food, dtype: int64

# Group by category and find the mean price of foods in each category

food_data.groupby('Category')['Price'].mean()

Category
Dairy        1.750000
Fruit        1.660000
Grain        2.533333
Protein      2.220000
Vegetable    0.840000
Name: Price, dtype: float64

# You can also apply multiple functions at once
food_data.groupby('Category')['Price'].agg(['mean', 'count'])

	mean	count
Category
Dairy	1.750000	2
Fruit	1.660000	5
Grain	2.533333	3
Protein	2.220000	5
Vegetable	0.840000	5

Transform#

Sometimes, you need to have a new column that is a transformation of the existing column.

In our example, let’s put a rank for the most cheapest food in each category. This new column needs to be the same length as the original column, so, the .agg method won’t work. Instead we use the .transform method.

food_data['Rank'] = food_data.groupby('Category')['Price'].transform('rank')
food_data.sort_values(['Category', 'Price'], ascending=True)

	Food	Category	Price	Color	Rank
10	Yogurt	Dairy	1.0	Yellow	1.0
9	Cheddar Cheese	Dairy	2.5	Yellow	2.0
1	Banana	Fruit	0.9	Yellow	1.0
0	Apple	Fruit	1.2	Red	2.0
8	Strawberry	Fruit	2.0	Red	3.5
18	Grapes	Fruit	2.0	Red	3.5
11	Blueberry	Fruit	2.2	Red	5.0
6	Brown Rice	Grain	2.0	Brown	1.0
16	Oatmeal	Grain	2.5	Brown	2.0
15	Quinoa	Grain	3.1	Brown	3.0
17	Egg	Protein	0.2	Yellow	1.0
12	Lentils	Protein	1.6	Brown	2.0
14	Tofu	Protein	1.8	Brown	3.0
4	Chicken	Protein	3.0	Brown	4.0
5	Salmon	Protein	4.5	Red	5.0
2	Carrot	Vegetable	0.5	Yellow	1.0
3	Tomato	Vegetable	0.7	Red	2.5
19	Potato	Vegetable	0.7	Brown	2.5
7	Broccoli	Vegetable	1.1	Green	4.0
13	Spinach	Vegetable	1.2	Green	5.0

Now, all of the Rank values are calculated within each category.

Pivot#

Pivoting is a way to transform the data from a long format to a wide format. In our example, the data is long because each food has its own row. To convert it to a wide format, we need to decide the new rows, columns, and what goes at their intersection. These new rows and columns should come from existing categorical columns in the dataset.

For example, we can create a table that shows:

Rows: Food categories
Columns: Food colors
Values: Count of foods in each category-color combination

This gives us a quick way to see patterns like:

Which colors are most common in each food category?
Are there any category-color combinations with no foods?

The pivot_table function in pandas makes this transformation easy. The basic syntax is:

pd.pivot_table(df,              # DataFrame to pivot
               index='col1',    # Column to use for new rows
               columns='col2',  # Column to use for new columns 
               values='col3',   # Values to put in cells
               aggfunc='func')  # How to aggregate multiple values

pd.pivot_table(food_data,         # Dataframe to pivot
               index='Category',  # New rows
               columns='Color',   # New columns
               values='Food',     # Values to consider
               aggfunc='count',   # Function to aggregate multiple values into a single value
               fill_value=0)      # Fill missing values with

Color	Brown	Green	Red	Yellow
Category
Dairy	0	0	0	2
Fruit	0	0	4	1
Grain	3	0	0	0
Protein	3	0	1	1
Vegetable	1	2	1	1

Or we can change the values to price and find the mean price of foods in each category and color.

pd.pivot_table(food_data,         # Dataframe to pivot
               index='Category',  # New rows
               columns='Color',   # New columns
               values='Price',    # Values to consider
               aggfunc='mean',    # Function to aggregate multiple values into a single value
               fill_value=None)   # Leave missing values empty

Color	Brown	Green	Red	Yellow
Category
Dairy	NaN	NaN	NaN	1.75
Fruit	NaN	NaN	1.85	0.90
Grain	2.533333	NaN	NaN	NaN
Protein	2.133333	NaN	4.50	0.20
Vegetable	0.700000	1.15	0.70	0.50

Conclusion#

In this section, we have covered the following functions:

groupby: Used to group the data by a certain column and then apply a function to other columns.
pivot_table: Used to summarize the data by pivoting it from a long format to a wide format.