Reshaping Data#

In this module we will cover different ways to reshape data to summarize large datasets into useful information. This page covers the different functions using a bite-sized example.

Food Table#

This simple table contains a list of 20 foods along with category, color, and price. We will use this table to demonstrate the different functions for reshaping data.

import pandas as pd

food_data = pd.read_csv('food_table.csv')
food_data
Food Category Price Color
0 Apple Fruit 1.2 Red
1 Banana Fruit 0.9 Yellow
2 Carrot Vegetable 0.5 Yellow
3 Tomato Vegetable 0.7 Red
4 Chicken Protein 3.0 Brown
5 Salmon Protein 4.5 Red
6 Brown Rice Grain 2.0 Brown
7 Broccoli Vegetable 1.1 Green
8 Strawberry Fruit 2.0 Red
9 Cheddar Cheese Dairy 2.5 Yellow
10 Yogurt Dairy 1.0 Yellow
11 Blueberry Fruit 2.2 Red
12 Lentils Protein 1.6 Brown
13 Spinach Vegetable 1.2 Green
14 Tofu Protein 1.8 Brown
15 Quinoa Grain 3.1 Brown
16 Oatmeal Grain 2.5 Brown
17 Egg Protein 0.2 Yellow
18 Grapes Fruit 2.0 Red
19 Potato Vegetable 0.7 Brown

Groupby#

First, we will cover groupby. This function is used to group the data by a certain column and then apply a function to other columns. The groupby function is useful when you want to split your data into groups based on some criteria and then perform calculations on each group separately.

For example, if you want to know:

  • How many foods of each color are there?

  • What is the average price of foods in each category?

  • What is the most expensive food in each category?

The general syntax is:

df.groupby('column_to_group_by')['column_to_analyze'].function()

Let’s look at some examples below.

Aggregate#

In aggregation, we summarize multiple values into a single value. This can be the count, mean, sum, median, or any provided function.

# Group by color and count the number of foods in each

food_data.groupby('Color')['Food'].count()
Color
Brown     7
Green     2
Red       6
Yellow    5
Name: Food, dtype: int64
# Group by category and count the number of foods in each

food_data.groupby('Category')['Food'].count()
Category
Dairy        2
Fruit        5
Grain        3
Protein      5
Vegetable    5
Name: Food, dtype: int64
# Group by category and find the mean price of foods in each category

food_data.groupby('Category')['Price'].mean()
Category
Dairy        1.750000
Fruit        1.660000
Grain        2.533333
Protein      2.220000
Vegetable    0.840000
Name: Price, dtype: float64
# You can also apply multiple functions at once
food_data.groupby('Category')['Price'].agg(['mean', 'count'])
mean count
Category
Dairy 1.750000 2
Fruit 1.660000 5
Grain 2.533333 3
Protein 2.220000 5
Vegetable 0.840000 5

Transform#

Sometimes, you need to have a new column that is a transformation of the existing column.

In our example, let’s put a rank for the most cheapest food in each category. This new column needs to be the same length as the original column, so, the .agg method won’t work. Instead we use the .transform method.

food_data['Rank'] = food_data.groupby('Category')['Price'].transform('rank')
food_data.sort_values(['Category', 'Price'], ascending=True)
Food Category Price Color Rank
10 Yogurt Dairy 1.0 Yellow 1.0
9 Cheddar Cheese Dairy 2.5 Yellow 2.0
1 Banana Fruit 0.9 Yellow 1.0
0 Apple Fruit 1.2 Red 2.0
8 Strawberry Fruit 2.0 Red 3.5
18 Grapes Fruit 2.0 Red 3.5
11 Blueberry Fruit 2.2 Red 5.0
6 Brown Rice Grain 2.0 Brown 1.0
16 Oatmeal Grain 2.5 Brown 2.0
15 Quinoa Grain 3.1 Brown 3.0
17 Egg Protein 0.2 Yellow 1.0
12 Lentils Protein 1.6 Brown 2.0
14 Tofu Protein 1.8 Brown 3.0
4 Chicken Protein 3.0 Brown 4.0
5 Salmon Protein 4.5 Red 5.0
2 Carrot Vegetable 0.5 Yellow 1.0
3 Tomato Vegetable 0.7 Red 2.5
19 Potato Vegetable 0.7 Brown 2.5
7 Broccoli Vegetable 1.1 Green 4.0
13 Spinach Vegetable 1.2 Green 5.0

Now, all of the Rank values are calculated within each category.

Pivot#

Pivoting is a way to transform the data from a long format to a wide format. In our example, the data is long because each food has its own row. To convert it to a wide format, we need to decide the new rows, columns, and what goes at their intersection. These new rows and columns should come from existing categorical columns in the dataset.

For example, we can create a table that shows:

  • Rows: Food categories

  • Columns: Food colors

  • Values: Count of foods in each category-color combination

This gives us a quick way to see patterns like:

  • Which colors are most common in each food category?

  • Are there any category-color combinations with no foods?

The pivot_table function in pandas makes this transformation easy. The basic syntax is:

pd.pivot_table(df,              # DataFrame to pivot
               index='col1',    # Column to use for new rows
               columns='col2',  # Column to use for new columns 
               values='col3',   # Values to put in cells
               aggfunc='func')  # How to aggregate multiple values
pd.pivot_table(food_data,         # Dataframe to pivot
               index='Category',  # New rows
               columns='Color',   # New columns
               values='Food',     # Values to consider
               aggfunc='count',   # Function to aggregate multiple values into a single value
               fill_value=0)      # Fill missing values with
Color Brown Green Red Yellow
Category
Dairy 0 0 0 2
Fruit 0 0 4 1
Grain 3 0 0 0
Protein 3 0 1 1
Vegetable 1 2 1 1

Or we can change the values to price and find the mean price of foods in each category and color.

pd.pivot_table(food_data,         # Dataframe to pivot
               index='Category',  # New rows
               columns='Color',   # New columns
               values='Price',    # Values to consider
               aggfunc='mean',    # Function to aggregate multiple values into a single value
               fill_value=None)   # Leave missing values empty
Color Brown Green Red Yellow
Category
Dairy NaN NaN NaN 1.75
Fruit NaN NaN 1.85 0.90
Grain 2.533333 NaN NaN NaN
Protein 2.133333 NaN 4.50 0.20
Vegetable 0.700000 1.15 0.70 0.50

Conclusion#

In this section, we have covered the following functions:

  • groupby: Used to group the data by a certain column and then apply a function to other columns.

  • pivot_table: Used to summarize the data by pivoting it from a long format to a wide format.