Walkthrough#
Introduction#
An emerging area of biomedical research over the past decade as been the human microbiome. This field studies the commensal bacteria that inhabit our bodies and how they influence our health. These can be found everywhere from our digestive system, to our skin, to our ears, and to every part of our body. Often times hundreds of different bacterial species can be isolated from a single body site of a single individual. Disease can be caused or exacerbated by an imbalance in these species.
This week, we will explore the data generated by researchers here at Drexel. From a collection of 12 patients they measured the microbiome of of 12 body sites in their nasal passages. Some of these patients had inner ear infections (otitis media) and different disease outcomes. This week, we will use Python to generate pivot-tables and bar-plots to understand whether the microbiome is impacted by disease outcome.
Learning Objectives#
At the end of this learning activity you will be able to:
Practice using
query
to extract data from a larger table.Calculate summary values across a a
pd.DataFrame
using methods like sum, mean, and max.Utilize
pd.DataFrame.groupby
to aggregate and transform data by group.Use
pd.merge
to combine data held in two different tables.Employ
pd.pivot_table
andpd.melt
to reshape and summarize data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('microbiome_phylum_data.csv', sep = '\t')
data
Patient | Location | CollectionType | Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | num_otu | Predominant | |
---|---|---|---|---|---|---|---|---|---|
0 | 3062 | Nasal Vestibule | Swab | 2516 | 44 | 14987 | 0 | 16 | Firmicutes |
1 | 3094 | Nasal Vestibule | Swab | 103 | 0 | 1397 | 0 | 15 | Firmicutes |
2 | 3095 | Nasal Vestibule | Swab | 1474 | 0 | 5510 | 29 | 21 | Firmicutes |
3 | 3115 | Nasal Vestibule | Swab | 0 | 0 | 5480 | 0 | 2 | Firmicutes |
4 | 3116 | Nasal Vestibule | Swab | 2 | 0 | 2324 | 1 | 4 | Firmicutes |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103 | 3094 | Sphenoid Tissue | Biopsy | 1540 | 0 | 784 | 13 | 14 | Actinobacteria |
104 | 3095 | Sphenoid Tissue | Biopsy | 670 | 0 | 703 | 0 | 14 | Firmicutes |
105 | 3116 | Sphenoid Tissue | Biopsy | 309 | 9 | 6709 | 495 | 17 | Firmicutes |
106 | 3117 | Sphenoid Tissue | Biopsy | 1812 | 0 | 1954 | 129 | 18 | Firmicutes |
107 | 3119 | Sphenoid Tissue | Biopsy | 4183 | 0 | 988 | 13 | 23 | Actinobacteria |
108 rows × 9 columns
Exploring a single patient#
First, we’ll explore the distribution of bacteria of a single individual across body site.
Q1: Extract the information for patient 3116#
pat_3116 = data.query('Patient == 3116') # SOLUTION
pat_3116.head()
Patient | Location | CollectionType | Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | num_otu | Predominant | |
---|---|---|---|---|---|---|---|---|---|
4 | 3116 | Nasal Vestibule | Swab | 2 | 0 | 2324 | 1 | 4 | Firmicutes |
14 | 3116 | Head of Inferior Turbinate Tissue | Biopsy | 19 | 0 | 6261 | 21 | 5 | Firmicutes |
25 | 3116 | Middle Meatus | Swab | 5 | 0 | 988 | 2 | 10 | Firmicutes |
36 | 3116 | Uncinate Process Tissue | Biopsy | 17 | 0 | 7249 | 79 | 12 | Firmicutes |
42 | 3116 | Maxillary Sinus | Swab | 34 | 0 | 5748 | 190 | 11 | Firmicutes |
grader.check("q1_extract_single")
Q2: Calculate the average count across regions for each phylum for patient 3116.#
q2_Actinobacteria_mean = pat_3116['Actinobacteria'].mean() # SOLUTION
q2_Bacteroidetes_mean = pat_3116['Bacteroidetes'].mean() # SOLUTION
q2_Firmicutes_mean = pat_3116['Firmicutes'].mean() # SOLUTION
q2_Proteobacteria_mean = pat_3116['Proteobacteria'].mean() # SOLUTION
grader.check("q2_summary_vals")
Summarizing by grouping#
Now that we’ve looked at the summary values for a single individual, how would we look at this for each individual?
Copy-pasting that over and over is unsustainable, DataFrame
s have useful methods for dealing with this problem.
All of these fall into the same basic strategy.
Split - Apply - Combine.
# Split
grouped_patients = data.groupby('Patient')
# Apply - Combine
# Capitalizing constants is useful if you will re-use them often.
PHYLUM_COLS = ['Actinobacteria', 'Bacteroidetes',
'Firmicutes', 'Proteobacteria']
mean_vals = grouped_patients[PHYLUM_COLS].mean()
mean_vals
Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
---|---|---|---|---|
Patient | ||||
3062 | 727.571429 | 7.714286 | 4513.714286 | 2.000000 |
3094 | 1356.900000 | 0.100000 | 2384.300000 | 79.100000 |
3095 | 1447.111111 | 0.111111 | 2753.666667 | 47.777778 |
3115 | 49.875000 | 1.375000 | 4266.875000 | 22.000000 |
3116 | 95.416667 | 1.250000 | 5135.500000 | 299.750000 |
3117 | 1164.333333 | 0.000000 | 1859.777778 | 69.000000 |
3118 | 1329.125000 | 3.125000 | 3271.125000 | 224.375000 |
3119 | 2692.727273 | 0.000000 | 1437.636364 | 5.727273 |
3120 | 1679.333333 | 0.000000 | 1435.555556 | 18.888889 |
3121 | 1205.428571 | 18.571429 | 3486.142857 | 29.714286 |
3123 | 1852.142857 | 12.285714 | 2536.428571 | 129.285714 |
3124 | 1603.545455 | 0.000000 | 1998.545455 | 39.272727 |
# This is commonly done in a single "sentence"
mean_vals = data.groupby('Patient')[PHYLUM_COLS].mean()
Q3: Calculate the average counts of each phylum by body site.#
q3_mean_phylum_site = data.groupby('Location')[PHYLUM_COLS].mean() # SOLUTION
q3_mean_phylum_site
Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
---|---|---|---|---|
Location | ||||
Ethmoid Culture (Deep to Ethmoid Bulla) | 1483.750000 | 3.416667 | 2757.416667 | 127.000000 |
Ethmoid Tissue (Deep to Ethmoid Bulla) | 1449.444444 | 0.777778 | 1715.555556 | 92.111111 |
Head of Inferior Turbinate Tissue | 317.000000 | 0.636364 | 2165.727273 | 24.727273 |
Maxillary Sinus | 1691.000000 | 0.000000 | 3002.750000 | 122.500000 |
Maxillary Sinus Tissue | 976.666667 | 0.000000 | 2966.000000 | 77.000000 |
Middle Meatus | 1158.363636 | 0.181818 | 3131.363636 | 43.000000 |
Nasal Vestibule | 1161.400000 | 12.500000 | 4373.000000 | 37.900000 |
Sphenoethmoidal Recess Tissue | 930.888889 | 4.000000 | 2019.666667 | 175.777778 |
Sphenoid | 1974.750000 | 5.416667 | 3821.000000 | 69.666667 |
Sphenoid Tissue | 1702.800000 | 1.800000 | 2227.600000 | 130.000000 |
Superior Meatus | 1704.416667 | 1.166667 | 3547.250000 | 127.416667 |
Uncinate Process Tissue | 868.900000 | 1.700000 | 2259.600000 | 40.700000 |
grader.check("q3_mean_by_site")
There are a number of different built-in summary functions like this.
data.groupby('Patient')[PHYLUM_COLS].median()
Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
---|---|---|---|---|
Patient | ||||
3062 | 461.0 | 2.0 | 2714.0 | 0.0 |
3094 | 1000.0 | 0.0 | 1467.5 | 41.5 |
3095 | 1223.0 | 0.0 | 1244.0 | 15.0 |
3115 | 15.0 | 0.0 | 2846.5 | 1.5 |
3116 | 37.0 | 0.0 | 6004.5 | 183.5 |
3117 | 1108.0 | 0.0 | 1800.0 | 66.0 |
3118 | 1023.5 | 0.0 | 3343.0 | 152.5 |
3119 | 2605.0 | 0.0 | 988.0 | 3.0 |
3120 | 468.0 | 0.0 | 1301.0 | 27.0 |
3121 | 911.0 | 12.0 | 3686.0 | 13.0 |
3123 | 1010.0 | 0.0 | 1207.0 | 63.0 |
3124 | 1602.0 | 0.0 | 2133.0 | 19.0 |
data.groupby('Patient')[PHYLUM_COLS].count()
Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
---|---|---|---|---|
Patient | ||||
3062 | 7 | 7 | 7 | 7 |
3094 | 10 | 10 | 10 | 10 |
3095 | 9 | 9 | 9 | 9 |
3115 | 8 | 8 | 8 | 8 |
3116 | 12 | 12 | 12 | 12 |
3117 | 9 | 9 | 9 | 9 |
3118 | 8 | 8 | 8 | 8 |
3119 | 11 | 11 | 11 | 11 |
3120 | 9 | 9 | 9 | 9 |
3121 | 7 | 7 | 7 | 7 |
3123 | 7 | 7 | 7 | 7 |
3124 | 11 | 11 | 11 | 11 |
data.groupby('Patient')[PHYLUM_COLS].max()
Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
---|---|---|---|---|
Patient | ||||
3062 | 2516 | 44 | 14987 | 6 |
3094 | 4604 | 1 | 5559 | 406 |
3095 | 3926 | 1 | 10097 | 166 |
3115 | 211 | 11 | 13948 | 108 |
3116 | 340 | 9 | 9796 | 1139 |
3117 | 1812 | 0 | 3671 | 129 |
3118 | 3709 | 12 | 5253 | 771 |
3119 | 7514 | 0 | 4372 | 18 |
3120 | 4195 | 0 | 2655 | 36 |
3121 | 2816 | 63 | 6135 | 78 |
3123 | 4186 | 81 | 5885 | 323 |
3124 | 4138 | 0 | 3979 | 183 |
You can see an extensive list of available summary functions at the Pandas Documentation
If there isn’t a function that does what you want, you can also make your own.
Here is a simple one that scales the data to a unit-norm.
def unit_norm(values):
"Given a series, return a scaled version"
mu = values.mean()
std = values.std()
return (values-mu)/std
unit_normed_data = data.groupby('Patient', as_index=False)[PHYLUM_COLS].transform(unit_norm)
unit_normed_data
Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
---|---|---|---|---|
0 | 2.158109 | 2.247876 | 2.140374 | -0.755929 |
1 | -0.901854 | -0.316228 | -0.515722 | -0.660942 |
2 | 0.025250 | -0.333333 | 0.862714 | -0.303077 |
3 | -0.694809 | -0.353553 | 0.287822 | -0.549657 |
4 | -0.790041 | -0.441315 | -0.985677 | -0.838082 |
... | ... | ... | ... | ... |
103 | 0.131693 | -0.316228 | -0.835926 | -0.552316 |
104 | -0.729756 | -0.333333 | -0.641845 | -0.771142 |
105 | 1.806312 | 2.736155 | 0.551650 | 0.547734 |
106 | 1.315743 | NaN | 0.101533 | 1.390656 |
107 | 0.700800 | NaN | -0.331381 | 1.106277 |
108 rows × 4 columns
Notice I used the transform
method here instead of a common name.
When applying custom functions to groups of data there are three different methods depending on your final output shape:
.aggregate()
or.agg()
- Each group of data produces a single summary number. Commonly used to summarize groups..transform()
- The output will have the same number (and order) of rows as the input. Commonly used for normalizations..apply()
- Everything else.
Merging data#
Now we come to a common problem, our sample information is in a different file.
sample_info = pd.read_csv('sample_info.csv')
sample_info.head()
PID | severe_disease | disease_type | |
---|---|---|---|
0 | 3062 | False | persistent |
1 | 3094 | False | persistent |
2 | 3095 | False | persistent |
3 | 3115 | True | typical |
4 | 3116 | True | typical |
Now that we have two DataFrame
s with a common key we can use pd.merge
.
merged_info = pd.merge(data, sample_info,
left_on = 'Patient', # The column of the key in biome_data
right_on = 'PID', # The column of the key in sample_info
how = 'inner') # Keep only those in both
merged_info.head()
Patient | Location | CollectionType | Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | num_otu | Predominant | PID | severe_disease | disease_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3062 | Nasal Vestibule | Swab | 2516 | 44 | 14987 | 0 | 16 | Firmicutes | 3062 | False | persistent |
1 | 3094 | Nasal Vestibule | Swab | 103 | 0 | 1397 | 0 | 15 | Firmicutes | 3094 | False | persistent |
2 | 3095 | Nasal Vestibule | Swab | 1474 | 0 | 5510 | 29 | 21 | Firmicutes | 3095 | False | persistent |
3 | 3115 | Nasal Vestibule | Swab | 0 | 0 | 5480 | 0 | 2 | Firmicutes | 3115 | True | typical |
4 | 3116 | Nasal Vestibule | Swab | 2 | 0 | 2324 | 1 | 4 | Firmicutes | 3116 | True | typical |
Q4: Calculate the average counts of each phylum by severe_disease
.#
q4_severe_means = merged_info.groupby('severe_disease')[PHYLUM_COLS].mean() # SOLUTION
grader.check("q4_servere")
We can also do more advanced things like this:
merged_info.groupby(['Location', 'severe_disease'])[PHYLUM_COLS].aggregate(['mean', 'std'])
Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | ||||||
---|---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | ||
Location | severe_disease | ||||||||
Ethmoid Culture (Deep to Ethmoid Bulla) | False | 1382.714286 | 875.245435 | 4.142857 | 10.106575 | 2053.142857 | 1314.670989 | 31.571429 | 32.444091 |
True | 1625.200000 | 1698.904706 | 2.400000 | 5.366563 | 3743.400000 | 3394.971988 | 260.600000 | 492.952127 | |
Ethmoid Tissue (Deep to Ethmoid Bulla) | False | 1689.400000 | 1065.430570 | 0.000000 | 0.000000 | 1232.200000 | 722.258403 | 91.400000 | 64.103822 |
True | 1149.500000 | 1066.299676 | 1.750000 | 3.500000 | 2319.750000 | 2767.632728 | 93.000000 | 125.078642 | |
Head of Inferior Turbinate Tissue | False | 394.833333 | 282.110912 | 1.166667 | 2.401388 | 1123.333333 | 343.322395 | 19.500000 | 26.883080 |
True | 223.600000 | 219.127360 | 0.000000 | 0.000000 | 3416.600000 | 2021.523510 | 31.000000 | 42.023803 | |
Maxillary Sinus | False | 2539.500000 | 2260.620379 | 0.000000 | 0.000000 | 2999.000000 | 1224.708945 | 150.000000 | 193.747258 |
True | 842.500000 | 1143.391665 | 0.000000 | 0.000000 | 3006.500000 | 3877.066481 | 95.000000 | 134.350288 | |
Maxillary Sinus Tissue | False | 265.000000 | NaN | 0.000000 | NaN | 448.000000 | NaN | 17.000000 | NaN |
True | 1332.500000 | 1827.871029 | 0.000000 | 0.000000 | 4225.000000 | 5392.396313 | 107.000000 | 147.078210 | |
Middle Meatus | False | 1171.000000 | 747.548527 | 0.333333 | 0.816497 | 2916.166667 | 940.973627 | 19.666667 | 18.062853 |
True | 1143.200000 | 1343.776097 | 0.000000 | 0.000000 | 3389.600000 | 1624.755459 | 71.000000 | 97.739450 | |
Nasal Vestibule | False | 1598.000000 | 1496.139811 | 17.857143 | 32.313199 | 4987.000000 | 4857.272383 | 52.714286 | 119.600326 |
True | 142.666667 | 245.375902 | 0.000000 | 0.000000 | 2940.333333 | 2294.448154 | 3.333333 | 4.932883 | |
Sphenoethmoidal Recess Tissue | False | 1286.000000 | 1863.761385 | 6.200000 | 11.278298 | 1723.600000 | 2203.662020 | 95.000000 | 174.608133 |
True | 487.000000 | 427.834080 | 1.250000 | 2.500000 | 2389.750000 | 1640.061660 | 276.750000 | 382.792698 | |
Sphenoid | False | 1953.428571 | 1362.865958 | 9.142857 | 23.751692 | 3441.000000 | 2253.827337 | 35.142857 | 36.503098 |
True | 2004.600000 | 3145.778966 | 0.200000 | 0.447214 | 4353.000000 | 5425.135298 | 118.000000 | 124.715677 | |
Sphenoid Tissue | False | 1105.000000 | 615.182900 | 0.000000 | 0.000000 | 743.500000 | 57.275649 | 6.500000 | 9.192388 |
True | 2101.333333 | 1953.139609 | 3.000000 | 5.196152 | 3217.000000 | 3062.488694 | 212.333333 | 251.573714 | |
Superior Meatus | False | 2235.000000 | 1402.061102 | 2.000000 | 4.472136 | 4252.285714 | 2796.542014 | 75.000000 | 69.622793 |
True | 961.600000 | 847.056846 | 0.000000 | 0.000000 | 2560.200000 | 1857.766186 | 200.800000 | 326.720370 | |
Uncinate Process Tissue | False | 766.400000 | 718.500383 | 0.000000 | 0.000000 | 1440.200000 | 291.232210 | 18.400000 | 17.700282 |
True | 971.400000 | 1580.324745 | 3.400000 | 4.979960 | 3079.000000 | 2892.901658 | 63.000000 | 60.930288 |
Here I’ve broken things down by body-site and disease status and calculated both a mean and standard deviation. In future lectures we will explore how to quantify this with a significance test. For now, we’ll leave it as a visual comparison.
Pivoting & Melting Dataframes#
This is a process of reshaping, and optionally summarizing, your data as you convert it between wide
and long
format.
These techniques are often required for generating different types of plots.
These are best shown by example.
Pivoting#
long
-> wide
pd.pivot_table(merged_info,
index = 'Patient',
columns = 'Location',
values = 'Firmicutes',
aggfunc = 'mean')
Location | Ethmoid Culture (Deep to Ethmoid Bulla) | Ethmoid Tissue (Deep to Ethmoid Bulla) | Head of Inferior Turbinate Tissue | Maxillary Sinus | Maxillary Sinus Tissue | Middle Meatus | Nasal Vestibule | Sphenoethmoidal Recess Tissue | Sphenoid | Sphenoid Tissue | Superior Meatus | Uncinate Process Tissue |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Patient | ||||||||||||
3062 | 2714.0 | NaN | 1746.0 | NaN | NaN | 1822.0 | 14987.0 | NaN | 6125.0 | NaN | 2992.0 | 1210.0 |
3094 | 385.0 | 664.0 | 1089.0 | NaN | NaN | 4316.0 | 1397.0 | 5559.0 | 4648.0 | 784.0 | 3463.0 | 1538.0 |
3095 | 2300.0 | 651.0 | 760.0 | NaN | NaN | 2999.0 | 5510.0 | NaN | 519.0 | 703.0 | 10097.0 | 1244.0 |
3115 | 2168.0 | NaN | 4640.0 | NaN | NaN | 3525.0 | 5480.0 | 1861.0 | 13948.0 | NaN | 982.0 | 1531.0 |
3116 | 9796.0 | 6357.0 | 6261.0 | 5748.0 | 8038.0 | 988.0 | 2324.0 | 3673.0 | 617.0 | 6709.0 | 3866.0 | 7249.0 |
3117 | 1800.0 | 672.0 | 1499.0 | NaN | NaN | 2810.0 | 1017.0 | NaN | 2082.0 | 1954.0 | 3671.0 | 1233.0 |
3118 | 2459.0 | 380.0 | 2962.0 | NaN | NaN | 5253.0 | NaN | 3724.0 | 2322.0 | NaN | 4144.0 | 4925.0 |
3119 | 2494.0 | 1870.0 | 1721.0 | 265.0 | 412.0 | 4372.0 | NaN | 301.0 | 2796.0 | 988.0 | 138.0 | 457.0 |
3120 | 905.0 | 1614.0 | 1136.0 | NaN | NaN | 2342.0 | 515.0 | 415.0 | 2655.0 | NaN | 2037.0 | 1301.0 |
3121 | 3060.0 | NaN | 863.0 | NaN | NaN | 3686.0 | 4295.0 | 1624.0 | 6135.0 | NaN | 4740.0 | NaN |
3123 | 1029.0 | 914.0 | NaN | 3865.0 | NaN | NaN | 5885.0 | 358.0 | 1207.0 | NaN | 4497.0 | NaN |
3124 | 3979.0 | 2318.0 | 1146.0 | 2133.0 | 448.0 | 2332.0 | 2320.0 | 662.0 | 2798.0 | NaN | 1940.0 | 1908.0 |
This took our “long” data format in which each row represented the observation at a different site of a different person
and converted it into a “wide” data format such that each row is a patient and each column is a number of Firmicutes at a location.
NaNs
represent missing information.
We also had to “give up” some information for this transformation … this is only Firmicutes.
One can do this to include more information:
pd.pivot_table(merged_info,
index = 'Patient',
columns = 'Location',
values = ['Actinobacteria', 'Firmicutes'],
aggfunc = 'mean')
Actinobacteria | ... | Firmicutes | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Location | Ethmoid Culture (Deep to Ethmoid Bulla) | Ethmoid Tissue (Deep to Ethmoid Bulla) | Head of Inferior Turbinate Tissue | Maxillary Sinus | Maxillary Sinus Tissue | Middle Meatus | Nasal Vestibule | Sphenoethmoidal Recess Tissue | Sphenoid | Sphenoid Tissue | ... | Head of Inferior Turbinate Tissue | Maxillary Sinus | Maxillary Sinus Tissue | Middle Meatus | Nasal Vestibule | Sphenoethmoidal Recess Tissue | Sphenoid | Sphenoid Tissue | Superior Meatus | Uncinate Process Tissue |
Patient | |||||||||||||||||||||
3062 | 461.0 | NaN | 232.0 | NaN | NaN | 292.0 | 2516.0 | NaN | 623.0 | NaN | ... | 1746.0 | NaN | NaN | 1822.0 | 14987.0 | NaN | 6125.0 | NaN | 2992.0 | 1210.0 |
3094 | 243.0 | 638.0 | 282.0 | NaN | NaN | 1486.0 | 103.0 | 4604.0 | 2690.0 | 1540.0 | ... | 1089.0 | NaN | NaN | 4316.0 | 1397.0 | 5559.0 | 4648.0 | 784.0 | 3463.0 | 1538.0 |
3095 | 1652.0 | 1038.0 | 444.0 | NaN | NaN | 2010.0 | 1474.0 | NaN | 587.0 | 670.0 | ... | 760.0 | NaN | NaN | 2999.0 | 5510.0 | NaN | 519.0 | 703.0 | 10097.0 | 1244.0 |
3115 | 80.0 | NaN | 1.0 | NaN | NaN | 13.0 | 0.0 | 211.0 | 17.0 | NaN | ... | 4640.0 | NaN | NaN | 3525.0 | 5480.0 | 1861.0 | 13948.0 | NaN | 982.0 | 1531.0 |
3116 | 340.0 | 61.0 | 19.0 | 34.0 | 40.0 | 5.0 | 2.0 | 181.0 | 32.0 | 309.0 | ... | 6261.0 | 5748.0 | 8038.0 | 988.0 | 2324.0 | 3673.0 | 617.0 | 6709.0 | 3866.0 | 7249.0 |
3117 | 1546.0 | 1108.0 | 521.0 | NaN | NaN | 965.0 | 426.0 | NaN | 1537.0 | 1812.0 | ... | 1499.0 | NaN | NaN | 2810.0 | 1017.0 | NaN | 2082.0 | 1954.0 | 3671.0 | 1233.0 |
3118 | 1802.0 | 824.0 | 332.0 | NaN | NaN | 1465.0 | NaN | 454.0 | 923.0 | NaN | ... | 2962.0 | NaN | NaN | 5253.0 | NaN | 3724.0 | 2322.0 | NaN | 4144.0 | 4925.0 |
3119 | 4358.0 | 2605.0 | 245.0 | 1651.0 | 2625.0 | 3268.0 | NaN | 1102.0 | 7514.0 | 4183.0 | ... | 1721.0 | 265.0 | 412.0 | 4372.0 | NaN | 301.0 | 2796.0 | 988.0 | 138.0 | 457.0 |
3120 | 2740.0 | 3265.0 | 68.0 | NaN | NaN | 405.0 | 107.0 | 468.0 | 4195.0 | NaN | ... | 1136.0 | NaN | NaN | 2342.0 | 515.0 | 415.0 | 2655.0 | NaN | 2037.0 | 1301.0 |
3121 | 1971.0 | NaN | 452.0 | NaN | NaN | 911.0 | 582.0 | 399.0 | 2816.0 | NaN | ... | 863.0 | NaN | NaN | 3686.0 | 4295.0 | 1624.0 | 6135.0 | NaN | 4740.0 | NaN |
3123 | 1010.0 | 2263.0 | NaN | 941.0 | NaN | NaN | 4186.0 | 226.0 | 861.0 | NaN | ... | NaN | 3865.0 | NaN | NaN | 5885.0 | 358.0 | 1207.0 | NaN | 4497.0 | NaN |
3124 | 1602.0 | 1243.0 | 891.0 | 4138.0 | 265.0 | 1922.0 | 2218.0 | 733.0 | 1902.0 | NaN | ... | 1146.0 | 2133.0 | 448.0 | 2332.0 | 2320.0 | 662.0 | 2798.0 | NaN | 1940.0 | 1908.0 |
12 rows × 24 columns
But that is usually not a great idea.
Melting#
wide
-> long
Our data is part long
and part wide
(like most real datasets).
In some plotting instances we may want to make it “longer” by having each bacteria be a diffent row instead of a different column.
pd.melt(merged_info,
id_vars = ['Patient', 'Location'], # The things you want preserved in each row
value_vars = PHYLUM_COLS, # The columns you want to melt
var_name = 'Phylum', # The name of the column that will have the value_var name
value_name = 'Counts') # The name of the column that will have the value
Patient | Location | Phylum | Counts | |
---|---|---|---|---|
0 | 3062 | Nasal Vestibule | Actinobacteria | 2516 |
1 | 3094 | Nasal Vestibule | Actinobacteria | 103 |
2 | 3095 | Nasal Vestibule | Actinobacteria | 1474 |
3 | 3115 | Nasal Vestibule | Actinobacteria | 0 |
4 | 3116 | Nasal Vestibule | Actinobacteria | 2 |
... | ... | ... | ... | ... |
427 | 3094 | Sphenoid Tissue | Proteobacteria | 13 |
428 | 3095 | Sphenoid Tissue | Proteobacteria | 0 |
429 | 3116 | Sphenoid Tissue | Proteobacteria | 495 |
430 | 3117 | Sphenoid Tissue | Proteobacteria | 129 |
431 | 3119 | Sphenoid Tissue | Proteobacteria | 13 |
432 rows × 4 columns
We’ll explore these in more detail as we move into plotting next week. This is just a taste.