Walkthrough#
Introduction#
An emerging area of biomedical research over the past decade as been the human microbiome. This field studies the commensal bacteria that inhabit our bodies and how they influence our health. These can be found everywhere from our digestive system, to our skin, to our ears, and to every part of our body. Often times hundreds of different bacterial species can be isolated from a single body site of a single individual. Disease can be caused or exacerbated by an imbalance in these species.
This week, we will explore the data generated by researchers here at Drexel. From a collection of 12 patients they measured the microbiome of of 12 body sites in their nasal passages. Some of these patients had inner ear infections (otitis media) and different disease outcomes. This week, we will use Python to generate pivot-tables and bar-plots to understand whether the microbiome is impacted by disease outcome.
Learning Objectives#
At the end of this learning activity you will be able to:
- Practice using - queryto extract data from a larger table.
- Calculate summary values across a a - pd.DataFrameusing methods like sum, mean, and max.
- Utilize - pd.DataFrame.groupbyto aggregate and transform data by group.
- Use - pd.mergeto combine data held in two different tables.
- Employ - pd.pivot_tableand- pd.meltto reshape and summarize data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Note that this is a .tsv, a tab-separated-value file
data = pd.read_csv('microbiome_phylum_data.tsv', sep = '\t')
data
| Patient | Location | CollectionType | Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | num_otu | Predominant | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3062 | Nasal Vestibule | Swab | 2516 | 44 | 14987 | 0 | 16 | Firmicutes | 
| 1 | 3094 | Nasal Vestibule | Swab | 103 | 0 | 1397 | 0 | 15 | Firmicutes | 
| 2 | 3095 | Nasal Vestibule | Swab | 1474 | 0 | 5510 | 29 | 21 | Firmicutes | 
| 3 | 3115 | Nasal Vestibule | Swab | 0 | 0 | 5480 | 0 | 2 | Firmicutes | 
| 4 | 3116 | Nasal Vestibule | Swab | 2 | 0 | 2324 | 1 | 4 | Firmicutes | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 103 | 3094 | Sphenoid Tissue | Biopsy | 1540 | 0 | 784 | 13 | 14 | Actinobacteria | 
| 104 | 3095 | Sphenoid Tissue | Biopsy | 670 | 0 | 703 | 0 | 14 | Firmicutes | 
| 105 | 3116 | Sphenoid Tissue | Biopsy | 309 | 9 | 6709 | 495 | 17 | Firmicutes | 
| 106 | 3117 | Sphenoid Tissue | Biopsy | 1812 | 0 | 1954 | 129 | 18 | Firmicutes | 
| 107 | 3119 | Sphenoid Tissue | Biopsy | 4183 | 0 | 988 | 13 | 23 | Actinobacteria | 
108 rows × 9 columns
Exploring a single patient#
First, we’ll explore the distribution of bacteria of a single individual across body site.
Q1: Extract the information for patient 3116#
pat_3116 = data.query('Patient == 3116')  # SOLUTION
pat_3116.head()
| Patient | Location | CollectionType | Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | num_otu | Predominant | |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 3116 | Nasal Vestibule | Swab | 2 | 0 | 2324 | 1 | 4 | Firmicutes | 
| 14 | 3116 | Head of Inferior Turbinate Tissue | Biopsy | 19 | 0 | 6261 | 21 | 5 | Firmicutes | 
| 25 | 3116 | Middle Meatus | Swab | 5 | 0 | 988 | 2 | 10 | Firmicutes | 
| 36 | 3116 | Uncinate Process Tissue | Biopsy | 17 | 0 | 7249 | 79 | 12 | Firmicutes | 
| 42 | 3116 | Maxillary Sinus | Swab | 34 | 0 | 5748 | 190 | 11 | Firmicutes | 
grader.check("q1_extract_single")
Q2: Calculate the average count across regions for each phylum for patient 3116.#
q2_Actinobacteria_mean = pat_3116['Actinobacteria'].mean() # SOLUTION
q2_Bacteroidetes_mean = pat_3116['Bacteroidetes'].mean() # SOLUTION
q2_Firmicutes_mean = pat_3116['Firmicutes'].mean() # SOLUTION
q2_Proteobacteria_mean = pat_3116['Proteobacteria'].mean() # SOLUTION
grader.check("q2_summary_vals")
Summarizing by grouping#
Now that we’ve looked at the summary values for a single individual, how would we look at this for each individual?
Copy-pasting that over and over is unsustainable, DataFrames have useful methods for dealing with this problem.
All of these fall into the same basic strategy.
Split - Apply - Combine.
# Split
grouped_patients = data.groupby('Patient')
# Apply - Combine
# Capitalizing constants is useful if you will re-use them often.
PHYLUM_COLS = ['Actinobacteria', 'Bacteroidetes',
                'Firmicutes', 'Proteobacteria']
mean_vals = grouped_patients[PHYLUM_COLS].mean()
mean_vals
| Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
|---|---|---|---|---|
| Patient | ||||
| 3062 | 727.571429 | 7.714286 | 4513.714286 | 2.000000 | 
| 3094 | 1356.900000 | 0.100000 | 2384.300000 | 79.100000 | 
| 3095 | 1447.111111 | 0.111111 | 2753.666667 | 47.777778 | 
| 3115 | 49.875000 | 1.375000 | 4266.875000 | 22.000000 | 
| 3116 | 95.416667 | 1.250000 | 5135.500000 | 299.750000 | 
| 3117 | 1164.333333 | 0.000000 | 1859.777778 | 69.000000 | 
| 3118 | 1329.125000 | 3.125000 | 3271.125000 | 224.375000 | 
| 3119 | 2692.727273 | 0.000000 | 1437.636364 | 5.727273 | 
| 3120 | 1679.333333 | 0.000000 | 1435.555556 | 18.888889 | 
| 3121 | 1205.428571 | 18.571429 | 3486.142857 | 29.714286 | 
| 3123 | 1852.142857 | 12.285714 | 2536.428571 | 129.285714 | 
| 3124 | 1603.545455 | 0.000000 | 1998.545455 | 39.272727 | 
# This is commonly done in a single "sentence"
mean_vals = data.groupby('Patient')[PHYLUM_COLS].mean()
Q3: Calculate the average counts of each phylum by body site.#
q3_mean_phylum_site = data.groupby('Location')[PHYLUM_COLS].mean() # SOLUTION
q3_mean_phylum_site
| Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
|---|---|---|---|---|
| Location | ||||
| Ethmoid Culture (Deep to Ethmoid Bulla) | 1483.750000 | 3.416667 | 2757.416667 | 127.000000 | 
| Ethmoid Tissue (Deep to Ethmoid Bulla) | 1449.444444 | 0.777778 | 1715.555556 | 92.111111 | 
| Head of Inferior Turbinate Tissue | 317.000000 | 0.636364 | 2165.727273 | 24.727273 | 
| Maxillary Sinus | 1691.000000 | 0.000000 | 3002.750000 | 122.500000 | 
| Maxillary Sinus Tissue | 976.666667 | 0.000000 | 2966.000000 | 77.000000 | 
| Middle Meatus | 1158.363636 | 0.181818 | 3131.363636 | 43.000000 | 
| Nasal Vestibule | 1161.400000 | 12.500000 | 4373.000000 | 37.900000 | 
| Sphenoethmoidal Recess Tissue | 930.888889 | 4.000000 | 2019.666667 | 175.777778 | 
| Sphenoid | 1974.750000 | 5.416667 | 3821.000000 | 69.666667 | 
| Sphenoid Tissue | 1702.800000 | 1.800000 | 2227.600000 | 130.000000 | 
| Superior Meatus | 1704.416667 | 1.166667 | 3547.250000 | 127.416667 | 
| Uncinate Process Tissue | 868.900000 | 1.700000 | 2259.600000 | 40.700000 | 
grader.check("q3_mean_by_site")
There are a number of different built-in summary functions like this.
data.groupby('Patient')[PHYLUM_COLS].median()
| Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
|---|---|---|---|---|
| Patient | ||||
| 3062 | 461.0 | 2.0 | 2714.0 | 0.0 | 
| 3094 | 1000.0 | 0.0 | 1467.5 | 41.5 | 
| 3095 | 1223.0 | 0.0 | 1244.0 | 15.0 | 
| 3115 | 15.0 | 0.0 | 2846.5 | 1.5 | 
| 3116 | 37.0 | 0.0 | 6004.5 | 183.5 | 
| 3117 | 1108.0 | 0.0 | 1800.0 | 66.0 | 
| 3118 | 1023.5 | 0.0 | 3343.0 | 152.5 | 
| 3119 | 2605.0 | 0.0 | 988.0 | 3.0 | 
| 3120 | 468.0 | 0.0 | 1301.0 | 27.0 | 
| 3121 | 911.0 | 12.0 | 3686.0 | 13.0 | 
| 3123 | 1010.0 | 0.0 | 1207.0 | 63.0 | 
| 3124 | 1602.0 | 0.0 | 2133.0 | 19.0 | 
data.groupby('Patient')[PHYLUM_COLS].count()
| Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
|---|---|---|---|---|
| Patient | ||||
| 3062 | 7 | 7 | 7 | 7 | 
| 3094 | 10 | 10 | 10 | 10 | 
| 3095 | 9 | 9 | 9 | 9 | 
| 3115 | 8 | 8 | 8 | 8 | 
| 3116 | 12 | 12 | 12 | 12 | 
| 3117 | 9 | 9 | 9 | 9 | 
| 3118 | 8 | 8 | 8 | 8 | 
| 3119 | 11 | 11 | 11 | 11 | 
| 3120 | 9 | 9 | 9 | 9 | 
| 3121 | 7 | 7 | 7 | 7 | 
| 3123 | 7 | 7 | 7 | 7 | 
| 3124 | 11 | 11 | 11 | 11 | 
data.groupby('Patient')[PHYLUM_COLS].max()
| Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
|---|---|---|---|---|
| Patient | ||||
| 3062 | 2516 | 44 | 14987 | 6 | 
| 3094 | 4604 | 1 | 5559 | 406 | 
| 3095 | 3926 | 1 | 10097 | 166 | 
| 3115 | 211 | 11 | 13948 | 108 | 
| 3116 | 340 | 9 | 9796 | 1139 | 
| 3117 | 1812 | 0 | 3671 | 129 | 
| 3118 | 3709 | 12 | 5253 | 771 | 
| 3119 | 7514 | 0 | 4372 | 18 | 
| 3120 | 4195 | 0 | 2655 | 36 | 
| 3121 | 2816 | 63 | 6135 | 78 | 
| 3123 | 4186 | 81 | 5885 | 323 | 
| 3124 | 4138 | 0 | 3979 | 183 | 
You can see an extensive list of available summary functions at the Pandas Documentation
If there isn’t a function that does what you want, you can also make your own.
Here is a simple one that scales the data to a unit-norm.
def unit_norm(values):
    "Given a series, return a scaled version"
    mu = values.mean()
    std = values.std()
    return (values-mu)/std
unit_normed_data = data.groupby('Patient', as_index=False)[PHYLUM_COLS].transform(unit_norm)
unit_normed_data
| Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | |
|---|---|---|---|---|
| 0 | 2.158109 | 2.247876 | 2.140374 | -0.755929 | 
| 1 | -0.901854 | -0.316228 | -0.515722 | -0.660942 | 
| 2 | 0.025250 | -0.333333 | 0.862714 | -0.303077 | 
| 3 | -0.694809 | -0.353553 | 0.287822 | -0.549657 | 
| 4 | -0.790041 | -0.441315 | -0.985677 | -0.838082 | 
| ... | ... | ... | ... | ... | 
| 103 | 0.131693 | -0.316228 | -0.835926 | -0.552316 | 
| 104 | -0.729756 | -0.333333 | -0.641845 | -0.771142 | 
| 105 | 1.806312 | 2.736155 | 0.551650 | 0.547734 | 
| 106 | 1.315743 | NaN | 0.101533 | 1.390656 | 
| 107 | 0.700800 | NaN | -0.331381 | 1.106277 | 
108 rows × 4 columns
Notice I used the transform method here instead of a common name.
When applying custom functions to groups of data there are three different methods depending on your final output shape:
- .aggregate()or- .agg()- Each group of data produces a single summary number. Commonly used to summarize groups.
- .transform()- The output will have the same number (and order) of rows as the input. Commonly used for normalizations.
- .apply()- Everything else.
Merging data#
Now we come to a common problem, our sample information is in a different file.
sample_info = pd.read_csv('sample_info.csv')
sample_info.head()
| PID | severe_disease | disease_type | |
|---|---|---|---|
| 0 | 3062 | False | persistent | 
| 1 | 3094 | False | persistent | 
| 2 | 3095 | False | persistent | 
| 3 | 3115 | True | typical | 
| 4 | 3116 | True | typical | 
Now that we have two DataFrames with a common key we can use pd.merge.
merged_info = pd.merge(data, sample_info,
                       left_on = 'Patient', # The column of the key in biome_data
                       right_on = 'PID', # The column of the key in sample_info
                       how = 'inner') # Keep only those in both
merged_info.head()
| Patient | Location | CollectionType | Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | num_otu | Predominant | PID | severe_disease | disease_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3062 | Nasal Vestibule | Swab | 2516 | 44 | 14987 | 0 | 16 | Firmicutes | 3062 | False | persistent | 
| 1 | 3094 | Nasal Vestibule | Swab | 103 | 0 | 1397 | 0 | 15 | Firmicutes | 3094 | False | persistent | 
| 2 | 3095 | Nasal Vestibule | Swab | 1474 | 0 | 5510 | 29 | 21 | Firmicutes | 3095 | False | persistent | 
| 3 | 3115 | Nasal Vestibule | Swab | 0 | 0 | 5480 | 0 | 2 | Firmicutes | 3115 | True | typical | 
| 4 | 3116 | Nasal Vestibule | Swab | 2 | 0 | 2324 | 1 | 4 | Firmicutes | 3116 | True | typical | 
Q4: Calculate the average counts of each phylum by severe_disease.#
q4_severe_means = merged_info.groupby('severe_disease')[PHYLUM_COLS].mean() # SOLUTION
grader.check("q4_servere")
We can also do more advanced things like this:
merged_info.groupby(['Location', 'severe_disease'])[PHYLUM_COLS].aggregate(['mean', 'std'])
| Actinobacteria | Bacteroidetes | Firmicutes | Proteobacteria | ||||||
|---|---|---|---|---|---|---|---|---|---|
| mean | std | mean | std | mean | std | mean | std | ||
| Location | severe_disease | ||||||||
| Ethmoid Culture (Deep to Ethmoid Bulla) | False | 1382.714286 | 875.245435 | 4.142857 | 10.106575 | 2053.142857 | 1314.670989 | 31.571429 | 32.444091 | 
| True | 1625.200000 | 1698.904706 | 2.400000 | 5.366563 | 3743.400000 | 3394.971988 | 260.600000 | 492.952127 | |
| Ethmoid Tissue (Deep to Ethmoid Bulla) | False | 1689.400000 | 1065.430570 | 0.000000 | 0.000000 | 1232.200000 | 722.258403 | 91.400000 | 64.103822 | 
| True | 1149.500000 | 1066.299676 | 1.750000 | 3.500000 | 2319.750000 | 2767.632728 | 93.000000 | 125.078642 | |
| Head of Inferior Turbinate Tissue | False | 394.833333 | 282.110912 | 1.166667 | 2.401388 | 1123.333333 | 343.322395 | 19.500000 | 26.883080 | 
| True | 223.600000 | 219.127360 | 0.000000 | 0.000000 | 3416.600000 | 2021.523510 | 31.000000 | 42.023803 | |
| Maxillary Sinus | False | 2539.500000 | 2260.620379 | 0.000000 | 0.000000 | 2999.000000 | 1224.708945 | 150.000000 | 193.747258 | 
| True | 842.500000 | 1143.391665 | 0.000000 | 0.000000 | 3006.500000 | 3877.066481 | 95.000000 | 134.350288 | |
| Maxillary Sinus Tissue | False | 265.000000 | NaN | 0.000000 | NaN | 448.000000 | NaN | 17.000000 | NaN | 
| True | 1332.500000 | 1827.871029 | 0.000000 | 0.000000 | 4225.000000 | 5392.396313 | 107.000000 | 147.078210 | |
| Middle Meatus | False | 1171.000000 | 747.548527 | 0.333333 | 0.816497 | 2916.166667 | 940.973627 | 19.666667 | 18.062853 | 
| True | 1143.200000 | 1343.776097 | 0.000000 | 0.000000 | 3389.600000 | 1624.755459 | 71.000000 | 97.739450 | |
| Nasal Vestibule | False | 1598.000000 | 1496.139811 | 17.857143 | 32.313199 | 4987.000000 | 4857.272383 | 52.714286 | 119.600326 | 
| True | 142.666667 | 245.375902 | 0.000000 | 0.000000 | 2940.333333 | 2294.448154 | 3.333333 | 4.932883 | |
| Sphenoethmoidal Recess Tissue | False | 1286.000000 | 1863.761385 | 6.200000 | 11.278298 | 1723.600000 | 2203.662020 | 95.000000 | 174.608133 | 
| True | 487.000000 | 427.834080 | 1.250000 | 2.500000 | 2389.750000 | 1640.061660 | 276.750000 | 382.792698 | |
| Sphenoid | False | 1953.428571 | 1362.865958 | 9.142857 | 23.751692 | 3441.000000 | 2253.827337 | 35.142857 | 36.503098 | 
| True | 2004.600000 | 3145.778966 | 0.200000 | 0.447214 | 4353.000000 | 5425.135298 | 118.000000 | 124.715677 | |
| Sphenoid Tissue | False | 1105.000000 | 615.182900 | 0.000000 | 0.000000 | 743.500000 | 57.275649 | 6.500000 | 9.192388 | 
| True | 2101.333333 | 1953.139609 | 3.000000 | 5.196152 | 3217.000000 | 3062.488694 | 212.333333 | 251.573714 | |
| Superior Meatus | False | 2235.000000 | 1402.061102 | 2.000000 | 4.472136 | 4252.285714 | 2796.542014 | 75.000000 | 69.622793 | 
| True | 961.600000 | 847.056846 | 0.000000 | 0.000000 | 2560.200000 | 1857.766186 | 200.800000 | 326.720370 | |
| Uncinate Process Tissue | False | 766.400000 | 718.500383 | 0.000000 | 0.000000 | 1440.200000 | 291.232210 | 18.400000 | 17.700282 | 
| True | 971.400000 | 1580.324745 | 3.400000 | 4.979960 | 3079.000000 | 2892.901658 | 63.000000 | 60.930288 | |
Here I’ve broken things down by body-site and disease status and calculated both a mean and standard deviation. In future lectures we will explore how to quantify this with a significance test. For now, we’ll leave it as a visual comparison.
Pivoting & Melting Dataframes#
This is a process of reshaping, and optionally summarizing, your data as you convert it between wide and long format.
These techniques are often required for generating different types of plots.
These are best shown by example.
Pivoting#
long -> wide
pd.pivot_table(merged_info,
               index = 'Patient',
               columns = 'Location',
               values = 'Firmicutes',
               aggfunc = 'mean')
| Location | Ethmoid Culture (Deep to Ethmoid Bulla) | Ethmoid Tissue (Deep to Ethmoid Bulla) | Head of Inferior Turbinate Tissue | Maxillary Sinus | Maxillary Sinus Tissue | Middle Meatus | Nasal Vestibule | Sphenoethmoidal Recess Tissue | Sphenoid | Sphenoid Tissue | Superior Meatus | Uncinate Process Tissue | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Patient | ||||||||||||
| 3062 | 2714.0 | NaN | 1746.0 | NaN | NaN | 1822.0 | 14987.0 | NaN | 6125.0 | NaN | 2992.0 | 1210.0 | 
| 3094 | 385.0 | 664.0 | 1089.0 | NaN | NaN | 4316.0 | 1397.0 | 5559.0 | 4648.0 | 784.0 | 3463.0 | 1538.0 | 
| 3095 | 2300.0 | 651.0 | 760.0 | NaN | NaN | 2999.0 | 5510.0 | NaN | 519.0 | 703.0 | 10097.0 | 1244.0 | 
| 3115 | 2168.0 | NaN | 4640.0 | NaN | NaN | 3525.0 | 5480.0 | 1861.0 | 13948.0 | NaN | 982.0 | 1531.0 | 
| 3116 | 9796.0 | 6357.0 | 6261.0 | 5748.0 | 8038.0 | 988.0 | 2324.0 | 3673.0 | 617.0 | 6709.0 | 3866.0 | 7249.0 | 
| 3117 | 1800.0 | 672.0 | 1499.0 | NaN | NaN | 2810.0 | 1017.0 | NaN | 2082.0 | 1954.0 | 3671.0 | 1233.0 | 
| 3118 | 2459.0 | 380.0 | 2962.0 | NaN | NaN | 5253.0 | NaN | 3724.0 | 2322.0 | NaN | 4144.0 | 4925.0 | 
| 3119 | 2494.0 | 1870.0 | 1721.0 | 265.0 | 412.0 | 4372.0 | NaN | 301.0 | 2796.0 | 988.0 | 138.0 | 457.0 | 
| 3120 | 905.0 | 1614.0 | 1136.0 | NaN | NaN | 2342.0 | 515.0 | 415.0 | 2655.0 | NaN | 2037.0 | 1301.0 | 
| 3121 | 3060.0 | NaN | 863.0 | NaN | NaN | 3686.0 | 4295.0 | 1624.0 | 6135.0 | NaN | 4740.0 | NaN | 
| 3123 | 1029.0 | 914.0 | NaN | 3865.0 | NaN | NaN | 5885.0 | 358.0 | 1207.0 | NaN | 4497.0 | NaN | 
| 3124 | 3979.0 | 2318.0 | 1146.0 | 2133.0 | 448.0 | 2332.0 | 2320.0 | 662.0 | 2798.0 | NaN | 1940.0 | 1908.0 | 
This took our “long” data format in which each row represented the observation at a different site of a different person
and converted it into a “wide” data format such that each row is a patient and each column is a number of Firmicutes at a location.
NaNs represent missing information.
We also had to “give up” some information for this transformation … this is only Firmicutes.
One can do this to include more information:
pd.pivot_table(merged_info,
               index = 'Patient',
               columns = 'Location',
               values = ['Actinobacteria', 'Firmicutes'],
               aggfunc = 'mean')
| Actinobacteria | ... | Firmicutes | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Location | Ethmoid Culture (Deep to Ethmoid Bulla) | Ethmoid Tissue (Deep to Ethmoid Bulla) | Head of Inferior Turbinate Tissue | Maxillary Sinus | Maxillary Sinus Tissue | Middle Meatus | Nasal Vestibule | Sphenoethmoidal Recess Tissue | Sphenoid | Sphenoid Tissue | ... | Head of Inferior Turbinate Tissue | Maxillary Sinus | Maxillary Sinus Tissue | Middle Meatus | Nasal Vestibule | Sphenoethmoidal Recess Tissue | Sphenoid | Sphenoid Tissue | Superior Meatus | Uncinate Process Tissue | 
| Patient | |||||||||||||||||||||
| 3062 | 461.0 | NaN | 232.0 | NaN | NaN | 292.0 | 2516.0 | NaN | 623.0 | NaN | ... | 1746.0 | NaN | NaN | 1822.0 | 14987.0 | NaN | 6125.0 | NaN | 2992.0 | 1210.0 | 
| 3094 | 243.0 | 638.0 | 282.0 | NaN | NaN | 1486.0 | 103.0 | 4604.0 | 2690.0 | 1540.0 | ... | 1089.0 | NaN | NaN | 4316.0 | 1397.0 | 5559.0 | 4648.0 | 784.0 | 3463.0 | 1538.0 | 
| 3095 | 1652.0 | 1038.0 | 444.0 | NaN | NaN | 2010.0 | 1474.0 | NaN | 587.0 | 670.0 | ... | 760.0 | NaN | NaN | 2999.0 | 5510.0 | NaN | 519.0 | 703.0 | 10097.0 | 1244.0 | 
| 3115 | 80.0 | NaN | 1.0 | NaN | NaN | 13.0 | 0.0 | 211.0 | 17.0 | NaN | ... | 4640.0 | NaN | NaN | 3525.0 | 5480.0 | 1861.0 | 13948.0 | NaN | 982.0 | 1531.0 | 
| 3116 | 340.0 | 61.0 | 19.0 | 34.0 | 40.0 | 5.0 | 2.0 | 181.0 | 32.0 | 309.0 | ... | 6261.0 | 5748.0 | 8038.0 | 988.0 | 2324.0 | 3673.0 | 617.0 | 6709.0 | 3866.0 | 7249.0 | 
| 3117 | 1546.0 | 1108.0 | 521.0 | NaN | NaN | 965.0 | 426.0 | NaN | 1537.0 | 1812.0 | ... | 1499.0 | NaN | NaN | 2810.0 | 1017.0 | NaN | 2082.0 | 1954.0 | 3671.0 | 1233.0 | 
| 3118 | 1802.0 | 824.0 | 332.0 | NaN | NaN | 1465.0 | NaN | 454.0 | 923.0 | NaN | ... | 2962.0 | NaN | NaN | 5253.0 | NaN | 3724.0 | 2322.0 | NaN | 4144.0 | 4925.0 | 
| 3119 | 4358.0 | 2605.0 | 245.0 | 1651.0 | 2625.0 | 3268.0 | NaN | 1102.0 | 7514.0 | 4183.0 | ... | 1721.0 | 265.0 | 412.0 | 4372.0 | NaN | 301.0 | 2796.0 | 988.0 | 138.0 | 457.0 | 
| 3120 | 2740.0 | 3265.0 | 68.0 | NaN | NaN | 405.0 | 107.0 | 468.0 | 4195.0 | NaN | ... | 1136.0 | NaN | NaN | 2342.0 | 515.0 | 415.0 | 2655.0 | NaN | 2037.0 | 1301.0 | 
| 3121 | 1971.0 | NaN | 452.0 | NaN | NaN | 911.0 | 582.0 | 399.0 | 2816.0 | NaN | ... | 863.0 | NaN | NaN | 3686.0 | 4295.0 | 1624.0 | 6135.0 | NaN | 4740.0 | NaN | 
| 3123 | 1010.0 | 2263.0 | NaN | 941.0 | NaN | NaN | 4186.0 | 226.0 | 861.0 | NaN | ... | NaN | 3865.0 | NaN | NaN | 5885.0 | 358.0 | 1207.0 | NaN | 4497.0 | NaN | 
| 3124 | 1602.0 | 1243.0 | 891.0 | 4138.0 | 265.0 | 1922.0 | 2218.0 | 733.0 | 1902.0 | NaN | ... | 1146.0 | 2133.0 | 448.0 | 2332.0 | 2320.0 | 662.0 | 2798.0 | NaN | 1940.0 | 1908.0 | 
12 rows × 24 columns
But that is usually not a great idea.
Melting#
wide -> long
Our data is part long and part wide (like most real datasets).
In some plotting instances we may want to make it “longer” by having each bacteria be a diffent row instead of a different column.
pd.melt(merged_info,
        id_vars = ['Patient', 'Location'], # The things you want preserved in each row
        value_vars = PHYLUM_COLS, # The columns you want to melt
        var_name = 'Phylum', # The name of the column that will have the value_var name
        value_name = 'Counts') # The name of the column that will have the value
| Patient | Location | Phylum | Counts | |
|---|---|---|---|---|
| 0 | 3062 | Nasal Vestibule | Actinobacteria | 2516 | 
| 1 | 3094 | Nasal Vestibule | Actinobacteria | 103 | 
| 2 | 3095 | Nasal Vestibule | Actinobacteria | 1474 | 
| 3 | 3115 | Nasal Vestibule | Actinobacteria | 0 | 
| 4 | 3116 | Nasal Vestibule | Actinobacteria | 2 | 
| ... | ... | ... | ... | ... | 
| 427 | 3094 | Sphenoid Tissue | Proteobacteria | 13 | 
| 428 | 3095 | Sphenoid Tissue | Proteobacteria | 0 | 
| 429 | 3116 | Sphenoid Tissue | Proteobacteria | 495 | 
| 430 | 3117 | Sphenoid Tissue | Proteobacteria | 129 | 
| 431 | 3119 | Sphenoid Tissue | Proteobacteria | 13 | 
432 rows × 4 columns
We’ll explore these in more detail as we move into plotting next week. This is just a taste.