Walkthrough#
Learning Objectives#
At the end of this learning activity you will be able to:
Create categorical comparisons with countplots.
Create quantitative comparison plots with seaborn: stripplot, barplot, and boxplot.
Create correlation-style plots with scatterplot and regplot.
Utilize pd.melt to plot wide data with seaborn.
Describe bootstrapping and confidence intervals.
This week we will continue our exploration of data from a cohort study of People Living with HIV (PLwH) here at Drexel.
As we discussed in the introduction, this data collection effort was done to provide a resource for many projects across the fields of HIV, aging, inflammation, neurocognitive impairment, immune function, and unknowable future projects. In this walkthrough we will explore a collection of cytokines and chemokines measured by a Luminex panel of common biomarkers of inflammation. We use this data to look for correlations between cytokine biomarkers and demographic variables.
Documentation#
I HIGHLY recommend exploring and utilizing the documentation for this tool:
Gallery of great examples: https://seaborn.pydata.org/examples/index.html
Function documentation (we will be using the Function Interface): https://seaborn.pydata.org/api.html#function-interface
Tutorial: https://seaborn.pydata.org/tutorial.html
When I’m making figures I’ll usually have the documentation open on another tab.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# This is how we normally import seaborn
import seaborn as sns
%matplotlib inline
data = pd.read_csv('cytokine_data.csv')
data.head()
Sex | Age | isAA | egf | eotaxin | fgfbasic | gcsf | gmcsf | hgf | ifnalpha | ... | mig | mip1alpha | mip1beta | tnfalpha | vegf | cocaine_use | cannabinoid_use | neuro_screen_impairment_level | bmi | years_infected | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Male | 53.0 | Checked | 65.01 | 170.20 | 50.32 | 117.14 | 2.51 | 481.37 | 110.79 | ... | 185.29 | 104.63 | 151.15 | 17.61 | 7.54 | True | True | none | 21 | 18 |
1 | Female | 62.0 | Checked | 232.83 | 118.23 | 36.03 | 215.38 | 24.53 | 988.71 | 66.13 | ... | 397.24 | 242.10 | 230.87 | 51.22 | 31.60 | True | True | none | 22 | 16 |
2 | Male | 60.0 | Checked | 84.84 | 55.27 | 13.22 | 14.08 | 0.48 | 364.31 | 78.67 | ... | 18.63 | 34.85 | 68.34 | 2.48 | 0.84 | False | False | none | 25 | 16 |
3 | Male | 62.0 | Checked | 24.13 | 70.18 | 4.12 | 14.08 | 1.33 | 510.36 | 118.64 | ... | 118.63 | 113.30 | 49.15 | 10.93 | 3.53 | True | True | impaired | 29 | 21 |
4 | Male | 54.0 | Checked | 186.98 | 69.18 | 32.56 | 184.74 | 12.55 | 395.87 | 40.79 | ... | 140.56 | 131.83 | 241.00 | 32.01 | 10.81 | True | True | none | 26 | 16 |
5 rows × 37 columns
Categorical Comparisons#
Counting with countplot#
sns.countplot(data = data, # Pass the Dataframe `data` into the kwarg `data`
x = 'neuro_screen_impairment_level')
<Axes: xlabel='neuro_screen_impairment_level', ylabel='count'>
sns.countplot(data = data, # Pass the Dataframe `data` into the kwarg `data`
x = 'neuro_screen_impairment_level',
hue = 'cocaine_use')
<Axes: xlabel='neuro_screen_impairment_level', ylabel='count'>
If we wanted to look at multiple columns, it would be wonderful if we could do this:
# This doesn't work
# sns.countplot(data = data,
# x = 'neuro_screen_impairment_level',
# hue = ['cocaine_use', 'cannabinoid_use'])
But, we can create a new column that is the combination of the two:
def use_desc(row):
    "Encode cocaine and cannabinoid use as a two-letter code, e.g. 'YN'"
    cocaine = 'Y' if row['cocaine_use'] else 'N'
    cannabinoid = 'Y' if row['cannabinoid_use'] else 'N'
    return cocaine + cannabinoid

data['multi_use'] = data.apply(use_desc, axis=1) # Apply to each row
While we’re doing transformations, let’s improve our plots by converting our neuro_screen_impairment_level
to a proper ordinal variable.
Categorical variables : A set of distinct and separable groups.
Ordinal variables : A set of distinct and separable groups with a sortable order.
By default, pandas treats all strings as categorical and sorts them alphabetically.
But sometimes we want to specify an order. We can do that like so:
pd.Categorical(data['neuro_screen_impairment_level'],
categories = ['none', 'mild', 'impaired'], ordered=True)
['none', 'none', 'none', 'impaired', 'none', ..., 'none', 'mild', 'mild', 'mild', 'mild']
Length: 224
Categories (3, object): ['none' < 'mild' < 'impaired']
data['neuro_screen_ordinal'] = pd.Categorical(data['neuro_screen_impairment_level'],
categories = ['none', 'mild', 'impaired'], ordered=True)
sns.countplot(data = data, # Pass the Dataframe `data` into the kwarg `data`
x = 'neuro_screen_ordinal', # Seaborn will respect the ordering of our categories.
hue = 'multi_use')
<Axes: xlabel='neuro_screen_ordinal', ylabel='count'>
Visualizing differences across categories with stripplot#
sns.stripplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1')
<Axes: xlabel='neuro_screen_ordinal', ylabel='mcp1'>
sns.stripplot(data = data,
x = 'neuro_screen_ordinal',
hue = 'Sex',
y = 'mcp1')
# Try dodge = True
<Axes: xlabel='neuro_screen_ordinal', ylabel='mcp1'>
You can even put plots on top of plots!
# Grab the axis object that was created
ax = sns.stripplot(data = data,
x = 'neuro_screen_ordinal',
hue = 'Sex',
y = 'mcp1',
dodge = True)
sns.boxplot(data = data,
x = 'neuro_screen_ordinal',
hue = 'Sex',
y = 'mcp1',
ax = ax) # Give that axis to the next plot
<Axes: xlabel='neuro_screen_ordinal', ylabel='mcp1'>
Quantifying the uncertainty of estimates#
When presenting data it is important to rigorously convey the precision and uncertainty of our estimates. I HIGHLY recommend reading the error bar tutorial in seaborn. Below is a quick summary.
The error bars around an estimate of central tendency can show one of two general things: either the range of uncertainty about the estimate or the spread of the underlying data around it. These measures are related: given the same sample size, estimates will be more uncertain when data has a broader spread. But uncertainty will decrease as sample sizes grow, whereas spread will not.
Value Spread: How much each value differs from every other value
Standard deviation: Average difference of each point from the mean.
Percentile Interval: Middle range that contains some percent of the data.
Estimate Uncertainty: What is the reasonable range of my estimate?
Standard Error: Standard deviation / sqrt(sample_size)
Confidence Intervals: The nonparametric approach to representing uncertainty uses bootstrapping: a procedure where the dataset is randomly resampled with replacement a number of times, and the estimate is recalculated from each resample. This procedure creates a distribution of statistics approximating the distribution of values that you could have gotten for your estimate if you had a different sample.
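To make this concrete, here is a minimal sketch of bootstrapping a 95% confidence interval for the mean of mcp1 by hand with numpy (the names rng and boot_means are just for illustration). This is essentially what seaborn does for us when we ask for a confidence-interval error bar.
# Minimal bootstrap sketch: resample with replacement, recompute the mean each time
rng = np.random.default_rng(42)
values = data['mcp1'].dropna().to_numpy()
boot_means = [rng.choice(values, size=len(values), replace=True).mean()
              for _ in range(1000)]
# The middle 95% of the bootstrap distribution gives a 95% confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f'Mean: {values.mean():.1f}, 95% bootstrap CI: ({ci_low:.1f}, {ci_high:.1f})')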
Measuring Spread#
Quantifies how much each individual measurement varies around the middle.
# Parametric Assumptions
fig, (par_ax, np_ax) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
## Plot with parametric assumptions
sns.barplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1',
estimator = 'mean',
errorbar = ('sd', 2), # ±2 standard deviations (~95% of the data if normal)
ax = par_ax)
sns.stripplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1',
alpha=0.5,
ax=par_ax)
par_ax.set_title('Parametric')
## Plot with non-parametric methods
sns.barplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1',
estimator = 'median',
errorbar = ('pi', 95), # 95% percentile interval of the data
ax = np_ax)
sns.stripplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1',
alpha=0.5,
ax=np_ax)
np_ax.set_title('Non-Parametric')
fig.tight_layout()
Measuring Uncertainty#
Quantifies how confident we are about where the middle of the distribution is.
# Parametric Assumptions
fig, (par_ax, np_ax) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
## Plot with parametric assumptions
sns.barplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1',
estimator = 'mean',
errorbar = ('se', 2),
ax = par_ax)
sns.stripplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1',
alpha=0.25,
ax=par_ax)
par_ax.set_title('Parametric')
## Plot with non-parametric methods
sns.barplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1',
estimator = 'median',
errorbar = ('ci', 95),
ax = np_ax)
sns.stripplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1',
alpha=0.25,
ax=np_ax)
np_ax.set_title('Non-Parametric')
fig.tight_layout()
With hundreds of data points, we are very confident and precise about our estimate of the middle of the distribution.
It is important to understand the effect of sample size on spread and uncertainty. Below is a simulation that picks small subsets of the data and generates the same plot. Before you generate it: which do you think will change with sample size, and which will stay the same?
def overlapped_plot(ax, df, y = 'mcp1', mode = 'spread'):
    "Create overlapped plot with different confidence assumptions"
    if mode == 'spread':
        # Use standard deviation
        eb = ('sd', 2)
    elif mode == 'uncertainty':
        # Use standard error
        eb = ('se', 2)
    else:
        raise ValueError(f'Expected spread or uncertainty, got: {mode}')
    sns.barplot(data = df,
                x = 'neuro_screen_ordinal',
                y = y,
                errorbar = eb,
                ax = ax)
    sns.stripplot(data = df,
                  x = 'neuro_screen_ordinal',
                  y = y,
                  alpha=0.25,
                  ax=ax)
NBINS = 10
BINS = np.linspace(10, len(data.index), NBINS)
fig, axs = plt.subplots(len(BINS), 2, figsize=(8, 50), sharex=True, sharey=True)
for row, sample_size in enumerate(BINS):
    print('Sampling', int(sample_size), 'samples for figure.')
    ndf = data.sample(n=int(sample_size))
    spread_ax, uncer_ax = axs[row,:]
    ## Plot showing spread (standard deviation)
    overlapped_plot(spread_ax, ndf, mode='spread')
    spread_ax.set_title(f'Spread: n = {sample_size:0.0f}')
    ## Plot showing uncertainty (standard error)
    overlapped_plot(uncer_ax, ndf, mode='uncertainty')
    uncer_ax.set_title(f'Uncertainty: n = {sample_size:0.0f}')
Sampling 10 samples for figure.
Sampling 33 samples for figure.
Sampling 57 samples for figure.
Sampling 81 samples for figure.
Sampling 105 samples for figure.
Sampling 128 samples for figure.
Sampling 152 samples for figure.
Sampling 176 samples for figure.
Sampling 200 samples for figure.
Sampling 224 samples for figure.
# Which is impacted by sample size?
# answer: 'spread', 'uncertainty', 'both', 'neither'
impact_of_sample_size = 'uncertainty' # SOLUTION
grader.check("impact_of_sample_size")
So, how are the two measures impacted by sample size?
Spread : Sample size has no effect on spread.
Uncertainty : Sample size reduces uncertainty.
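As a quick numeric check of this (a sketch assuming the data DataFrame from above), the standard deviation stays roughly the same as the sample grows while the standard error shrinks:
# Compare spread (SD) and uncertainty (SE) at two different sample sizes
for n in [20, 200]:
    sub = data['mcp1'].sample(n=n, random_state=0)
    sd = sub.std()        # spread: roughly stable as n grows
    se = sd / np.sqrt(n)  # uncertainty: shrinks as n grows
    print(f'n={n:>3}  SD={sd:7.1f}  SE={se:6.1f}')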
This type of categorical visualization is limited in that it only considers a single summary statistic. It ignores the “shape” of the data. We can use histograms to explore this.
Comparing Distributions#
Seaborn makes simple graphs easy.
sns.histplot(data = data,
x = 'mcp1',
hue = 'neuro_screen_ordinal')
<Axes: xlabel='mcp1', ylabel='Count'>
If we don’t like overlapping plots, we can make it like a heatmap where the color indicates the count.
sns.histplot(data = data,
x = 'mcp1',
y = 'neuro_screen_ordinal',
cbar = True)
<Axes: xlabel='mcp1', ylabel='neuro_screen_ordinal'>
The stat keyword controls the aggregate statistic computed in each bin:
count: show the number of observations in each bin
frequency: show the number of observations divided by the bin width
probability or proportion: normalize such that bar heights sum to 1
percent: normalize such that bar heights sum to 100
density: normalize such that the total area of the histogram equals 1
sns.histplot(data = data,
x = 'mcp1',
hue = 'neuro_screen_ordinal',
stat = 'percent',
common_norm=False) # Each group should have its own "sum to 100"
# hue_order = ['none', 'mild']
<Axes: xlabel='mcp1', ylabel='Percent'>
Measuring Correlation#
The previous figures measured one (or more) categorical variables and a single quantitative variable. If we want to understand how correlated any two quantitative variables are with each other, we can start with a scatterplot.
sns.scatterplot(data = data,
x = 'mcp1',
y = 'tnfalpha')
<Axes: xlabel='mcp1', ylabel='tnfalpha'>
We can also use a regression plot to draw a best fit line with a confidence interval indicated by a shadow.
sns.regplot(data = data,
x = 'mcp1',
y = 'tnfalpha')
<Axes: xlabel='mcp1', ylabel='tnfalpha'>
Figure Level Interface#
All of the code we’ve been running above generates a single figure. The functions below automate generating multiple figures across different facets of the data. This is an incredibly useful way to explore relationships in your data.
Categorical with catplot#
sns.catplot(data = data,
x = 'neuro_screen_ordinal',
y = 'mcp1',
col = 'multi_use')
# Try:
# col_wrap = 2
# kind = 'bar', box, etc
# Other groupings across row and hue
# Sex, isAA
# pd.cut(data['Age'], [18, 30, 50, 100], labels=['young', 'adult', 'aged'])
# pd.cut(data['tnfalpha'], 3, labels=['low', 'med', 'high'])
<seaborn.axisgrid.FacetGrid at 0x7f0d1f55fa60>
Each of the plots splits the data into a different categorical subset, then only plots the data within that set. Use the examples in the comments to explore ways to visualize the data.
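For example, one of the suggested groupings above bins Age with pd.cut; here is a sketch of how that might look (the age_group column name is just for illustration):
# Bin Age into ordered groups and facet the categorical plot by them
data['age_group'] = pd.cut(data['Age'], [18, 30, 50, 100],
                           labels=['young', 'adult', 'aged'])
sns.catplot(data = data,
            x = 'neuro_screen_ordinal',
            y = 'mcp1',
            col = 'age_group',
            kind = 'box')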
Relational with relplot#
sns.relplot(data = data,
x = 'tnfalpha',
y = 'mcp1',
col = 'multi_use')
# Try:
# col_wrap = 2
# kind = 'line'
# Other groupings across row and hue
# Sex, isAA
# pd.cut(data['Age'], [18, 30, 50, 100], labels=['young', 'adult', 'aged'])
# pd.cut(data['tnfalpha'], 3, labels=['low', 'med', 'high'])
<seaborn.axisgrid.FacetGrid at 0x7f0d1f2f3b20>
Linear model regression plots with lmplot#
sns.lmplot(data = data,
x = 'tnfalpha',
y = 'mcp1',
col = 'multi_use')
# Try:
# col_wrap = 2
# lowess = True
# Other groupings across row and hue
# Sex, isAA
# pd.cut(data['Age'], [18, 30, 50, 100], labels=['young', 'adult', 'aged'])
# pd.cut(data['tnfalpha'], 3, labels=['low', 'med', 'high'])
<seaborn.axisgrid.FacetGrid at 0x7f0d1d5d6760>
Plotting Multiple Columns#
You’ll probably notice that all of the graphs so far have visualized a single cytokine. What if we wanted multiple cytokines in the same figure?
# This doesn't work
# sns.stripplot(data = data,
# x = 'neuro_screen_ordinal',
# hue = 'Sex',
# y = ['mcp1', 'tnfalpha'],
# dodge = True)
I’m pd.melting#
Seaborn expects long-format data, so we need to convert our wide data into long data first. This conversion is called melting.
data.head()
Sex | Age | isAA | egf | eotaxin | fgfbasic | gcsf | gmcsf | hgf | ifnalpha | ... | mip1beta | tnfalpha | vegf | cocaine_use | cannabinoid_use | neuro_screen_impairment_level | bmi | years_infected | multi_use | neuro_screen_ordinal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Male | 53.0 | Checked | 65.01 | 170.20 | 50.32 | 117.14 | 2.51 | 481.37 | 110.79 | ... | 151.15 | 17.61 | 7.54 | True | True | none | 21 | 18 | YY | none |
1 | Female | 62.0 | Checked | 232.83 | 118.23 | 36.03 | 215.38 | 24.53 | 988.71 | 66.13 | ... | 230.87 | 51.22 | 31.60 | True | True | none | 22 | 16 | YY | none |
2 | Male | 60.0 | Checked | 84.84 | 55.27 | 13.22 | 14.08 | 0.48 | 364.31 | 78.67 | ... | 68.34 | 2.48 | 0.84 | False | False | none | 25 | 16 | NN | none |
3 | Male | 62.0 | Checked | 24.13 | 70.18 | 4.12 | 14.08 | 1.33 | 510.36 | 118.64 | ... | 49.15 | 10.93 | 3.53 | True | True | impaired | 29 | 21 | YY | impaired |
4 | Male | 54.0 | Checked | 186.98 | 69.18 | 32.56 | 184.74 | 12.55 | 395.87 | 40.79 | ... | 241.00 | 32.01 | 10.81 | True | True | none | 26 | 16 | YY | none |
5 rows × 39 columns
melted_data = pd.melt(data,
id_vars = ['Sex', 'Age', 'neuro_screen_ordinal', 'multi_use'], # What we want replicated across all rows of the same sample
value_vars = ['mcp1', 'tnfalpha', 'il6'], # What columns we want to melt into rows
value_name = 'concentration', # The name of the value column after melting
var_name = 'cytokine') # The name of the variable column after melting
melted_data
Sex | Age | neuro_screen_ordinal | multi_use | cytokine | concentration | |
---|---|---|---|---|---|---|
0 | Male | 53.0 | none | YY | mcp1 | 468.72 |
1 | Female | 62.0 | none | YY | mcp1 | 591.70 |
2 | Male | 60.0 | none | NN | mcp1 | 132.80 |
3 | Male | 62.0 | impaired | YY | mcp1 | 816.71 |
4 | Male | 54.0 | none | YY | mcp1 | 414.97 |
... | ... | ... | ... | ... | ... | ... |
667 | Male | 44.0 | none | NN | il6 | 16.69 |
668 | Male | 59.0 | mild | YY | il6 | 15.55 |
669 | Male | 63.0 | mild | YY | il6 | 11.94 |
670 | Male | 41.0 | mild | NY | il6 | 12.48 |
671 | Male | 43.0 | mild | NN | il6 | 2.40 |
672 rows × 6 columns
sns.catplot(data = melted_data,
x = 'neuro_screen_ordinal',
y = 'concentration',
col = 'cytokine',
hue = 'Sex',
sharey=False,
kind = 'bar')
<seaborn.axisgrid.FacetGrid at 0x7f0d1d4514f0>
Whether to melt the data and use figure-level seaborn functions, or to use plt.subplots and axis-level figure generation, is a matter of personal preference and the situation.
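For comparison, here is a sketch of the axis-level alternative: the same three cytokines plotted side by side with plt.subplots and barplot, no melting required.
# Axis-level alternative: one subplot per cytokine, plotting directly from the wide data
cytokines = ['mcp1', 'tnfalpha', 'il6']
fig, axs = plt.subplots(1, len(cytokines), figsize=(12, 4))
for ax, cyto in zip(axs, cytokines):
    sns.barplot(data = data,
                x = 'neuro_screen_ordinal',
                y = cyto,
                hue = 'Sex',
                ax = ax)
    ax.set_title(cyto)
fig.tight_layout()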