Welcome Spring 2025 Students!

Walkthrough#

Learning Objectives#

At the end of this learning activity you will be able to:

  • Create categorical comparisons with countplots.

  • Create quantitative comparison plots with seaborn: stripplot, barplot, boxplot with Seaborn.

  • Create correlation style plots with with scatterplot and regplot.

  • Utilize pd.melt to plot wide data with seaborn.

  • Describe bootstapping and confidence intervals.

This week we will continue our exploration of data from a cohort study participants of People Living with HIV (PLwH) here at Drexel.

As we discussed in the introduction, this data collection effort was done to provide a resource for many projects across the fields of HIV, aging, inflammation, neurocognitive impairment, immune function, and unknowable future projects. In this walkthrough we will explore a collection of cytokines and chemokines measured by a Luminex panel of common biomarkers of inflammation. We use this data to look for correlations between cytokine biomarkers and demographic variables.

Documentation#

I HIGHLY recommend exploring and utilizing the documentation for this tool:

When I’m making figures I’ll usually have the documentation open on another tab.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# This is how we normally import seaborn
import seaborn as sns

%matplotlib inline
data = pd.read_csv('cytokine_data.csv')
data.head()
Sex Age isAA egf eotaxin fgfbasic gcsf gmcsf hgf ifnalpha ... mig mip1alpha mip1beta tnfalpha vegf cocaine_use cannabinoid_use neuro_screen_impairment_level bmi years_infected
0 Male 53.0 Checked 65.01 170.20 50.32 117.14 2.51 481.37 110.79 ... 185.29 104.63 151.15 17.61 7.54 True True none 21 18
1 Female 62.0 Checked 232.83 118.23 36.03 215.38 24.53 988.71 66.13 ... 397.24 242.10 230.87 51.22 31.60 True True none 22 16
2 Male 60.0 Checked 84.84 55.27 13.22 14.08 0.48 364.31 78.67 ... 18.63 34.85 68.34 2.48 0.84 False False none 25 16
3 Male 62.0 Checked 24.13 70.18 4.12 14.08 1.33 510.36 118.64 ... 118.63 113.30 49.15 10.93 3.53 True True impaired 29 21
4 Male 54.0 Checked 186.98 69.18 32.56 184.74 12.55 395.87 40.79 ... 140.56 131.83 241.00 32.01 10.81 True True none 26 16

5 rows × 37 columns

Categorical Comparisons#

Counting with countplot#

sns.countplot(data = data, # Pass the Dataframe `data` into the kwarg `data`
              x = 'neuro_screen_impairment_level')
<Axes: xlabel='neuro_screen_impairment_level', ylabel='count'>
../../_images/551eab823686412da182ce1a35d5a1b4aacfa14a411f89f4cfcad07ea3811971.png
sns.countplot(data = data, # Pass the Dataframe `data` into the kwarg `data`
              x = 'neuro_screen_impairment_level',
              hue = 'cocaine_use')
<Axes: xlabel='neuro_screen_impairment_level', ylabel='count'>
../../_images/aa4368fcb128bd498abff99f935f947128b34410ce5f36aef4e4f23177036b19.png

If we wanted to look at multiple columns. It would be wonderful if we could do this:

# This doesn't work

# sns.countplot(data = data,
#               x = 'neuro_screen_impairment_level',
#               hue = ['cocaine_use', 'cannabinoid_use'])

But, we can create a new column that is the combination of the two:

def use_desc(row):
    cocaine = 'Y' if row['cocaine_use'] else 'N'
    cannabinoid = 'Y' if row['cannabinoid_use'] else 'N'
    return cocaine+cannabinoid

data['multi_use'] = data.apply(use_desc, axis=1)

While we’re doing transformations, let’s improve our plots by converting our neuro_screen_impairment_level to a proper ordinal variable.

  • Categorical variables : A set of distinct and seperable groups.

  • Ordinal variables : A set of distinct and seperable groups with a sortable order.

By default, pandas treats all strings as categorical and it sorts them alphabetically. But sometimes, we want to specify a order. We can do that like so.

pd.Categorical(data['neuro_screen_impairment_level'],
               categories = ['none', 'mild', 'impaired'], ordered=True)
['none', 'none', 'none', 'impaired', 'none', ..., 'none', 'mild', 'mild', 'mild', 'mild']
Length: 224
Categories (3, object): ['none' < 'mild' < 'impaired']
data['neuro_screen_ordinal'] = pd.Categorical(data['neuro_screen_impairment_level'],
                                               categories = ['none', 'mild', 'impaired'], ordered=True)
sns.countplot(data = data, # Pass the Dataframe `data` into the kwarg `data`
              x = 'neuro_screen_ordinal', # Seaborn will respect the ordering of our categories.
              hue = 'multi_use')
<Axes: xlabel='neuro_screen_ordinal', ylabel='count'>
../../_images/fa980573b62ec0f82df1f1357d76745c53de62c462ec2959460019d84829085d.png

Visualizing differences across categories with stripplot#

sns.stripplot(data = data,
              x = 'neuro_screen_ordinal',
              y = 'mcp1')
<Axes: xlabel='neuro_screen_ordinal', ylabel='mcp1'>
../../_images/e18406e04c14272f53e8cec4478aa809c4150c316225fffe8bfc9d5da60644cd.png
sns.stripplot(data = data,
              x = 'neuro_screen_ordinal',
              hue = 'Sex',
              y = 'mcp1')
# Try dodge = True
<Axes: xlabel='neuro_screen_ordinal', ylabel='mcp1'>
../../_images/da66129147cf5df4d7517d2812cbef7bd0de6bf8652ab186e04d09e5ad1cea41.png

You can even put plots on top of plots!

# Grab the axis object that was created
ax = sns.stripplot(data = data,
                   x = 'neuro_screen_ordinal',
                   hue = 'Sex',
                   y = 'mcp1',
                   dodge = True)


sns.boxplot(data = data,
              x = 'neuro_screen_ordinal',
              hue = 'Sex',
              y = 'mcp1',
              ax = ax) # Give that axis to the next plot
<Axes: xlabel='neuro_screen_ordinal', ylabel='mcp1'>
../../_images/a09941e9459460f3dfe6ed69bb835580456e0d8719ea543077aaaba5b3674b4f.png

Quantifying the uncertainty of estimates#

When presenting data it is important to rigorously convey the precision and uncertainty of our estimates. I HIGHLY recommend reading the error bar tutorial in seaborn. Below is a quick summary.

The error bars around an estimate of central tendency can show one of two general things: either the range of uncertainty about the estimate or the spread of the underlying data around it. These measures are related: given the same sample size, estimates will be more uncertain when data has a broader spread. But uncertainty will decrease as sample sizes grow, whereas spread will not.

Error bar taxonomy

  • Value Spread: How much each value differs from every other value

    • Standard deviation: Average difference of each point from the mean.

    • Percentile Interval: Middle range that contains some percent of the data.

  • Estimate Uncertainty: What is the reasonable range of my estimate

    • Standard Error: Standard deviation/sqrt(sample_size)

    • Confidence Intervals:

    The nonparametric approach to representing uncertainty uses bootstrapping: a procedure where the dataset is randomly resampled with replacement a number of times, and the estimate is recalculated from each resample. This procedure creates a distribution of statistics approximating the distribution of values that you could have gotten for your estimate if you had a different sample.

Measuring Spread#

Quantifies how much each individual measurement vary around the middle.

# Parametric Assumptions

fig, (par_ax, np_ax) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)

## Plot with parametric assumpitions
sns.barplot(data = data,
            x = 'neuro_screen_ordinal',
            y = 'mcp1',
            estimator = 'mean',
            errorbar = ('sd', 2),
                ax = par_ax) # 95% CI

sns.stripplot(data = data,
              x = 'neuro_screen_ordinal',
              y = 'mcp1',
              alpha=0.5,
              ax=par_ax)
par_ax.set_title('Parametric')

## Plot with non-parametric methods

## Plot with parametric assumpitions
sns.barplot(data = data,
            x = 'neuro_screen_ordinal',
            y = 'mcp1',
            estimator = 'median',
            errorbar = ('pi', 95),
                ax = np_ax) # 95% CI

sns.stripplot(data = data,
              x = 'neuro_screen_ordinal',
              y = 'mcp1',
              alpha=0.5,
              ax=np_ax)

np_ax.set_title('Non-Parametric')

fig.tight_layout()
../../_images/7f9fab85893d521ab60799d72aac22ce41f2d63b26fc77a351777c51472dc37b.png

Measuring Uncertainty#

Quantifies how confident we are about where the middle of the distribution is.

# Parametric Assumptions

fig, (par_ax, np_ax) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)

## Plot with parametric assumpitions
sns.barplot(data = data,
            x = 'neuro_screen_ordinal',
            y = 'mcp1',
            estimator = 'mean',
            errorbar = ('se', 2),
                ax = par_ax)

sns.stripplot(data = data,
              x = 'neuro_screen_ordinal',
              y = 'mcp1',
              alpha=0.25,
              ax=par_ax)
par_ax.set_title('Parametric')

## Plot with non-parametric methods
sns.barplot(data = data,
            x = 'neuro_screen_ordinal',
            y = 'mcp1',
            estimator = 'median',
            errorbar = ('ci', 95),
                ax = np_ax)

sns.stripplot(data = data,
              x = 'neuro_screen_ordinal',
              y = 'mcp1',
              alpha=0.25,
              ax=np_ax)

np_ax.set_title('Non-Parametric')

fig.tight_layout()
../../_images/9ec90e1bb748322eeec6c2f010e1e974fbf7f6dfb64bad683a61f47dff2a3a5f.png

With hundreds of data points, we are very confident and precise about our estimate of the middle of the distribution.

It is important to understand the effect of sample size on spread and uncertainty. Below is a simulation that picks small subsets of the data and generates the same plot. Before you generate it; which do you think will change with sample-size and which will stay the same?

def overlapped_plot(ax, df, y = 'mcp1', mode = 'spread'):
    "Create overlapped plot with different confidence assumptions"

    if mode == 'spread':
        # Use Standard Deviation
        eb = ('sd', 2)
    elif mode == 'uncertainty':
        # Use Standard error
        eb = ('se', 2)
    else:
        raise ValueError(f'Expected spread or uncertainty, got: {mode}')

    sns.barplot(data = df,
                x = 'neuro_screen_ordinal',
                y = y,
                errorbar = eb,
                ax = ax)

    sns.stripplot(data = df,
                  x = 'neuro_screen_ordinal',
                  y = y,
                  alpha=0.25,
                  ax=ax)
NBINS = 10

BINS = np.linspace(10, len(data.index), NBINS)
fig, axs = plt.subplots(len(BINS), 2, figsize=(8, 50), sharex=True, sharey=True)

for row, sample_size in enumerate(BINS):
    print('Sampling', int(sample_size), 'samples for figure.')
    ndf = data.sample(n=int(sample_size))

    spread_ax, uncer_ax = axs[row,:]

    ## Plot with parametric assumpitions
    overlapped_plot(spread_ax, ndf, mode='spread')
    spread_ax.set_title(f'Spread: n = {sample_size:0.0f}')

    ## Plot with parametric assumpitions
    overlapped_plot(uncer_ax, ndf, mode='uncertainty')
    uncer_ax.set_title(f'Uncertainty: n = {sample_size:0.0f}')
Sampling 10 samples for figure.
Sampling 33 samples for figure.
Sampling 57 samples for figure.
Sampling 81 samples for figure.
Sampling 105 samples for figure.
Sampling 128 samples for figure.
Sampling 152 samples for figure.
Sampling 176 samples for figure.
Sampling 200 samples for figure.
Sampling 224 samples for figure.
../../_images/7fd97bc98da345a05804a1b488c669b733c7480b5b01d69e5a2a2ca626502ccb.png
# Which is impacted by sample size?
# answer: 'spread', 'uncertainty', 'both', 'neither'
impact_of_sample_size = 'uncertainty' # SOLUTION
grader.check("impact_of_sample_size")

So, how are the two measures impacted by sample size?

  • Spread : Sample size has no effect on spread.

  • Uncertainty : Sample size reduces uncertainty.

This type of categorical visualization is limited in that it only considers a single summary statistic. It ignores the “shape” of the data. We can use histograms to explore this.

Comparing Distributions#

Seaborn makes simple graphs easy.

sns.histplot(data = data,
             x = 'mcp1',
             hue = 'neuro_screen_ordinal')
<Axes: xlabel='mcp1', ylabel='Count'>
../../_images/1826a852a744e3fce82ae551efffc99b6712e1afb657d4834a743bd249389ddd.png

If we don’t like overlapping plots we can make it like a heatmap where the color indicates the count.

sns.histplot(data = data,
             x = 'mcp1',
             y  = 'neuro_screen_ordinal',
             cbar = True)
<Axes: xlabel='mcp1', ylabel='neuro_screen_ordinal'>
../../_images/1141d3a49959c977d255d7bfae9ece6012af71102debc71bcb4e829d334a911b.png

Aggregate statistic to compute in each bin.

  • count: show the number of observations in each bin

  • frequency: show the number of observations divided by the bin width

  • probability or proportion: normalize such that bar heights sum to 1

  • percent: normalize such that bar heights sum to 100

  • density: normalize such that the total area of the histogram equals 1

sns.histplot(data = data,
             x = 'mcp1',
             hue = 'neuro_screen_ordinal',
             stat = 'percent',
             common_norm=False) # Each group should have its own "sum to 100"
# hue_order = ['none', 'mild']
<Axes: xlabel='mcp1', ylabel='Percent'>
../../_images/1d61e048c5e2eb7c99699a7d9af073639c6e65e82a76184cf55bbed34ba815d0.png

Measuring Correlation#

The previous figures measured one (or more) categorical variables and a single quantitative variable. If we want to understand how correlated any two quantitative variables are to each other.

sns.scatterplot(data = data,
                x = 'mcp1', 
                y = 'tnfalpha')
<Axes: xlabel='mcp1', ylabel='tnfalpha'>
../../_images/33bb7b98f775b3a06cf701117b8bdf46a810ef9e2029c5a75920e2016d2186e5.png

We can also use a regression plot to draw a best fit line with a confidence interval indicated by a shadow.

sns.regplot(data = data,
            x = 'mcp1',
            y = 'tnfalpha')
<Axes: xlabel='mcp1', ylabel='tnfalpha'>
../../_images/a472026b363968d69773e7b1988b91085afe378b501d4ee14c07e42a27f46d25.png

Figure Level Interface#

All of the code we’ve been running above generates a single figure. The functions below automate generating multiple figures across different facets of the data. This is an incredibly useful way to explore relationships in your data.

Categorical with catplot#

sns.catplot(data = data,
            x = 'neuro_screen_ordinal',
            y = 'mcp1',
            col = 'multi_use')

# Try:
# col_wrap = 2
# kind = 'bar', box, etc
# Other groupings across row and hue
#  Sex, isAA
#  pd.cut(data['Age'], [18, 30, 50, 100], labels=['young', 'adult', 'aged'])
#  pd.cut(data['tnfalpha'], 3, labels=['low', 'med', 'high'])
<seaborn.axisgrid.FacetGrid at 0x7f0d1f55fa60>
../../_images/6a01b0cc2e7190029a4256c4f5065d071498eb549facd1bdf5b137e84feb700e.png

Each of the plots splits the data into a different categorical subset, then only plots the data within that set. Use the examples in the comments to explore ways to visualize the data.

Relational with relplot#

sns.relplot(data = data,
            x = 'tnfalpha',
            y = 'mcp1',
            col = 'multi_use')

# Try:
# col_wrap = 2
# kind = 'bar', box, etc
# Other groupings across row and hue
#  Sex, isAA
#  pd.cut(data['Age'], [18, 30, 50, 100], labels=['young', 'adult', 'aged'])
#  pd.cut(data['tnfalpha'], 3, labels=['low', 'med', 'high'])
<seaborn.axisgrid.FacetGrid at 0x7f0d1f2f3b20>
../../_images/3432734424e6dd3fb01e315c3197aabc48bda892207fafb187637b0f3ee819ae.png

Linear model regression plots with lmplot#

sns.lmplot(data = data,
            x = 'tnfalpha',
            y = 'mcp1',
            col = 'multi_use')
# Try:
# col_wrap = 2
# kind = 'bar', box, etc
# Other groupings across row and hue
#  Sex, isAA
#  pd.cut(data['Age'], [18, 30, 50, 100], labels=['young', 'adult', 'aged'])
#  pd.cut(data['tnfalpha'], 3, labels=['low', 'med', 'high'])
<seaborn.axisgrid.FacetGrid at 0x7f0d1d5d6760>
../../_images/ad4565986b77a72d0085e30551c9d322f18ae790571dfa42bece70d168838784.png

Plotting Multiple Columns#

You’ll probably notice that all of the graphs so far have visualized a single cytokine. What if we wanted multiple cytokines in the same figure?

# This doesn't work
# sns.stripplot(data = data,
#               x = 'neuro_screen_category',
#               hue = 'Sex',
#               y = ['mcp1', 'tnfalpha'],
#               dodge = True)

I’m pd.melting#

Seaborn only plots long data. We need to convert our wide data into long data first. This is melting.

data.head()
Sex Age isAA egf eotaxin fgfbasic gcsf gmcsf hgf ifnalpha ... mip1beta tnfalpha vegf cocaine_use cannabinoid_use neuro_screen_impairment_level bmi years_infected multi_use neuro_screen_ordinal
0 Male 53.0 Checked 65.01 170.20 50.32 117.14 2.51 481.37 110.79 ... 151.15 17.61 7.54 True True none 21 18 YY none
1 Female 62.0 Checked 232.83 118.23 36.03 215.38 24.53 988.71 66.13 ... 230.87 51.22 31.60 True True none 22 16 YY none
2 Male 60.0 Checked 84.84 55.27 13.22 14.08 0.48 364.31 78.67 ... 68.34 2.48 0.84 False False none 25 16 NN none
3 Male 62.0 Checked 24.13 70.18 4.12 14.08 1.33 510.36 118.64 ... 49.15 10.93 3.53 True True impaired 29 21 YY impaired
4 Male 54.0 Checked 186.98 69.18 32.56 184.74 12.55 395.87 40.79 ... 241.00 32.01 10.81 True True none 26 16 YY none

5 rows × 39 columns

melted_data = pd.melt(data,
                      id_vars = ['Sex', 'Age', 'neuro_screen_ordinal', 'multi_use'], # What we want replicated across all rows of the same sample
                      value_vars = ['mcp1', 'tnfalpha', 'il6'], # What columns we want to melt into rows
                      value_name = 'concentration', # The name of the value column after melting
                      var_name = 'cytokine') # The name of the variable column after melting
melted_data
Sex Age neuro_screen_ordinal multi_use cytokine concentration
0 Male 53.0 none YY mcp1 468.72
1 Female 62.0 none YY mcp1 591.70
2 Male 60.0 none NN mcp1 132.80
3 Male 62.0 impaired YY mcp1 816.71
4 Male 54.0 none YY mcp1 414.97
... ... ... ... ... ... ...
667 Male 44.0 none NN il6 16.69
668 Male 59.0 mild YY il6 15.55
669 Male 63.0 mild YY il6 11.94
670 Male 41.0 mild NY il6 12.48
671 Male 43.0 mild NN il6 2.40

672 rows × 6 columns

sns.catplot(data = melted_data,
            x = 'neuro_screen_ordinal',
            y = 'concentration',
            col = 'cytokine',
            hue = 'Sex',
            sharey=False,
            kind = 'bar')
<seaborn.axisgrid.FacetGrid at 0x7f0d1d4514f0>
../../_images/8551bad0ba6a54312c21ebaa36d0192f284e8fc40fb7390833dc2f9e498cb56e.png

Whether to use melting data and figure-level seaborn functions or using plt.subplots and then doing axis-level figure generation is a matter of personal perference and the situation.