Walkthrough#
Learning Objectives#
At the end of this learning activity you will be able to:
Relate the null hypotheses of chi2 tests, t-tests, ANOVAs, and Kruskal-Wallis H-tests.
Calculate cross-tabulations of two categorical variables using `pd.crosstab`.
Employ `pg.chi2_independence` to estimate the correlation between two categorical variables.
Describe how to test whether a distribution is normally distributed using multiple methods.
Employ `pg.ttest`, `pg.anova`, and `pg.kruskal` to look for differences in a dependent variable between groups of a categorical variable.
Utilize post-hoc tests to refine the answer of an ANOVA.
Introduction#
As we discussed with Dr. Devlin in the introduction video, this week and next we are going to look at HIV neurocognitive impairment data from a cohort here at Drexel. Each patient was given a full-scale neuropsychological exam and the resulting values were aggregated and normalized into Z-scores based on demographically matched healthy individuals.
In this walkthrough we will explore the demographic variables within this dataset to practice different types of hypothesis testing. In the lab you will dive deeper and explore the effects of antiretroviral medications on neurological impairment. In this cohort, we have two major drug regimens, d4T (Stavudine) and the newer Emtricitabine/tenofovir (Truvada). The older Stavudine is suspected to have neurotoxic effects that are not found in the newer Truvada. We will use inferential statistics to understand this effect.
Hypothesis Testing#
Hypothesis testing is a type of inferential statistics in which one compares a distribution to a previously assumed null model that represents the simplest explanation of the world. The process of the test then involves rigorously evaluating the likelihood of observing this dataset under that null assumption. If, after rigorous evaluation, we find that the likelihood of generating this dataset under the null model is below some threshold (i.e. p < 0.05), we reject the null hypothesis and accept the alternative.
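To build intuition, here is a toy simulation sketch (the coin-flip numbers are made up and are not from this dataset): assume a fair-coin null model and ask how often it produces data at least as extreme as what we observed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observation: 62 heads out of 100 coin flips
observed_heads, n_flips, n_sims = 62, 100, 100_000

# Generate many datasets under the null model (a fair coin)
null_heads = rng.binomial(n=n_flips, p=0.5, size=n_sims)

# p-value: fraction of null datasets at least as extreme as the observation
p_value = (null_heads >= observed_heads).mean()
print(f"p = {p_value:.4f}")  # ~0.01, so we would reject the null at p < 0.05
```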
While one can test arbitrarily complicated hypotheses, in biology we tend to fall back on a few go-to tests:
2-factor tests (chi2-tests)
Tests of means (t-tests, ANOVAs, etc.)
Correlations (linear regression)
We’ll cover the first two this week, and the last next week.
When boiled down, all of these tests rigorously balance four things (the sketch below shows how they trade off):
Effect size: the magnitude of the observed difference divided by the spread of the observations.
Type 1 Error (false-positives, p-value): the likelihood of incorrectly rejecting the null hypothesis.
Type 2 Error (false-negatives): the likelihood of incorrectly accepting the null hypothesis; power is one minus this probability.
Number of observations.
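Given any three of these quantities, the fourth is determined. A minimal sketch using pingouin's `pg.power_ttest` (the effect size and sample sizes here are made up for illustration):

```python
import pingouin as pg

# Solve for power, given effect size (d), n per group, and alpha
print(pg.power_ttest(d=0.5, n=30, alpha=0.05))        # ~0.48

# Solve for the n per group needed to reach 80% power
print(pg.power_ttest(d=0.5, power=0.80, alpha=0.05))  # ~64
```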
Pingouin#
The `pingouin` Python package is a statistical tool designed to bridge the gap between the data manipulation capabilities of pandas and the visualization strengths of seaborn, with a specific focus on ease of use and accessibility.
It is particularly tailored towards researchers who are embarking on their journey into data exploration and biostatistics, leveraging their familiarity with pandas for data manipulation and seaborn for data visualization.
`pingouin` is built on top of pandas `DataFrame` objects, which means that it integrates seamlessly with the pandas ecosystem.
This allows users to effortlessly prepare their data using pandas and then apply statistical tests and models using pingouin.
The package offers a wide range of statistical functions, including t-tests, ANOVAs, correlation tests, and regression analyses, with a syntax that is more intuitive and less verbose than some of its counterparts.
This makes pingouin an attractive option for biologists who may not have a deep background in statistics but need to conduct sophisticated statistical analyses as part of their research.
Two other popular packages for statistical analysis in Python are `scipy.stats` and `statsmodels`.
Scipy offers many low-level functions for calculating basic statistical tests like t-tests, ANOVAs, Mann-Whitney, etc.
However, these low-level functions require extensive programming knowledge to leverage properly.
Statsmodels offers a wealth of complex statistical modeling, p-value correction tools, and allows arbitrarily complicated regression analysis. However, due to its complexity, it is not approachable for beginners.
Under the hood, Pingouin combines `scipy.stats`, `statsmodels`, and a number of other tools into a simple and intuitive API designed to be user-friendly and accessible.
The tables it returns are interpretable at a glance and collect results from a number of these lower-level tools.
This is intended to make it easier to apply statistical tests and interpret their results without getting bogged down in the complexities of statistical theory.
Additionally, pingouin is designed to work well with Pandas and the Jupyter Notebook environment, which is popular among scientists for interactive data analysis.
Check out the documentation for Pingouin.
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import pingouin as pg
%matplotlib inline
data = pd.read_csv('hiv_neuro_data.csv')
data['education'] = data['education'].astype(float)
data.head()
|   | sex | age | education | race | processing_domain_z | exec_domain_z | language_domain_z | visuospatial_domain_z | learningmemory_domain_z | motor_domain_z | ART | YearsSeropositive |
|---|-----|-----|-----------|------|---------------------|---------------|-------------------|-----------------------|-------------------------|----------------|-----|-------------------|
| 0 | male | 62 | 10.0 | AA | 0.5 | 0.6 | 0.151646 | -1.0 | -1.152131 | -1.364306 | Stavudine | 13 |
| 1 | male | 56 | 10.0 | AA | -0.5 | 1.2 | -0.255505 | -2.0 | -0.086376 | -0.348600 | Truvada | 19 |
| 2 | female | 51 | 10.0 | AA | 0.5 | 0.1 | 0.902004 | -0.4 | -1.139892 | 0.112215 | Stavudine | 9 |
| 3 | female | 47 | 12.0 | AA | -0.6 | -1.2 | -0.119866 | -2.1 | 0.803619 | -2.276768 | Truvada | 24 |
| 4 | male | 46 | 13.0 | AA | -0.4 | 1.3 | 0.079129 | -1.3 | -0.533607 | -0.330541 | Truvada | 14 |
When studying HIV Associated Neurocognitive Disorder (HAND) we measure a number of cognitive domains through a ~90 min test battery. The results of these tests are then aggregated into 6 cognitive domains.
Speed of Information Processing (`processing_domain_z`): How quickly an individual can understand and react to the information being presented. This domain evaluates the speed at which cognitive tasks can be performed, often under time constraints.
Learning and Memory (`learningmemory_domain_z`): This involves both the acquisition of new information (learning) and the ability to retrieve this information over time (memory). It includes both verbal learning and memory (e.g., remembering words or stories) and visual learning and memory (e.g., remembering shapes or patterns).
Executive Function (`exec_domain_z`): The complex cognitive processes required for planning, organizing, problem-solving, abstract thinking, and executing strategies. This domain also encompasses decision-making and cognitive flexibility, which is the ability to switch between thinking about two different concepts or to think about multiple concepts simultaneously.
Motor Skills (`motor_domain_z`): This domain assesses the coordination, speed, and dexterity of motor movements, including both fine motor skills (e.g., hand movements, writing) and gross motor skills (e.g., walking, coordination).
Language Skills (`language_domain_z`): The ability to understand and produce language. While not always as prominently affected in HAND as other cognitive domains, assessments might include naming, fluency, and comprehension tasks.
Sensory-Perceptual Skills (`visuospatial_domain_z`): Although less commonly assessed in HAND research, this domain involves the ability to interpret sensory information, including visual and auditory processing.
The outcomes of these tests are highly correlated with age, sex, race, and education. This makes it difficult to compare the raw results of a test between individuals across these covariates; i.e. is this 65 year old performing worse on a memory task relative to a 25 year old because of their age or because of some type of impairment? With the raw scores alone, this is impossible to say.
However, if we collect data from a large number of presumably healthy individuals we can create a normalization scale. These scores are normalized to a z-scale. On a z-scale, `z = 0` is the mean; all individuals performing at the average of covariate-matched controls will have a score of 0. Each unit of the z-scale is one standard deviation: an individual with `z = 1` is performing one standard deviation above their peers, while `z = -1.5` implies that they are performing 1.5 standard deviations below their matched peers.
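To make that concrete, a minimal sketch with made-up numbers (both the control scores and the patient score below are hypothetical):

```python
import numpy as np

# Hypothetical raw scores from demographically matched healthy controls
control_scores = np.array([52, 48, 55, 50, 47, 51, 49, 53])

# A patient's raw score on the same test (also hypothetical)
patient_score = 42

# z = (observation - control mean) / control standard deviation
z = (patient_score - control_scores.mean()) / control_scores.std(ddof=1)
print(f"z = {z:.2f}")  # negative: performing below matched peers
```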
You’ve probably heard of this before: an IQ score works in a similar way. It uses a standard scale with a mean of 100 and a standard deviation of 15. An average IQ is 100 (by definition), and an IQ of 210 (Marilyn vos Savant's reported score) would be over 7 standard deviations above an average individual.
Categorical comparisons#
Q1: Count the number of participants of each sex and race.#
q1_race_counts = data['race'].value_counts() # SOLUTION
q1_sex_counts = data['sex'].value_counts() # SOLUTION
grader.check("q1_demographic_breakdown")
Once we have the counts, we can change each into a probability. This represents the likelihood of observing any given instance of the categorical variable. We can think of this like an unfair set of dice.
q1_race_counts/q1_race_counts.sum()
race
AA 0.676923
C 0.227692
H 0.095385
Name: count, dtype: float64
q1_sex_counts/q1_sex_counts.sum()
sex
male 0.692308
female 0.307692
Name: count, dtype: float64
Let’s do the same thing for the ART column.
art_counts = data['ART'].value_counts()
art_counts/art_counts.sum()
ART
Stavudine 0.52
Truvada 0.48
Name: count, dtype: float64
I wonder if there is a relationship between demographic variables and ART therapy?
That was looking at each measurement individually; if we want to know whether they are correlated, we need a different type of table.
We need a table that measures the number of occurrences of `male & Stavudine`, `female & Stavudine`, `male & Truvada`, `female & Truvada`, etc.
This is often called a cross tabulation.
Traditionally one would use a set of `for` loops to count this data (see the sketch below), but Pandas has a tool for exactly this purpose.
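For reference, the manual counting might look something like this sketch (purely illustrative; `pd.crosstab` below does the same thing in one call):

```python
from collections import Counter

# Manual cross tabulation: count every (sex, ART) pair with a loop
pair_counts = Counter(zip(data['sex'], data['ART']))
print(pair_counts)  # e.g. Counter({('male', 'Stavudine'): 116, ...})
```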
pd.crosstab(data['sex'], data['ART'], margins=True)
| sex \ ART | Stavudine | Truvada | All |
|-----------|-----------|---------|-----|
| female | 53 | 47 | 100 |
| male | 116 | 109 | 225 |
| All | 169 | 156 | 325 |
It can also give us frequencies.
pd.crosstab(data['race'], data['ART'], margins=True, normalize='all')
| race \ ART | Stavudine | Truvada | All |
|------------|-----------|---------|-----|
| AA | 0.313846 | 0.363077 | 0.676923 |
| C | 0.153846 | 0.073846 | 0.227692 |
| H | 0.052308 | 0.043077 | 0.095385 |
| All | 0.520000 | 0.480000 | 1.000000 |
Hypothesis Testing#
When testing the correlation between two categorical variables we want to use a chi2 test.
It has the null hypothesis that the two categorical variables are independent of each other.
Rejection of that hypothesis (i.e. p < 0.05) indicates there is some linkage between the two distributions.
Notes:
In your cross tabulation table you need at least 5 observations in each cell; with fewer than that, the chi2 approximation becomes unreliable and the test loses statistical power.
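If you are curious, the same test can be run directly on the cross tabulation with scipy; a minimal sketch (note that `chi2_contingency` wants the raw table, built without `margins=True`):

```python
from scipy.stats import chi2_contingency

# chi2_contingency expects the raw contingency table (no margins)
table = pd.crosstab(data['sex'], data['ART'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.4f}, p = {p:.4f}, dof = {dof}")
```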
When we want to visualize a categorical comparison, it is best to use a countplot that uses both `x` and `hue` to show the two factors.
If there is a difference, we will see that the bars have inconsistent relative heights when compared to each other.
sns.countplot(data = data,
x = 'sex',
hue = 'ART')
<Axes: xlabel='sex', ylabel='count'>
Pingouin has a function for performing a chi2 test from a `pd.DataFrame` directly: `pg.chi2_independence`.
It will return three things: the expected frequencies under independence, the observed frequencies (after correction), and a table of test statistics.
# Pingouin does the crosstab internally, it just wants the df and columns
expected, obs_cor, stats = pg.chi2_independence(data, 'sex', 'ART')
If these two variables were randomly associated, what would I have gotten?
expected
| sex \ ART | Stavudine | Truvada |
|-----------|-----------|---------|
| female | 52.0 | 48.0 |
| male | 117.0 | 108.0 |
What did I actually observe after correction?
obs_cor
| sex \ ART | Stavudine | Truvada |
|-----------|-----------|---------|
| female | 52.5 | 47.5 |
| male | 116.5 | 108.5 |
What is the likelihood that `obs_cor` is different from `expected`?
stats
|   | test | lambda | chi2 | dof | pval | cramer | power |
|---|------|--------|------|-----|------|--------|-------|
| 0 | pearson | 1.000000 | 0.014468 | 1.0 | 0.904260 | 0.006672 | 0.051659 |
| 1 | cressie-read | 0.666667 | 0.014468 | 1.0 | 0.904258 | 0.006672 | 0.051659 |
| 2 | log-likelihood | 0.000000 | 0.014470 | 1.0 | 0.904253 | 0.006673 | 0.051659 |
| 3 | freeman-tukey | -0.500000 | 0.014471 | 1.0 | 0.904249 | 0.006673 | 0.051659 |
| 4 | mod-log-likelihood | -1.000000 | 0.014472 | 1.0 | 0.904244 | 0.006673 | 0.051659 |
| 5 | neyman | -2.000000 | 0.014475 | 1.0 | 0.904235 | 0.006674 | 0.051660 |
`p ~= 0.9`, therefore we cannot reject the null hypothesis that ART therapy is unrelated to sex.
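As a quick sanity check, the `cramer` column (Cramér's V, the effect size for chi2 tests) can be recomputed from the chi2 statistic as V = sqrt(chi2 / (n * min(r-1, c-1))), where n is the total count and r, c are the table dimensions; a small sketch:

```python
import numpy as np

chi2_stat = stats.loc[0, 'chi2']  # pearson chi2 from the table above
n = len(data)                     # 325 participants
v = np.sqrt(chi2_stat / (n * 1))  # min(r-1, c-1) = 1 for a 2x2 table
print(f"Cramer's V = {v:.6f}")    # matches the 'cramer' column
```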
Continuous comparisons#
In order to choose the right test we need to walk down a set of decision trees to find the appropriate answer. There are a number of places you can find these decision trees (or you can make your own), we’ll use the one from the pingouin tool.
https://pingouin-stats.org/build/html/guidelines.html
The first thing to test: normality.
If the variable is normally distributed (or can be approximated as normal) we can use more powerful parametric tests. If it is not normal, we need to use non-parametric tests.
A `qqplot` is a common way to visualize the fit between an observed distribution and a theoretical one.
It plots the quantiles of the observed data against those of a desired distribution (usually the normal distribution).
If there is strong agreement, it likely means the data fits the distribution.
It is the best place to start when considering normality.
pg.qqplot(data['education'], dist='norm')
<Axes: xlabel='Theoretical quantiles', ylabel='Ordered quantiles'>
Extremely high agreement. I would expect this to be normally distributed. Let’s use Pingouin to test it.
# Tests the null-hypothesis: is this derived from a normal distribution
# with the default parameters
pg.normality(data['education'])
|   | W | pval | normal |
|---|---|------|--------|
| education | 0.975734 | 0.000027 | False |
Interesting, it rejected the hypothesis that this data is normally distributed. This is why you NEED to have documentation open when utilizing these tools.
https://pingouin-stats.org/build/html/generated/pingouin.normality.html
method: str Normality test.
‘shapiro’ (default) performs the Shapiro-Wilk test using scipy.stats.shapiro(),
‘normaltest’ performs the omnibus test of normality using scipy.stats.normaltest(),
‘jarque_bera’ performs the Jarque-Bera test using scipy.stats.jarque_bera().
The Omnibus and Jarque-Bera tests are more suitable than the Shapiro test for large samples.
pg.normality(data['education'], method='normaltest')
|   | W | pval | normal |
|---|---|------|--------|
| education | 5.843312 | 0.053844 | True |
pg.normality(data['education'], method='jarque_bera')
|   | W | pval | normal |
|---|---|------|--------|
| education | 5.585359 | 0.061257 | True |
Now we see that both of these methods and the qq-plot agree that this data is normally distributed.
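For convenience, all three checks can be run in one loop over the same column:

```python
# Run all three normality tests on the same column
for method in ['shapiro', 'normaltest', 'jarque_bera']:
    print(pg.normality(data['education'], method=method))
```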
Two group measurement#
sns.barplot(data = data,
x = 'sex',
y = 'education')
<Axes: xlabel='sex', ylabel='education'>
Handy-dandy t-test. It tests the null hypothesis (sketches of each call follow below):
One-sample: Does this sample have a mean of mu?
Two-sample (independent): Do these two distributions have the same mean value?
Two-sample (paired): Do these pre/post measurements of the same samples have the same mean value?
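A minimal sketch of all three variants, with made-up data (the arrays `a` and `b` below are hypothetical, not from the dataset):

```python
import numpy as np
import pingouin as pg

rng = np.random.default_rng(0)
a = rng.normal(loc=10, scale=2, size=30)  # made-up sample
b = rng.normal(loc=11, scale=2, size=30)  # made-up second sample

# One-sample: does `a` have a mean of mu = 10?
print(pg.ttest(a, 10))

# Two-sample (independent): do `a` and `b` have the same mean?
print(pg.ttest(a, b))

# Two-sample (paired): same subjects measured twice (pre/post)
print(pg.ttest(a, b, paired=True))
```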
male_edu = data.query('sex == "male"')['education']
female_edu = data.query('sex == "female"')['education']
pg.ttest(male_edu, female_edu)
|   | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|-----|-------------|-------|-------|---------|------|-------|
| T-test | 2.245961 | 202.108089 | two-sided | 0.025789 | [0.07, 1.13] | 0.263056 | 1.438 | 0.587972 |
Breaking this down:
`T = 2.245961`: The T-statistic: (mean_M - mean_F) / (s_pooled * sqrt(1/N_m + 1/N_f)).
`dof = 202.1`: The degrees of freedom after dropping missing values and applying the Welch correction for unequal variances.
`alternative = two-sided`: The mean of females can be larger or smaller than that of males.
`p-val = 0.025`: The likelihood of seeing this difference by chance.
`CI95% = [0.07, 1.13]`: The difference in means between males and females is between 0.07 and 1.13 years.
`cohen-d = 0.26`: The effect size: (mean_M - mean_F) / s_pooled.
`BF10 = 1.43`: Bayes Factor: P(D|H1)/P(D|H0), the likelihood of the data given the alternative hypothesis divided by the likelihood of the data under the null hypothesis.
`power = 0.587`: The likelihood of seeing an effect if there truly was one, given cohen-d and dof.
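To make those formulas concrete, a sketch that recomputes the T-statistic (Welch form, since pingouin corrects for unequal variances by default) and Cohen's d by hand:

```python
import numpy as np

m1, m2 = male_edu.mean(), female_edu.mean()
s1, s2 = male_edu.std(ddof=1), female_edu.std(ddof=1)
n1, n2 = male_edu.count(), female_edu.count()

# Welch T-statistic: difference in means over its standard error
t_stat = (m1 - m2) / np.sqrt(s1**2 / n1 + s2**2 / n2)

# Cohen's d: difference in means over the pooled standard deviation
s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m1 - m2) / s_pooled

print(f"T = {t_stat:.4f}, d = {d:.4f}")  # should match the table, up to rounding
```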
Multi-group measurement#
sns.barplot(data = data,
x = 'race',
y = 'education')
<Axes: xlabel='race', ylabel='education'>
Handy-dandy One-way ANOVA with post-hoc tests.
https://pingouin-stats.org/build/html/guidelines.html#anova
# First test for equal variances to pick correct ANOVA
pg.homoscedasticity(data, dv='education', group='race')
|   | W | pval | equal_var |
|---|---|------|-----------|
| levene | 1.292877 | 0.275901 | True |
pg.anova(data, dv='education', between='race', detailed=True)
|   | Source | SS | DF | MS | F | p-unc | np2 |
|---|--------|----|----|----|---|-------|-----|
| 0 | race | 131.796715 | 2 | 65.898357 | 13.37776 | 0.000003 | 0.076717 |
| 1 | Within | 1586.160208 | 322 | 4.925963 | NaN | NaN | NaN |
Here the ANOVA rejects the hypothesis that the mean is the same across races. However, it does not tell us which ones are different. That's where post-hoc testing comes in. Since we have equal variances across the groups, the guidelines suggest Tukey's post-hoc test.
pg.pairwise_tukey(data=data, dv='education', between='race')
|   | A | B | mean(A) | mean(B) | diff | se | T | p-tukey | hedges |
|---|---|---|---------|---------|------|----|----|---------|--------|
| 0 | AA | C | 12.077273 | 13.608108 | -1.530835 | 0.298258 | -5.132588 | 0.000001 | -0.697499 |
| 1 | AA | H | 12.077273 | 12.193548 | -0.116276 | 0.425785 | -0.273085 | 0.959729 | -0.054118 |
| 2 | C | H | 13.608108 | 12.193548 | 1.414560 | 0.474836 | 2.979050 | 0.008714 | 0.567577 |
This provided a table with a row for each comparison. `p-tukey` indicates the rows with significant differences.
We can also do multi-way ANOVAs; this is where you have more than one independent variable splitting your data. Here, we can look at how both race and sex impact educational attainment.
sns.barplot(data = data,
x = 'race',
hue = 'sex',
y = 'education')
<Axes: xlabel='race', ylabel='education'>
pg.anova(data, dv='education', between=['race', 'sex'], detailed=True)
|   | Source | SS | DF | MS | F | p-unc | np2 |
|---|--------|----|----|----|---|-------|-----|
| 0 | race | 133.744087 | 2.0 | 66.872043 | 13.700951 | 0.000002 | 0.079104 |
| 1 | sex | 27.055406 | 1.0 | 27.055406 | 5.543195 | 0.019158 | 0.017080 |
| 2 | race * sex | 2.119345 | 2.0 | 1.059672 | 0.217109 | 0.804961 | 0.001359 |
| 3 | Residual | 1556.985457 | 319.0 | 4.880832 | NaN | NaN | NaN |
This tells us that there is likely a difference between races and between sexes, but there likely isn't an interaction term. We can also run the post-hoc tests across all comparisons.
results = pg.pairwise_tests(data=data, dv='education', between=['race', 'sex'])
results
|   | Contrast | race | A | B | Paired | Parametric | T | dof | alternative | p-unc | BF10 | hedges |
|---|----------|------|---|---|--------|------------|---|-----|-------------|-------|------|--------|
| 0 | race | - | AA | C | False | True | -4.794172 | 110.376193 | two-sided | 0.000005 | 5562.177 | -0.697499 |
| 1 | race | - | AA | H | False | True | -0.247486 | 36.168163 | two-sided | 0.805932 | 0.209 | -0.054118 |
| 2 | race | - | C | H | False | True | 2.658937 | 55.713740 | two-sided | 0.010214 | 4.718 | 0.567577 |
| 3 | sex | - | female | male | False | True | -2.245961 | 202.108089 | two-sided | 0.025789 | 1.438 | -0.262445 |
| 4 | race * sex | AA | female | male | False | True | -1.826097 | 149.778935 | two-sided | 0.069827 | 0.739 | -0.254068 |
| 5 | race * sex | C | female | male | False | True | -1.230186 | 41.659681 | two-sided | 0.225529 | 0.481 | -0.313088 |
| 6 | race * sex | H | female | male | False | True | -1.322973 | 8.498605 | two-sided | 0.220332 | 0.753 | -0.461862 |
results.loc[results['p-unc'] < 0.05]
|   | Contrast | race | A | B | Paired | Parametric | T | dof | alternative | p-unc | BF10 | hedges |
|---|----------|------|---|---|--------|------------|---|-----|-------------|-------|------|--------|
| 0 | race | - | AA | C | False | True | -4.794172 | 110.376193 | two-sided | 0.000005 | 5562.177 | -0.697499 |
| 2 | race | - | C | H | False | True | 2.658937 | 55.713740 | two-sided | 0.010214 | 4.718 | 0.567577 |
| 3 | sex | - | female | male | False | True | -2.245961 | 202.108089 | two-sided | 0.025789 | 1.438 | -0.262445 |
If we had repeated measures for individuals (i.e. biological replicates) we could also add `within='sample'` along with a `subject` identifier column. This would help correct for within-sample variability.
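A hypothetical sketch of what that would look like, assuming a long-format table with `subject` and `timepoint` columns (neither exists in this dataset; the data below is fabricated for illustration):

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: 20 subjects each measured pre and post
rng = np.random.default_rng(1)
long_df = pd.DataFrame({
    'subject': np.repeat(np.arange(20), 2),
    'timepoint': ['pre', 'post'] * 20,
    'score': rng.normal(0, 1, size=40),
})

# `within=` plus `subject=` tells pingouin which rows are paired
pg.pairwise_tests(data=long_df, dv='score',
                  within='timepoint', subject='subject')
```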
Non-parametric comparisons#
Q3: Use the appropriate non-parametric method.#
Let’s pretend that we believed the first Shapiro-Wilk normality test and instead used non-parametric methods.
Use the flowchart below to test whether there is a difference in education level across `race`.
https://pingouin-stats.org/build/html/guidelines.html#non-parametric
# Use the appropriate test from the Guidelines
q3_stats = pg.kruskal(data, dv='education', between='race') # SOLUTION
q3_stats
|   | Source | ddof1 | H | p-unc |
|---|--------|-------|---|-------|
| Kruskal | race | 2 | 22.498491 | 0.000013 |
# Use the appropriate post-hoc test from the Guidelines
q3_post_hoc = pg.pairwise_tests(data, dv='education', between='race', parametric=False) # SOLUTION
q3_post_hoc
|   | Contrast | A | B | Paired | Parametric | U-val | alternative | p-unc | hedges |
|---|----------|---|---|--------|------------|-------|-------------|-------|--------|
| 0 | race | AA | C | False | False | 5199.0 | two-sided | 0.000003 | -0.697499 |
| 1 | race | AA | H | False | False | 3482.0 | two-sided | 0.848419 | -0.054118 |
| 2 | race | C | H | False | False | 1540.5 | two-sided | 0.005371 | 0.567577 |
Compare this answer to the parametric method.
# Are they the 'same' or 'different'
q3_comparison = 'same' # SOLUTION
grader.check("q3_nonparametric")
We should see that we’ll get the same answer between parametric and non-parametric methods.
We’ve now seen a handful of methods for testing the difference between populations. In the lab, you will use these skills to test the effects of different ART drugs on neurocognitive impairment.
grader.check_all()