Walkthrough#
Introduction#
Get ready to dive into some data analysis as we explore the effectiveness of a hypothetical HIV treatment trial.
In this walkthrough, we have a dataset containing information from 30 people living with HIV (PLWH) who were randomly assigned to a treatment or control group.
After receiving the treatment, they stopped their ART and were monitored weekly for the number of weeks until their first “detectable” viral load was found.
We will use Pandas
to analyze this data and evaluate the treatment’s effectiveness.
By the end of this activity, you will be proficient in loading spreadsheet data into Python, creating derived columns in DataFrames
, and using summary methods like sum, mean, and max.
Let’s get started!
Learning Objectives#
At the end of this learning activity you will be able to:
Practice loading spreadsheet data into Python using
pandas
.Use Python methods to create derived columns in
pd.DataFrames
.Use
Pandas
summary methods like sum, mean, and max.Employ basic filtering and data extraction from
pandas
.
Dataset Reference#
File: trial_data.csv
Columns:
age
: (years) Current age during the study.age_initial_infection
: (years) Age at which the participant was initially infected.initial_viral_load
: (copies/ul) The level of infection at the start of the study.treatment
: (boolean)True
for participant in the treatment group,False
for those in the control group.weeks_to_failure
: (weeks) Time from the treatment to the first week of uncontrolled viral load.
Imports#
While basic Python can do a lot, you have to do everything yourself.
The real power of Python is that you can import
code that is written by others.
For this course, we will use a common data science stack of interoperable tools centered around the Numpy.
There are four that we will use regularly, two of which we’ll cover today.
Numpy#
A numerical Python library that contains incredibly fast arrays, mathematical functions, and other useful utilities.
By convention, the community tends to alias the long numpy
as np
.
import numpy as np
Pandas#
A libary that sits atop numpy
and provides a spreadsheet style object called a DataFrame
along with a plethora of data sciecne utilities.
This is the main tool we will be using for data exploration.
By convention, the community tends to alias the long pandas
as pd
.
import pandas as pd
Nicely, it can read csv
files for us.
trial_df = pd.read_csv('trial_data.csv')
# If a `DataFrame` is the last line, it will display a nice summary
trial_df
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | |
---|---|---|---|---|---|
0 | 55 | 26 | 66 | False | 3 |
1 | 48 | 26 | 66 | False | 4 |
2 | 45 | 36 | 32 | True | 6 |
3 | 43 | 31 | 23 | False | 5 |
4 | 40 | 20 | 45 | True | 5 |
5 | 42 | 20 | 57 | True | 9 |
6 | 55 | 31 | 23 | False | 4 |
7 | 56 | 50 | 22 | False | 4 |
8 | 59 | 33 | 33 | False | 5 |
9 | 51 | 30 | 49 | True | 7 |
10 | 55 | 21 | 94 | False | 3 |
11 | 53 | 42 | 85 | True | 5 |
12 | 40 | 34 | 27 | True | 8 |
13 | 48 | 41 | 99 | False | 3 |
14 | 56 | 41 | 59 | False | 6 |
15 | 53 | 47 | 38 | True | 7 |
16 | 57 | 41 | 42 | True | 8 |
17 | 48 | 33 | 57 | False | 4 |
18 | 51 | 42 | 25 | False | 2 |
19 | 55 | 46 | 45 | False | 1 |
20 | 43 | 24 | 46 | False | 1 |
21 | 48 | 37 | 99 | True | 8 |
22 | 51 | 27 | 36 | False | 2 |
23 | 43 | 34 | 48 | True | 7 |
24 | 51 | 43 | 88 | False | 2 |
25 | 49 | 20 | 76 | False | 5 |
26 | 54 | 47 | 74 | False | 5 |
27 | 45 | 25 | 87 | True | 5 |
28 | 59 | 40 | 49 | False | 5 |
29 | 51 | 43 | 38 | True | 8 |
Note
I often use the suffix _df
when I create pd.DataFrames
.
It is not required, but utilizing naming conventions makes your code easier to understand by yourself and others.
And we should see that this exactly matches the table we saw in Excel.
The object we got back is called a DataFrame
.
type(trial_df)
pandas.core.frame.DataFrame
If we only want to see a small version of the DataFrame
we can use the .head()
method.
trial_df.head()
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | |
---|---|---|---|---|---|
0 | 55 | 26 | 66 | False | 3 |
1 | 48 | 26 | 66 | False | 4 |
2 | 45 | 36 | 32 | True | 6 |
3 | 43 | 31 | 23 | False | 5 |
4 | 40 | 20 | 45 | True | 5 |
Acting on Columns#
We can reference each column by name using square brackets []
.
For example: Extracting the age
column like so:
trial_df['age']
0 55
1 48
2 45
3 43
4 40
5 42
6 55
7 56
8 59
9 51
10 55
11 53
12 40
13 48
14 56
15 53
16 57
17 48
18 51
19 55
20 43
21 48
22 51
23 43
24 51
25 49
26 54
27 45
28 59
29 51
Name: age, dtype: int64
Q2: Calculate the average weeks_to_failure
for the whole population?#
average_weeks = trial_df['weeks_to_failure'].mean() # SOLUTION
grader.check("q2_pop_weeks_to_failure")
We can also summarize an entire DataFrame
with a single command.
trial_df.mean()
age 50.133333
age_initial_infection 34.366667
initial_viral_load 54.266667
treatment 0.400000
weeks_to_failure 4.900000
dtype: float64
In this case the summary went down the columns and calculated a mean for each.
There are a number of other summarization methods.
max()
min()
mode()
median()
var()
std()
nunique()
Note
Methods, are functions that are attached to an object
.
They usually act on the object to provide a summary, perform a transformation, or otherwise utilize the information within the object.
In this case, these summarization methods utilize the information within the trial_df
dataframe to summarize each column.
trial_df.describe()
age | age_initial_infection | initial_viral_load | weeks_to_failure | |
---|---|---|---|---|
count | 30.000000 | 30.000000 | 30.000000 | 30.000000 |
mean | 50.133333 | 34.366667 | 54.266667 | 4.900000 |
std | 5.569209 | 9.041984 | 24.070204 | 2.202663 |
min | 40.000000 | 20.000000 | 22.000000 | 1.000000 |
25% | 45.750000 | 26.250000 | 36.500000 | 3.250000 |
50% | 51.000000 | 34.000000 | 48.500000 | 5.000000 |
75% | 55.000000 | 41.750000 | 72.000000 | 6.750000 |
max | 59.000000 | 50.000000 | 99.000000 | 9.000000 |
Selecting columns is nice. We can also add a new column based on another one.
In HIV research it is often important to know how long someone has been living with HIV. However, this dataset contains their current age, and their age at infection. We can use these two to calculate the length.
# first make a new `Series`
years_infected = trial_df['age'] - trial_df['age_initial_infection']
# Then add that series into the table
trial_df['years_infected'] = years_infected
trial_df.head()
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
0 | 55 | 26 | 66 | False | 3 | 29 |
1 | 48 | 26 | 66 | False | 4 | 22 |
2 | 45 | 36 | 32 | True | 6 | 9 |
3 | 43 | 31 | 23 | False | 5 | 12 |
4 | 40 | 20 | 45 | True | 5 | 20 |
# Alternatively
trial_df['years_infected'] = trial_df['age'] - trial_df['age_initial_infection']
trial_df.head()
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
0 | 55 | 26 | 66 | False | 3 | 29 |
1 | 48 | 26 | 66 | False | 4 | 22 |
2 | 45 | 36 | 32 | True | 6 | 9 |
3 | 43 | 31 | 23 | False | 5 | 12 |
4 | 40 | 20 | 45 | True | 5 | 20 |
Acting on Rows#
Indexing#
When selecting rows, or rows and columns, we need to use the .loc
attribute of the DataFrame
.
We can select by row number.
trial_df.loc[0]
age 55
age_initial_infection 26
initial_viral_load 66
treatment False
weeks_to_failure 3
years_infected 29
Name: 0, dtype: object
# We can use a : to indicate a range.
trial_df.loc[0:10]
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
0 | 55 | 26 | 66 | False | 3 | 29 |
1 | 48 | 26 | 66 | False | 4 | 22 |
2 | 45 | 36 | 32 | True | 6 | 9 |
3 | 43 | 31 | 23 | False | 5 | 12 |
4 | 40 | 20 | 45 | True | 5 | 20 |
5 | 42 | 20 | 57 | True | 9 | 22 |
6 | 55 | 31 | 23 | False | 4 | 24 |
7 | 56 | 50 | 22 | False | 4 | 6 |
8 | 59 | 33 | 33 | False | 5 | 26 |
9 | 51 | 30 | 49 | True | 7 | 21 |
10 | 55 | 21 | 94 | False | 3 | 34 |
# We can provide an arbitrary list
trial_df.loc[[0, 5, 7, 13]]
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
0 | 55 | 26 | 66 | False | 3 | 29 |
5 | 42 | 20 | 57 | True | 9 | 22 |
7 | 56 | 50 | 22 | False | 4 | 6 |
13 | 48 | 41 | 99 | False | 3 | 7 |
# We can also select columns at the same time.
trial_df.loc[[0, 5, 7, 13], ['initial_viral_load', 'age']]
initial_viral_load | age | |
---|---|---|
0 | 66 | 55 |
5 | 57 | 42 |
7 | 22 | 56 |
13 | 99 | 48 |
Boolean Indexing#
If we do not know the row number ahead of time, but instead want to select rows based on their values, we can using boolean indexing.
In this stragey we create a new pd.Series
of True/False values where True corresponds to the ones we want.
Start by finding everyone over 50 years old.
age_mask = trial_df['age'] > 50
aged_samples = trial_df.loc[age_mask]
aged_samples.head()
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
0 | 55 | 26 | 66 | False | 3 | 29 |
6 | 55 | 31 | 23 | False | 4 | 24 |
7 | 56 | 50 | 22 | False | 4 | 6 |
8 | 59 | 33 | 33 | False | 5 | 26 |
9 | 51 | 30 | 49 | True | 7 | 21 |
Note
I often use the suffix _mask
when I create boolean indexes.
It is not required, but utilizing naming conventions makes your code easier to understand by yourself and others.
Now, if we also wanted to split by the initial_viral_load we might do:
high_vl_mask = trial_df['initial_viral_load'] > 50
aged_high_vl = trial_df.loc[age_mask & high_vl_mask]
aged_high_vl.head()
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
0 | 55 | 26 | 66 | False | 3 | 29 |
10 | 55 | 21 | 94 | False | 3 | 34 |
11 | 53 | 42 | 85 | True | 5 | 11 |
14 | 56 | 41 | 59 | False | 6 | 15 |
24 | 51 | 43 | 88 | False | 2 | 8 |
# ~ can be used to say "not"
aged_low_vl = trial_df.loc[age_mask & ~high_vl_mask]
aged_low_vl.head()
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
6 | 55 | 31 | 23 | False | 4 | 24 |
7 | 56 | 50 | 22 | False | 4 | 6 |
8 | 59 | 33 | 33 | False | 5 | 26 |
9 | 51 | 30 | 49 | True | 7 | 21 |
15 | 53 | 47 | 38 | True | 7 | 6 |
Q3: Calculate the average weeks to failure for the treated population?#
treated_mask = trial_df['treatment'] == True # SOLUTION NO PROMPT
treated_average_weeks = trial_df.loc[treated_mask, 'weeks_to_failure'].mean() # SOLUTION
grader.check("q3_treated_weeks_to_failure_indexing")
Utilizing boolean indexing you can express any algorithmic row selecting strategy. This can even include comparisons between rows, for example if there were multiple rows of the same sample. We will cover these strategies later in the course.
Sometimes, our searches are simple.
Pandas also includes another method for indexing rows called .query()
for these purposes.
Querying#
.query()
is an interface that facilitates simple queries qith a few specific limitations:
It can only use the information present in the row.
It can only work on one row at a time.
Column headers cannot contain spaces, dots, dashes, commas, or emoji.
Our questions on this dataset easily fit within those constraints.
# All treatment rows
trial_df.query('treatment == True').head()
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
2 | 45 | 36 | 32 | True | 6 | 9 |
4 | 40 | 20 | 45 | True | 5 | 20 |
5 | 42 | 20 | 57 | True | 9 | 22 |
9 | 51 | 30 | 49 | True | 7 | 21 |
11 | 53 | 42 | 85 | True | 5 | 11 |
trial_df.query('treatment == False').head()
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
0 | 55 | 26 | 66 | False | 3 | 29 |
1 | 48 | 26 | 66 | False | 4 | 22 |
3 | 43 | 31 | 23 | False | 5 | 12 |
6 | 55 | 31 | 23 | False | 4 | 24 |
7 | 56 | 50 | 22 | False | 4 | 6 |
You can also make them more complex.
trial_df.query('age > 33 & treatment == True')
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
2 | 45 | 36 | 32 | True | 6 | 9 |
4 | 40 | 20 | 45 | True | 5 | 20 |
5 | 42 | 20 | 57 | True | 9 | 22 |
9 | 51 | 30 | 49 | True | 7 | 21 |
11 | 53 | 42 | 85 | True | 5 | 11 |
12 | 40 | 34 | 27 | True | 8 | 6 |
15 | 53 | 47 | 38 | True | 7 | 6 |
16 | 57 | 41 | 42 | True | 8 | 16 |
21 | 48 | 37 | 99 | True | 8 | 11 |
23 | 43 | 34 | 48 | True | 7 | 9 |
27 | 45 | 25 | 87 | True | 5 | 20 |
29 | 51 | 43 | 38 | True | 8 | 8 |
This statement doesn’t make a “biological sense”, but it is an example of a valid comparison.
trial_df.query('age >= initial_viral_load')
age | age_initial_infection | initial_viral_load | treatment | weeks_to_failure | years_infected | |
---|---|---|---|---|---|---|
2 | 45 | 36 | 32 | True | 6 | 9 |
3 | 43 | 31 | 23 | False | 5 | 12 |
6 | 55 | 31 | 23 | False | 4 | 24 |
7 | 56 | 50 | 22 | False | 4 | 6 |
8 | 59 | 33 | 33 | False | 5 | 26 |
9 | 51 | 30 | 49 | True | 7 | 21 |
12 | 40 | 34 | 27 | True | 8 | 6 |
15 | 53 | 47 | 38 | True | 7 | 6 |
16 | 57 | 41 | 42 | True | 8 | 16 |
18 | 51 | 42 | 25 | False | 2 | 9 |
19 | 55 | 46 | 45 | False | 1 | 9 |
22 | 51 | 27 | 36 | False | 2 | 24 |
28 | 59 | 40 | 49 | False | 5 | 19 |
29 | 51 | 43 | 38 | True | 8 | 8 |
Q4: Calculate the average weeks_to_failure
for the untreated population?#
# BEGIN SOLUTION NO PROMPT
wanted_samples = trial_df.query('treatment == False')
# END SOLUTION
untreated_average_weeks = wanted_samples['weeks_to_failure'].mean() # SOLUTION
print(f'Untreated participants took {untreated_average_weeks:0.1f} weeks to rebound.')
Untreated participants took 3.6 weeks to rebound.
grader.check("q4_untreated_weeks_to_failure")
Q4: Calculate the average weeks_to_failure
for the treated population?#
# BEGIN SOLUTION NO PROMPT
wanted_samples = trial_df.query('treatment == True')
# END SOLUTION
treated_average_weeks = wanted_samples['weeks_to_failure'].mean() # SOLUTION
print(f'Treated patients took {treated_average_weeks:0.1f} weeks to rebound.')
Treated patients took 6.9 weeks to rebound.
grader.check("q4_treated_weeks_to_failure")
Conclusion#
We can see that this treatment extended the average time off ART from ~3 weeks to ~7 weeks. While not a complete cure, any incremental step is useful progress in the elimination of HIV.
In the lab you will use similar techniques to explore whether other factors in this dataset impact the results. In future weeks we will explore statistical techniques to understand whether this difference is due to chance, or due to the effect of the treatment.