Work in progress!

Walkthrough#

Remember, all assignments are due before the synchronous session.

Introduction#

Get ready to dive into some data analysis as we explore the effectiveness of a hypothetical HIV treatment trial. In this walkthrough, we have a dataset containing information from 30 people living with HIV (PLWH) who were randomly assigned to a treatment or control group. After receiving the treatment, they stopped their ART and were monitored weekly for the number of weeks until their first “detectable” viral load was found. We will use Pandas to analyze this data and evaluate the treatment’s effectiveness. By the end of this activity, you will be proficient in loading spreadsheet data into Python, creating derived columns in DataFrames, and using summary methods like sum, mean, and max. Let’s get started!

Learning Objectives#

At the end of this learning activity you will be able to:

  • Practice loading spreadsheet data into Python using pandas.

  • Use Python methods to create derived columns in pd.DataFrames.

  • Use Pandas summary methods like sum, mean, and max.

  • Employ basic filtering and data extraction from pandas.

Dataset Reference#

File: trial_data.csv

Columns:

  • age : (years) Current age during the study.

  • age_initial_infection : (years) Age at which the participant was initially infected.

  • initial_viral_load : (copies/ul) The level of infection at the start of the study.

  • treatment : (boolean) True for participant in the treatment group, False for those in the control group.

  • weeks_to_failure : (weeks) Time from the treatment to the first week of uncontrolled viral load.

Imports#

While basic Python can do a lot, you have to do everything yourself. The real power of Python is that you can import code that is written by others.

For this course, we will use a common data science stack of interoperable tools centered around the Numpy.

There are four that we will use regularly, two of which we’ll cover today.

Numpy#

Numpy

A numerical Python library that contains incredibly fast arrays, mathematical functions, and other useful utilities.

By convention, the community tends to alias the long numpy as np.

import numpy as np

Pandas#

Pandas

A libary that sits atop numpy and provides a spreadsheet style object called a DataFrame along with a plethora of data sciecne utilities. This is the main tool we will be using for data exploration.

By convention, the community tends to alias the long pandas as pd.

import pandas as pd

Nicely, it can read csv files for us.

trial_df = pd.read_csv('trial_data.csv')

# If a `DataFrame` is the last line, it will display a nice summary
trial_df
age age_initial_infection initial_viral_load treatment weeks_to_failure
0 55 26 66 False 3
1 48 26 66 False 4
2 45 36 32 True 6
3 43 31 23 False 5
4 40 20 45 True 5
5 42 20 57 True 9
6 55 31 23 False 4
7 56 50 22 False 4
8 59 33 33 False 5
9 51 30 49 True 7
10 55 21 94 False 3
11 53 42 85 True 5
12 40 34 27 True 8
13 48 41 99 False 3
14 56 41 59 False 6
15 53 47 38 True 7
16 57 41 42 True 8
17 48 33 57 False 4
18 51 42 25 False 2
19 55 46 45 False 1
20 43 24 46 False 1
21 48 37 99 True 8
22 51 27 36 False 2
23 43 34 48 True 7
24 51 43 88 False 2
25 49 20 76 False 5
26 54 47 74 False 5
27 45 25 87 True 5
28 59 40 49 False 5
29 51 43 38 True 8

And we should see that this exactly matches the table we saw in Excel.

The object we got back is called a DataFrame.

type(trial_df)
pandas.core.frame.DataFrame

If we only want to see a small version of the DataFrame we can use the .head() method.

trial_df.head()
age age_initial_infection initial_viral_load treatment weeks_to_failure
0 55 26 66 False 3
1 48 26 66 False 4
2 45 36 32 True 6
3 43 31 23 False 5
4 40 20 45 True 5

Acting on Columns#

We can reference each column by name using square brackets []. For example: Extracting the age column like so:

trial_df['age']
0     55
1     48
2     45
3     43
4     40
5     42
6     55
7     56
8     59
9     51
10    55
11    53
12    40
13    48
14    56
15    53
16    57
17    48
18    51
19    55
20    43
21    48
22    51
23    43
24    51
25    49
26    54
27    45
28    59
29    51
Name: age, dtype: int64

Q1: Extract the initial_viral_load column ?#

init_vl = trial_df['initial_viral_load']  # SOLUTION
print('init_vl is a `pd.Series`:', isinstance(init_vl, pd.Series))
init_vl is a `pd.Series`: True
print(f'init_vl_sum = {init_vl.sum()}')
init_vl_sum = 1628

Once we can extract columns, we can start summarizing them.

age_col = trial_df['age']
age_mean = age_col.mean()
print(f'The mean age of the population is {age_mean:0.1f} yrs.')
The mean age of the population is 50.1 yrs.

Expressions can also be chained. They are functionally the same, the only difference is aesthetic.

age_mean_short = trial_df['age'].mean()
print(f'The mean age of the population is {age_mean_short:0.1f} yrs, even when done on a single line.')
The mean age of the population is 50.1 yrs, even when done on a single line.

Q2: Calculate the average weeks_to_failure for the whole population?#

average_weeks = trial_df['weeks_to_failure'].mean()  # SOLUTION
print(f'average_weeks = {average_weeks:0.1f}')
average_weeks = 4.9

We can also summarize an entire DataFrame with a single command.

trial_df.mean()
age                      50.133333
age_initial_infection    34.366667
initial_viral_load       54.266667
treatment                 0.400000
weeks_to_failure          4.900000
dtype: float64

In this case the summary went down the columns and calculated a mean for each.

There are a number of other summarization methods.

  • max()

  • min()

  • mode()

  • median()

  • var()

  • std()

  • nunique()

trial_df.describe()
age age_initial_infection initial_viral_load weeks_to_failure
count 30.000000 30.000000 30.000000 30.000000
mean 50.133333 34.366667 54.266667 4.900000
std 5.569209 9.041984 24.070204 2.202663
min 40.000000 20.000000 22.000000 1.000000
25% 45.750000 26.250000 36.500000 3.250000
50% 51.000000 34.000000 48.500000 5.000000
75% 55.000000 41.750000 72.000000 6.750000
max 59.000000 50.000000 99.000000 9.000000

Selecting columns is nice. We can also add a new column based on another one.

In HIV research it is often important to know how long someone has been living with HIV. However, this dataset contains their current age, and their age at infection. We can use these two to calculate the length.

# first make a new `Series`
years_infected = trial_df['age'] - trial_df['age_initial_infection']

# Then add that series into the table
trial_df['years_infected'] = years_infected
trial_df.head()
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
0 55 26 66 False 3 29
1 48 26 66 False 4 22
2 45 36 32 True 6 9
3 43 31 23 False 5 12
4 40 20 45 True 5 20
# Alternatively
trial_df['years_infected'] = trial_df['age'] - trial_df['age_initial_infection']
trial_df.head()
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
0 55 26 66 False 3 29
1 48 26 66 False 4 22
2 45 36 32 True 6 9
3 43 31 23 False 5 12
4 40 20 45 True 5 20

Acting on Rows#

Indexing#

When selecting rows, or rows and columns, we need to use the .loc attribute of the DataFrame.

We can select by row number.

trial_df.loc[0]
age                         55
age_initial_infection       26
initial_viral_load          66
treatment                False
weeks_to_failure             3
years_infected              29
Name: 0, dtype: object
# We can use a : to indicate a range.
trial_df.loc[0:10]
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
0 55 26 66 False 3 29
1 48 26 66 False 4 22
2 45 36 32 True 6 9
3 43 31 23 False 5 12
4 40 20 45 True 5 20
5 42 20 57 True 9 22
6 55 31 23 False 4 24
7 56 50 22 False 4 6
8 59 33 33 False 5 26
9 51 30 49 True 7 21
10 55 21 94 False 3 34
# We can provide an arbitrary list
trial_df.loc[[0, 5, 7, 13]]
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
0 55 26 66 False 3 29
5 42 20 57 True 9 22
7 56 50 22 False 4 6
13 48 41 99 False 3 7
# We can also select columns at the same time.
trial_df.loc[[0, 5, 7, 13], ['initial_viral_load', 'age']]
initial_viral_load age
0 66 55
5 57 42
7 22 56
13 99 48

Boolean Indexing#

If we do not know the row number ahead of time, but instead want to select rows based on their values, we can using boolean indexing. In this stragey we create a new pd.Series of True/False values where True corresponds to the ones we want.

Start by finding everyone over 50 years old.

age_mask = trial_df['age'] > 50
aged_samples = trial_df.loc[age_mask]
aged_samples.head()
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
0 55 26 66 False 3 29
6 55 31 23 False 4 24
7 56 50 22 False 4 6
8 59 33 33 False 5 26
9 51 30 49 True 7 21

Now, if we also wanted to split by the initial_viral_load we might do:

high_vl_mask = trial_df['initial_viral_load'] > 50
aged_high_vl = trial_df.loc[age_mask & high_vl_mask]
aged_high_vl.head()
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
0 55 26 66 False 3 29
10 55 21 94 False 3 34
11 53 42 85 True 5 11
14 56 41 59 False 6 15
24 51 43 88 False 2 8
# ~ can be used to say "not"
aged_low_vl = trial_df.loc[age_mask & ~high_vl_mask]
aged_low_vl.head()
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
6 55 31 23 False 4 24
7 56 50 22 False 4 6
8 59 33 33 False 5 26
9 51 30 49 True 7 21
15 53 47 38 True 7 6

Q3: Calculate the average weeks to failure for the treated population?#

treated_mask = trial_df['treatment'] == True  # SOLUTION NO PROMPT
treated_average_weeks = trial_df.loc[treated_mask, 'weeks_to_failure'].mean()  # SOLUTION
print(f'treated_average_weeks = {treated_average_weeks:0.1f}')
treated_average_weeks = 6.9

Utilizing boolean indexing you can express any algorithmic row selecting strategy. This can even include comparisons between rows, for example if there were multiple rows of the same sample. We will cover these strategies later in the course.

Sometimes, our searches are simple. Pandas also includes another method for indexing rows called .query() for these purposes.

Querying#

.query() is an interface that facilitates simple queries qith a few specific limitations:

  • It can only use the information present in the row.

  • It can only work on one row at a time.

  • Column headers cannot contain spaces, dots, dashes, commas, or emoji.

Our questions on this dataset easily fit within those constraints.

# All treatment rows
trial_df.query('treatment == True').head()
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
2 45 36 32 True 6 9
4 40 20 45 True 5 20
5 42 20 57 True 9 22
9 51 30 49 True 7 21
11 53 42 85 True 5 11
trial_df.query('treatment == False').head()
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
0 55 26 66 False 3 29
1 48 26 66 False 4 22
3 43 31 23 False 5 12
6 55 31 23 False 4 24
7 56 50 22 False 4 6

You can also make them more complex.

trial_df.query('age > 33 & treatment == True')
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
2 45 36 32 True 6 9
4 40 20 45 True 5 20
5 42 20 57 True 9 22
9 51 30 49 True 7 21
11 53 42 85 True 5 11
12 40 34 27 True 8 6
15 53 47 38 True 7 6
16 57 41 42 True 8 16
21 48 37 99 True 8 11
23 43 34 48 True 7 9
27 45 25 87 True 5 20
29 51 43 38 True 8 8

This statement doesn’t make a “biological sense”, but it is an example of a valid comparison.

trial_df.query('age >= initial_viral_load')
age age_initial_infection initial_viral_load treatment weeks_to_failure years_infected
2 45 36 32 True 6 9
3 43 31 23 False 5 12
6 55 31 23 False 4 24
7 56 50 22 False 4 6
8 59 33 33 False 5 26
9 51 30 49 True 7 21
12 40 34 27 True 8 6
15 53 47 38 True 7 6
16 57 41 42 True 8 16
18 51 42 25 False 2 9
19 55 46 45 False 1 9
22 51 27 36 False 2 24
28 59 40 49 False 5 19
29 51 43 38 True 8 8

Q4: Calculate the average weeks_to_failure for the untreated population?#

# BEGIN SOLUTION NO PROMPT

wanted_samples = trial_df.query('treatment == False')

# END SOLUTION

untreated_average_weeks = wanted_samples['weeks_to_failure'].mean()  # SOLUTION
print(f'Untreated participants took {untreated_average_weeks:0.1f} weeks to rebound.')
Untreated participants took 3.6 weeks to rebound.
print('untreated_average_weeks is a `float`:', isinstance(untreated_average_weeks, float))
untreated_average_weeks is a `float`: True
print(f'untreated_average_weeks = {untreated_average_weeks:0.1f}')
untreated_average_weeks = 3.6

Q4: Calculate the average weeks_to_failure for the treated population?#

# BEGIN SOLUTION NO PROMPT

wanted_samples = trial_df.query('treatment == True')

# END SOLUTION

treated_average_weeks = wanted_samples['weeks_to_failure'].mean()  # SOLUTION
print(f'Treated patients took {treated_average_weeks:0.1f} weeks to rebound.')
Treated patients took 6.9 weeks to rebound.
print('treated_average_weeks is a `float`:', isinstance(treated_average_weeks, float))
treated_average_weeks is a `float`: True
print(f'treated_average_weeks = {treated_average_weeks:0.1f}')
treated_average_weeks = 6.9

Conclusion#

We can see that this treatment extended the average time off ART from ~3 weeks to ~7 weeks. While not a complete cure, any incremental step is useful progress in the elimination of HIV.

In the lab you will use similar techniques to explore whether other factors in this dataset impact the results. In future weeks we will explore statistical techniques to understand whether this difference is due to chance, or due to the effect of the treatment.