Walkthrough

Walkthrough#

Introduction#

Get ready to dive into some data analysis as we explore the effectiveness of a hypothetical HIV treatment trial. In this walkthrough, we have a dataset containing information from 30 people living with HIV (PLWH) who were randomly assigned to a treatment or control group. After receiving the treatment, they stopped their ART and were monitored weekly for the number of weeks until their first “detectable” viral load was found. We will use Pandas to analyze this data and evaluate the treatment’s effectiveness. By the end of this activity, you will be proficient in loading spreadsheet data into Python, creating derived columns in DataFrames, and using summary methods like sum, mean, and max. Let’s get started!

Learning Objectives#

At the end of this learning activity you will be able to:

Practice loading spreadsheet data into Python using pandas.
Use Python methods to create derived columns in pd.DataFrames.
Use Pandas summary methods like sum, mean, and max.
Employ basic filtering and data extraction from pandas.

Dataset Reference#

File: trial_data.csv

Columns:

age : (years) Current age during the study.
age_initial_infection : (years) Age at which the participant was initially infected.
initial_viral_load : (copies/ul) The level of infection at the start of the study.
treatment : (boolean) True for participant in the treatment group, False for those in the control group.
weeks_to_failure : (weeks) Time from the treatment to the first week of uncontrolled viral load.

Imports#

While basic Python can do a lot, you have to do everything yourself. The real power of Python is that you can import code that is written by others.

For this course, we will use a common data science stack of interoperable tools centered around the Numpy.

There are four that we will use regularly, two of which we’ll cover today.

Numpy#

Numpy

A numerical Python library that contains incredibly fast arrays, mathematical functions, and other useful utilities.

By convention, the community tends to alias the long numpy as np.

import numpy as np

Pandas#

Pandas

A libary that sits atop numpy and provides a spreadsheet style object called a DataFrame along with a plethora of data sciecne utilities. This is the main tool we will be using for data exploration.

By convention, the community tends to alias the long pandas as pd.

import pandas as pd

Nicely, it can read csv files for us.

trial_df = pd.read_csv('trial_data.csv')

# If a `DataFrame` is the last line, it will display a nice summary
trial_df

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure
0	55	26	66	False	3
1	48	26	66	False	4
2	45	36	32	True	6
3	43	31	23	False	5
4	40	20	45	True	5
5	42	20	57	True	9
6	55	31	23	False	4
7	56	50	22	False	4
8	59	33	33	False	5
9	51	30	49	True	7
10	55	21	94	False	3
11	53	42	85	True	5
12	40	34	27	True	8
13	48	41	99	False	3
14	56	41	59	False	6
15	53	47	38	True	7
16	57	41	42	True	8
17	48	33	57	False	4
18	51	42	25	False	2
19	55	46	45	False	1
20	43	24	46	False	1
21	48	37	99	True	8
22	51	27	36	False	2
23	43	34	48	True	7
24	51	43	88	False	2
25	49	20	76	False	5
26	54	47	74	False	5
27	45	25	87	True	5
28	59	40	49	False	5
29	51	43	38	True	8

Note

I often use the suffix _df when I create pd.DataFrames. It is not required, but utilizing naming conventions makes your code easier to understand by yourself and others.

And we should see that this exactly matches the table we saw in Excel.

The object we got back is called a DataFrame.

type(trial_df)

pandas.core.frame.DataFrame

If we only want to see a small version of the DataFrame we can use the .head() method.

trial_df.head()

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure
0	55	26	66	False	3
1	48	26	66	False	4
2	45	36	32	True	6
3	43	31	23	False	5
4	40	20	45	True	5

Acting on Columns#

We can reference each column by name using square brackets []. For example: Extracting the age column like so:

trial_df['age']

   55
   48
   45
   43
   40
   42
   55
   56
   59
   51
  55
  53
  40
  48
  56
  53
  57
  48
  51
  55
  43
  48
  51
  43
  51
  49
  54
  45
  59
  51
Name: age, dtype: int64

Q1: Extract the `initial_viral_load` column ?#

init_vl = trial_df['initial_viral_load']  # SOLUTION

grader.check("q1_init_vl")

Once we can extract columns, we can start summarizing them.

age_col = trial_df['age']
age_mean = age_col.mean()
print(f'The mean age of the population is {age_mean:0.1f} yrs.')

The mean age of the population is 50.1 yrs.

Expressions can also be chained. They are functionally the same, the only difference is aesthetic.

age_mean_short = trial_df['age'].mean()
print(f'The mean age of the population is {age_mean_short:0.1f} yrs, even when done on a single line.')

The mean age of the population is 50.1 yrs, even when done on a single line.

Q2: Calculate the average `weeks_to_failure` for the whole population?#

average_weeks = trial_df['weeks_to_failure'].mean()  # SOLUTION

grader.check("q2_pop_weeks_to_failure")

We can also summarize an entire DataFrame with a single command.

trial_df.mean()

age                      50.133333
age_initial_infection    34.366667
initial_viral_load       54.266667
treatment                 0.400000
weeks_to_failure          4.900000
dtype: float64

In this case the summary went down the columns and calculated a mean for each.

There are a number of other summarization methods.

max()
min()
mode()
median()
var()
std()
nunique()

Note

Methods, are functions that are attached to an object. They usually act on the object to provide a summary, perform a transformation, or otherwise utilize the information within the object. In this case, these summarization methods utilize the information within the trial_df dataframe to summarize each column.

trial_df.describe()

	age	age_initial_infection	initial_viral_load	weeks_to_failure
count	30.000000	30.000000	30.000000	30.000000
mean	50.133333	34.366667	54.266667	4.900000
std	5.569209	9.041984	24.070204	2.202663
min	40.000000	20.000000	22.000000	1.000000
25%	45.750000	26.250000	36.500000	3.250000
50%	51.000000	34.000000	48.500000	5.000000
75%	55.000000	41.750000	72.000000	6.750000
max	59.000000	50.000000	99.000000	9.000000

Selecting columns is nice. We can also add a new column based on another one.

In HIV research it is often important to know how long someone has been living with HIV. However, this dataset contains their current age, and their age at infection. We can use these two to calculate the length.

# first make a new `Series`
years_infected = trial_df['age'] - trial_df['age_initial_infection']

# Then add that series into the table
trial_df['years_infected'] = years_infected
trial_df.head()

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
0	55	26	66	False	3	29
1	48	26	66	False	4	22
2	45	36	32	True	6	9
3	43	31	23	False	5	12
4	40	20	45	True	5	20

# Alternatively
trial_df['years_infected'] = trial_df['age'] - trial_df['age_initial_infection']
trial_df.head()

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
0	55	26	66	False	3	29
1	48	26	66	False	4	22
2	45	36	32	True	6	9
3	43	31	23	False	5	12
4	40	20	45	True	5	20

Acting on Rows#

Indexing#

When selecting rows, or rows and columns, we need to use the .loc attribute of the DataFrame.

We can select by row number.

trial_df.loc[0]

age                         55
age_initial_infection       26
initial_viral_load          66
treatment                False
weeks_to_failure             3
years_infected              29
Name: 0, dtype: object

# We can use a : to indicate a range.
trial_df.loc[0:10]

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
0	55	26	66	False	3	29
1	48	26	66	False	4	22
2	45	36	32	True	6	9
3	43	31	23	False	5	12
4	40	20	45	True	5	20
5	42	20	57	True	9	22
6	55	31	23	False	4	24
7	56	50	22	False	4	6
8	59	33	33	False	5	26
9	51	30	49	True	7	21
10	55	21	94	False	3	34

# We can provide an arbitrary list
trial_df.loc[[0, 5, 7, 13]]

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
0	55	26	66	False	3	29
5	42	20	57	True	9	22
7	56	50	22	False	4	6
13	48	41	99	False	3	7

# We can also select columns at the same time.
trial_df.loc[[0, 5, 7, 13], ['initial_viral_load', 'age']]

	initial_viral_load	age
0	66	55
5	57	42
7	22	56
13	99	48

Boolean Indexing#

If we do not know the row number ahead of time, but instead want to select rows based on their values, we can using boolean indexing. In this stragey we create a new pd.Series of True/False values where True corresponds to the ones we want.

Start by finding everyone over 50 years old.

age_mask = trial_df['age'] > 50
aged_samples = trial_df.loc[age_mask]
aged_samples.head()

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
0	55	26	66	False	3	29
6	55	31	23	False	4	24
7	56	50	22	False	4	6
8	59	33	33	False	5	26
9	51	30	49	True	7	21

Note

I often use the suffix _mask when I create boolean indexes. It is not required, but utilizing naming conventions makes your code easier to understand by yourself and others.

Now, if we also wanted to split by the initial_viral_load we might do:

high_vl_mask = trial_df['initial_viral_load'] > 50

aged_high_vl = trial_df.loc[age_mask & high_vl_mask]
aged_high_vl.head()

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
0	55	26	66	False	3	29
10	55	21	94	False	3	34
11	53	42	85	True	5	11
14	56	41	59	False	6	15
24	51	43	88	False	2	8

# ~ can be used to say "not"
aged_low_vl = trial_df.loc[age_mask & ~high_vl_mask]
aged_low_vl.head()

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
6	55	31	23	False	4	24
7	56	50	22	False	4	6
8	59	33	33	False	5	26
9	51	30	49	True	7	21
15	53	47	38	True	7	6

Q3: Calculate the average weeks to failure for the treated population?#

treated_mask = trial_df['treatment'] == True  # SOLUTION NO PROMPT
treated_average_weeks = trial_df.loc[treated_mask, 'weeks_to_failure'].mean()  # SOLUTION

grader.check("q3_treated_weeks_to_failure_indexing")

Utilizing boolean indexing you can express any algorithmic row selecting strategy. This can even include comparisons between rows, for example if there were multiple rows of the same sample. We will cover these strategies later in the course.

Sometimes, our searches are simple. Pandas also includes another method for indexing rows called .query() for these purposes.

Querying#

.query() is an interface that facilitates simple queries qith a few specific limitations:

It can only use the information present in the row.
It can only work on one row at a time.
Column headers cannot contain spaces, dots, dashes, commas, or emoji.

Our questions on this dataset easily fit within those constraints.

# All treatment rows
trial_df.query('treatment == True').head()

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
2	45	36	32	True	6	9
4	40	20	45	True	5	20
5	42	20	57	True	9	22
9	51	30	49	True	7	21
11	53	42	85	True	5	11

trial_df.query('treatment == False').head()

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
0	55	26	66	False	3	29
1	48	26	66	False	4	22
3	43	31	23	False	5	12
6	55	31	23	False	4	24
7	56	50	22	False	4	6

You can also make them more complex.

trial_df.query('age > 33 & treatment == True')

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
2	45	36	32	True	6	9
4	40	20	45	True	5	20
5	42	20	57	True	9	22
9	51	30	49	True	7	21
11	53	42	85	True	5	11
12	40	34	27	True	8	6
15	53	47	38	True	7	6
16	57	41	42	True	8	16
21	48	37	99	True	8	11
23	43	34	48	True	7	9
27	45	25	87	True	5	20
29	51	43	38	True	8	8

This statement doesn’t make a “biological sense”, but it is an example of a valid comparison.

trial_df.query('age >= initial_viral_load')

	age	age_initial_infection	initial_viral_load	treatment	weeks_to_failure	years_infected
2	45	36	32	True	6	9
3	43	31	23	False	5	12
6	55	31	23	False	4	24
7	56	50	22	False	4	6
8	59	33	33	False	5	26
9	51	30	49	True	7	21
12	40	34	27	True	8	6
15	53	47	38	True	7	6
16	57	41	42	True	8	16
18	51	42	25	False	2	9
19	55	46	45	False	1	9
22	51	27	36	False	2	24
28	59	40	49	False	5	19
29	51	43	38	True	8	8

Q4: Calculate the average `weeks_to_failure` for the untreated population?#

# BEGIN SOLUTION NO PROMPT

wanted_samples = trial_df.query('treatment == False')

# END SOLUTION

untreated_average_weeks = wanted_samples['weeks_to_failure'].mean()  # SOLUTION

print(f'Untreated participants took {untreated_average_weeks:0.1f} weeks to rebound.')

Untreated participants took 3.6 weeks to rebound.

grader.check("q4_untreated_weeks_to_failure")

Q4: Calculate the average `weeks_to_failure` for the treated population?#

# BEGIN SOLUTION NO PROMPT

wanted_samples = trial_df.query('treatment == True')

# END SOLUTION

treated_average_weeks = wanted_samples['weeks_to_failure'].mean()  # SOLUTION

print(f'Treated patients took {treated_average_weeks:0.1f} weeks to rebound.')

Treated patients took 6.9 weeks to rebound.

grader.check("q4_treated_weeks_to_failure")

Conclusion#

We can see that this treatment extended the average time off ART from ~3 weeks to ~7 weeks. While not a complete cure, any incremental step is useful progress in the elimination of HIV.

In the lab you will use similar techniques to explore whether other factors in this dataset impact the results. In future weeks we will explore statistical techniques to understand whether this difference is due to chance, or due to the effect of the treatment.

Walkthrough

Contents

Walkthrough#

Introduction#

Learning Objectives#

Dataset Reference#

Imports#

Numpy#

Pandas#

Acting on Columns#

Q1: Extract the initial_viral_load column ?#

Q2: Calculate the average weeks_to_failure for the whole population?#

Acting on Rows#

Indexing#

Boolean Indexing#

Q3: Calculate the average weeks to failure for the treated population?#

Querying#

Q4: Calculate the average weeks_to_failure for the untreated population?#

Q4: Calculate the average weeks_to_failure for the treated population?#

Conclusion#

Q1: Extract the `initial_viral_load` column ?#

Q2: Calculate the average `weeks_to_failure` for the whole population?#

Q4: Calculate the average `weeks_to_failure` for the untreated population?#

Q4: Calculate the average `weeks_to_failure` for the treated population?#