Spreadsheet Organization#
Organizing data before loading it into DataFrames is a critical Step 0. As data-scientists, we can consider the form and shape of the data file.
This book only concerns itself with tabular data. While other forms of data such are important in biology such a phylogenti
Data Form#
In the early 1970’s the computer scientist Edgar F. Codd developed a classification system for data tables [Codd, 1970]. While he developed these to describe relational databases they are also important for spreadsheet style databases.
Haphazard Form#
While not directly described by Codd, the haphazard form is a form I find often in biology labs. Data is not in a strict table format but instead organized loosely across cells, often making it difficult to parse programmatically.
A |
B |
---|---|
Gene Name |
BRCA1 |
Expression |
High |
Gene Name |
TP53 |
Expression |
Moderate |
Issues:
This format lacks structure, making automated data extraction error-prone.
It’s equivalent to no normalization.
Non-Normalized (Unnormalized Form)#
Data is stored in a single table but contains redundant or nested data.
Gene |
Sample IDs |
Expression Levels |
---|---|---|
BRCA1 |
S1, S2, S3 |
High, Low, High |
TP53 |
S1, S2, S3 |
Moderate, Low, High |
Issues:
Repetition of information within cells.
Difficult to filter or query specific combinations of data.
Doesn’t separate repeated data into distinct rows.
First Normal Form (1NF)#
Data is organized into a table where each cell contains a single value, and each row is unique.
Gene |
Sample ID |
Expression Level |
---|---|---|
BRCA1 |
S1 |
High |
BRCA1 |
S2 |
Low |
BRCA1 |
S3 |
High |
TP53 |
S1 |
Moderate |
TP53 |
S2 |
Low |
TP53 |
S3 |
High |
Advantages:
Eliminates multi-valued attributes
data is easier to query and process programmatically.
Goal
This is the ideal format for Pandas DataFrames.
Even more forms#
Codd described additional normalization schemes:
UNF: Unnormalized form
1NF: First normal form
2NF: Second normal form
3NF: Third normal form
EKNF: Elementary key normal form
BCNF: Boyce–Codd normal form
4NF: Fourth normal form
ETNF: Essential tuple normal form
5NF: Fifth normal form
DKNF: Domain-key normal form
6NF: Sixth normal form
But these are more relavant to relational databases and are outside the scope of this course.
Data Shape#
There are two data shapes, wide and long.
These two formats will be covered in greater depth in Module 4 with pd.pivot_table
and pd.melt
.
Wide Format#
In wide format, each subject’s repeated measurements (e.g., blood pressure at different time points) are stored in separate columns.
Subject_ID |
BP_Time_0 |
BP_Time_1 |
BP_Time_2 |
---|---|---|---|
001 |
120 |
125 |
130 |
002 |
110 |
115 |
120 |
003 |
130 |
135 |
140 |
Long Format#
In long format, each measurement is stored as a separate row, with additional columns identifying the subject and the time point.
Subject_ID |
Time_Point |
Blood_Pressure |
---|---|---|
001 |
0 |
120 |
001 |
1 |
125 |
001 |
2 |
130 |
002 |
0 |
110 |
002 |
1 |
115 |
002 |
2 |
120 |
003 |
0 |
130 |
003 |
1 |
135 |
003 |
2 |
140 |