Preprocessing/QC

This workshop will cover selection and filtration of cells based on QC metrics, data normalization and scaling, and detection of highly variable features.

A few QC metrics commonly used by the community include:

The number of unique genes detected in each cell. Low-quality cells or empty droplets will often have very few genes, while cell doublets or multiplets may exhibit an aberrantly high gene count.
The total number of molecules detected within a cell, which correlates strongly with the number of unique genes.
The percentage of reads that map to the mitochondrial genome. Low-quality or dying cells often exhibit extensive mitochondrial contamination.
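To make the three metrics concrete, here is a minimal base-R sketch that computes them by hand. The count matrix, its gene names, and its values are made up for illustration and are not PBMC data; the column names simply mirror the ones Seurat uses.

```r
# Toy genes-by-cells count matrix (hypothetical values, not PBMC data)
counts <- matrix(
  c(5, 0, 2,
    0, 0, 1,
    3, 1, 0,
    2, 9, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("CD3D", "MS4A1", "MT-CO1", "MT-ND1"),
    c("cell1", "cell2", "cell3")
  )
)

# Number of unique genes detected per cell (rows with a non-zero count)
n_feature <- colSums(counts > 0)

# Total number of molecules detected per cell
n_count <- colSums(counts)

# Percentage of counts from genes whose names start with "MT-"
mt_rows <- grepl("^MT-", rownames(counts))
percent_mt <- 100 * colSums(counts[mt_rows, , drop = FALSE]) / n_count

# Collect the metrics in one per-cell table
qc <- data.frame(
  nFeature_RNA = n_feature,
  nCount_RNA   = n_count,
  percent.mt   = percent_mt
)
print(qc)
```

In this toy table, cell2 has only two detected genes and all of its counts come from mitochondrial genes, which is the kind of profile that QC filtering is meant to catch.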

Interactive Workshop: Preprocessing/QC

Adding slots to the PBMC data object

The [[]] operator can add columns to object metadata. This is a great place to stash QC stats.

# Adding percent mitochondrial column to metadata data frame 
pbmc[["percent.mt"]] <- PercentFeatureSet(pbmc, pattern = "^MT-")

You can display the object metadata data frame to see QC metrics:

# Show QC metrics for the first 5 cells
head(pbmc@meta.data, 5)

Output:

Violin Plot

A violin plot combines a box plot with a kernel density plot, so it reveals peaks in the data. It is used to visualize the distribution of numerical data. Whereas a box plot shows only summary statistics, a violin plot shows both the summary statistics and the density of each variable.
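As a rough sketch of those two ingredients, the base-R snippet below computes a five-number summary and a kernel density estimate for the same vector. The data are simulated (an assumption for illustration, not values from the PBMC object) with two subpopulations.

```r
set.seed(1)

# Simulated per-cell values (hypothetical): two subpopulations of cells
x <- c(rnorm(200, mean = 800, sd = 100),
       rnorm(100, mean = 1800, sd = 150))

# Box-plot ingredient: five-number summary (min, quartiles, max)
box_stats <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Violin ingredient: kernel density estimate over the range of x
dens <- density(x)

# Location of the tallest density peak
main_mode <- dens$x[which.max(dens$y)]

print(box_stats)
print(main_mode)
```

The summary statistics alone do not reveal that the simulated data have two groups, while the density curve does. In Seurat, VlnPlot draws one such panel per QC metric: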

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

Output:

FeatureScatter

FeatureScatter is typically used to visualize feature-feature relationships, but can be used for anything calculated by the object, e.g. columns in object metadata, PC scores, etc.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

Output:

nCount_RNA and nFeature_RNA are highly correlated (0.95) because cells with more detected molecules usually have more detected genes. percent.mt (percent mitochondrial) and nCount_RNA are only weakly negatively correlated (-0.13), suggesting that mitochondrial content is largely independent of sequencing depth. This helps QC because cutoffs for these metrics can be set separately.
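The sketch below illustrates both ideas on a small stand-in table: computing the Pearson correlations that FeatureScatter reports in each panel title, and then applying independent cutoffs per metric. The table values and the thresholds (min_genes, max_genes, max_mt) are assumptions chosen for illustration, not values prescribed by this workshop.

```r
# Hypothetical per-cell QC table standing in for pbmc@meta.data
meta <- data.frame(
  nCount_RNA   = c(1200, 2500, 4000, 800, 6000, 3000),
  nFeature_RNA = c(600, 1100, 1500, 150, 2800, 1300),
  percent.mt   = c(2.1, 3.5, 1.8, 12.0, 2.9, 4.2),
  row.names    = paste0("cell", 1:6)
)

# Pearson correlations between pairs of QC metrics
cor_depth_genes <- cor(meta$nCount_RNA, meta$nFeature_RNA)
cor_depth_mt    <- cor(meta$nCount_RNA, meta$percent.mt)

# Illustrative cutoffs (assumptions; pick real ones from the plots)
min_genes <- 200
max_genes <- 2500
max_mt    <- 5

# Each metric gets its own cutoff; a cell must pass all of them
keep <- meta$nFeature_RNA > min_genes &
  meta$nFeature_RNA < max_genes &
  meta$percent.mt < max_mt
filtered <- meta[keep, ]

print(filtered)
```

On a Seurat object, the analogous filter would be written with subset(), for example subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5), with thresholds adjusted to the distributions seen in the violin and scatter plots.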

Last updated on Jul 16, 2025