Single-Cell RNA-seq
Quality Control
& Cell Filtering
Learn how to identify and remove low-quality cells, doublets, and dying cells from your scRNA-seq dataset — step by step using Python and Scanpy.
Why Quality Control
is Non-Negotiable
Garbage in, garbage out — why bad cells ruin good biology
When you sequence single cells, not every "cell" in your data is actually a healthy, intact cell. Your droplet-based protocol (10x Chromium, Drop-seq, etc.) inevitably captures empty droplets, dying cells, doublets (two cells in one droplet), and debris. If you analyze these alongside real cells, they will distort clustering, inflate variance, and masquerade as biology.
Without QC: Clustering can be driven entirely by technical artifacts. You might report a "novel cell type" that is actually just dying cells or doublets. QC is the most important step in the entire pipeline.
Environment Setup
Installing Python, Scanpy, and all dependencies
We'll use Scanpy — the most popular Python toolkit for single-cell analysis. It's built around the AnnData data structure and integrates seamlessly with the broader scverse ecosystem.
We recommend using conda or mamba for environment management, and running code in Jupyter Lab or VS Code with the Jupyter extension.
# Create a new conda environment
conda create -n scrna python=3.10 -y
conda activate scrna
# Install core packages
pip install scanpy anndata pandas numpy matplotlib seaborn
pip install scipy scikit-learn umap-learn leidenalg
pip install scrublet doubletdetection # doublet detection tools
# Optional but recommended
pip install jupyterlab ipywidgets
pip install cellxgene # interactive exploration
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# ── Scanpy settings ─────────────────────────────────────────
sc.settings.verbosity = 3 # 0=error, 1=warn, 2=info, 3=hint
sc.settings.set_figure_params(
dpi = 100,
facecolor = 'white',
figsize = (6, 4)
)
# Set random seed for reproducibility
np.random.seed(42)
# Verify versions
print(f"Scanpy version: {sc.__version__}")
print(f"AnnData version: {ad.__version__}")
Loading Your scRNA-seq Data
Reading 10x CellRanger output and other common formats
Scanpy can load data from virtually any format. The most common source is 10x Genomics CellRanger output, which produces a folder containing three files: matrix.mtx.gz, barcodes.tsv.gz, and features.tsv.gz.
# ── Option A: Load 10x CellRanger output (most common) ──────
# Path to the folder containing matrix.mtx.gz, barcodes.tsv.gz, features.tsv.gz
adata = sc.read_10x_mtx(
path = "data/sample1/filtered_feature_bc_matrix/",
var_names = 'gene_symbols', # use gene symbols (e.g. "ACTB") not Ensembl IDs
cache = True # cache for faster re-loading
)
# ── Option B: Load 10x HDF5 format (.h5) ────────────────────
adata = sc.read_10x_h5("data/sample1/filtered_feature_bc_matrix.h5")
# ── Option C: Load pre-existing AnnData (.h5ad) ─────────────
adata = sc.read_h5ad("data/my_dataset.h5ad")
# ── Option D: Load from CSV matrix ──────────────────────────
counts_df = pd.read_csv("counts_matrix.csv", index_col=0)
adata = ad.AnnData(X = counts_df.values.T) # transpose: cells × genes
adata.obs_names = counts_df.columns.tolist()
adata.var_names = counts_df.index.tolist()
# ── Inspect the AnnData object ───────────────────────────────
print(adata)
# AnnData object with n_obs × n_vars = 12847 × 33538
# obs: 'barcodes'
# var: 'gene_ids', 'feature_types'
print(f"Cells: {adata.n_obs:,}") # rows = cells
print(f"Genes: {adata.n_vars:,}") # columns = genes
# Make gene names unique (10x can have duplicate names)
adata.var_names_make_unique()
print(adata.var_names[:5]) # preview first 5 gene names
obs: 'barcodes'
var: 'gene_ids', 'feature_types', 'genome'
Cells: 12,847
Genes: 33,538
Index(['MIR1302-2HG', 'FAM138A', 'OR4F5', 'AL627309.1', 'AL627309.3'], dtype='object')
AnnData anatomy: adata.X = count matrix (cells × genes). adata.obs = cell metadata (DataFrame). adata.var = gene metadata (DataFrame). adata.obsm = dimensionality reductions (PCA, UMAP). Everything lives in one object.
The Three Core QC Metrics
What to measure, why it matters, and what thresholds to expect
Every scRNA-seq QC pipeline centers on three fundamental metrics. Understanding what they tell you biologically is crucial before you start filtering.
| Metric | What it measures | Low value means | High value means | Typical range |
|---|---|---|---|---|
| n_genes_by_counts | Number of unique genes detected per cell | Empty droplet or dying cell | Doublet (two cells) | 200 – 6,000 |
| total_counts | Total UMI (unique molecular identifiers) per cell | Low-quality / empty | Doublet or very active cell | 500 – 30,000 |
| pct_counts_mt | % of UMIs from mitochondrial genes | Healthy cell (good!) | Dying/stressed cell | < 5–25% (tissue-dependent) |
Mitochondrial thresholds are tissue-specific! Neurons and cardiomyocytes naturally have high mitochondrial activity. A 25% MT cutoff appropriate for PBMCs may discard most real neurons. Always contextualize your thresholds.
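To make the table concrete, here is the same arithmetic on a tiny made-up count matrix; `sc.pp.calculate_qc_metrics` computes exactly these quantities (plus log1p variants) at scale:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count matrix: 4 cells × 5 genes; pretend the last gene is mitochondrial
counts = rng.poisson(3.0, size=(4, 5))
mt_mask = np.array([False, False, False, False, True])

total_counts = counts.sum(axis=1)             # total UMIs per cell
n_genes_by_counts = (counts > 0).sum(axis=1)  # genes detected per cell
pct_counts_mt = 100 * counts[:, mt_mask].sum(axis=1) / total_counts

for i in range(counts.shape[0]):
    print(f"cell {i}: {total_counts[i]} UMIs, "
          f"{n_genes_by_counts[i]} genes, {pct_counts_mt[i]:.1f}% MT")
```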
Computing QC Metrics with Scanpy
One function to calculate everything you need
# ── Step 1: Annotate mitochondrial genes ────────────────────
# Human MT genes start with "MT-", mouse with "mt-"
adata.var['mt'] = adata.var_names.str.startswith('MT-')
# Also annotate ribosomal genes (optional but useful)
adata.var['ribo'] = adata.var_names.str.startswith(('RPS', 'RPL'))
# How many MT genes were found?
print(f"MT genes found: {adata.var['mt'].sum()}")
# ── Step 2: Calculate QC metrics ────────────────────────────
sc.pp.calculate_qc_metrics(
adata,
qc_vars = ['mt', 'ribo'], # compute % of MT and ribosomal genes
percent_top = [20, 50, 100, 200], # % of counts in top N genes
log1p = True, # also compute log1p versions
inplace = True # add to adata.obs and adata.var
)
# ── Step 3: Inspect what was computed ───────────────────────
print("\nCell-level QC columns (adata.obs):")
print(adata.obs.columns.tolist())
# ['n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts',
# 'log1p_total_counts', 'pct_counts_in_top_20_genes',
# 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt',
# 'total_counts_ribo', 'pct_counts_ribo']
# Summary statistics per metric
qc_cols = ['n_genes_by_counts', 'total_counts', 'pct_counts_mt']
print(adata.obs[qc_cols].describe().round(2))
n_genes_by_counts total_counts pct_counts_mt
count 12847.00 12847.00 12847.00
mean 2145.32 6823.44 8.31
std 1204.57 5221.88 9.47
min 12.00 23.00 0.00
25% 1321.00 3104.00 2.14
50% 1987.00 5410.00 5.82
75% 2834.00 9221.00 12.45
max 9821.00 84320.00 97.23
Visualizing QC Distributions
Always look before you filter — violin plots, scatter plots, and histograms
Never apply thresholds blind. The distribution shapes will tell you a lot about your data quality and help you pick appropriate cutoffs. Use violin plots to see the spread, and scatter plots to identify doublets and dying cells.
# ── 1. Scanpy built-in violin plots ─────────────────────────
sc.pl.violin(
adata,
keys = ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter = 0.4, # show individual cells as dots
multi_panel = True
)
# ── 2. Scatter: total_counts vs n_genes (doublet detection) ─
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts',
color='pct_counts_mt')
# ── 3. Custom matplotlib violin plot (more control) ─────────
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('QC Metric Distributions', fontsize=14, fontweight='bold')
metrics = [
('n_genes_by_counts', 'Genes per Cell', '#e85d3a'),
('total_counts', 'Total UMIs', '#1a8a7a'),
('pct_counts_mt', '% Mitochondrial', '#c07b2a'),
]
for ax, (col, title, color) in zip(axes, metrics):
    data = adata.obs[col]
    ax.violinplot([data], positions=[1],
                  showmedians=True, showextrema=True)
    ax.scatter(
        np.random.normal(1, 0.03, size=min(2000, len(data))),
        np.random.choice(data, min(2000, len(data)), replace=False),
        alpha=0.2, s=2, color=color
    )
    ax.set_title(title, fontweight='bold')
    ax.set_xticks([])
plt.tight_layout()
plt.savefig('qc_violins.pdf', dpi=150, bbox_inches='tight')
plt.show()
# ── 4. Log-scale histogram (great for seeing tail cells) ────
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(adata.obs['total_counts'], bins=100,
color='#1a8a7a', edgecolor='white', linewidth=0.3)
ax.set_xlabel('Total UMI Counts')
ax.set_ylabel('Number of Cells')
ax.set_yscale('log')
ax.axvline(500, color='red', linestyle='--', label='min threshold = 500')
ax.legend()
plt.show()
What to look for: In the total_counts vs n_genes scatter plot, real cells form a diagonal cloud. Cells that fall below the diagonal (high counts but few genes) are low-complexity droplets: ambient RNA, debris, or erythrocytes. Cells at the very top-right are likely doublets.
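A complementary way to flag those low-complexity droplets numerically is the genes-per-UMI ratio on a log scale. This heuristic (and the 0.80 cutoff) is a common rule of thumb rather than part of the pipeline above, so inspect its histogram before acting on it; the `obs` DataFrame below is a toy stand-in for `adata.obs`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.obs; use your real adata.obs in practice
obs = pd.DataFrame({
    "n_genes_by_counts": [1500, 3000, 80],
    "total_counts":      [5000, 12000, 4000],
})

# Complexity: log10(genes) / log10(UMIs). Healthy cells usually score
# above roughly 0.80; low scores flag low-complexity droplets such as
# ambient RNA, debris, or erythrocytes. 0.80 is a rule of thumb only.
obs["complexity"] = (
    np.log10(obs["n_genes_by_counts"]) / np.log10(obs["total_counts"])
)
low_complexity = obs["complexity"] < 0.80
print(obs.round(3))
print(f"Low-complexity cells flagged: {low_complexity.sum()}")
```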
Setting Filtering Thresholds
How to choose sensible cutoffs for your specific dataset
There is no single "correct" threshold — they depend on your tissue, cell type composition, protocol, and sequencing depth. Here are three evidence-based strategies:
# ── Strategy 1: Manual (literature-based, most common) ──────
MIN_GENES = 200 # discard cells with fewer than 200 genes
MAX_GENES = 6000 # discard likely doublets (adjust to your data!)
MIN_COUNTS = 500 # discard very low-count cells
MAX_COUNTS = 40000 # discard outlier high-count cells
MAX_MT_PCT = 20 # discard cells with >20% mitochondrial reads
# ── Strategy 2: MAD-based (data-driven, more principled) ────
# A common rule: filter cells > 3 MADs below/above the median
def mad_filter(data, nmads=3, direction='both'):
    """Return outlier mask using the Median Absolute Deviation (MAD)."""
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    if direction == 'lower':
        return data < (median - nmads * mad)
    elif direction == 'upper':
        return data > (median + nmads * mad)
    else:
        return (data < (median - nmads * mad)) | (data > (median + nmads * mad))
# Compute outliers
low_genes = mad_filter(adata.obs['n_genes_by_counts'], direction='lower')
high_genes = mad_filter(adata.obs['n_genes_by_counts'], direction='upper')
high_mt = mad_filter(adata.obs['pct_counts_mt'], direction='upper')
print(f"Low-gene outliers: {low_genes.sum()} cells")
print(f"High-gene outliers: {high_genes.sum()} cells (possible doublets)")
print(f"High-MT outliers: {high_mt.sum()} cells (dying cells)")
# ── Strategy 3: Fixed + adaptive hybrid ─────────────────────
# Use fixed floor (removes debris) + MAD ceiling (adaptive)
min_count = 500 # hard floor
max_count = adata.obs['total_counts'].median() + \
5 * np.median(np.abs(adata.obs['total_counts'] -
adata.obs['total_counts'].median()))
print(f"\nAdaptive max UMI threshold: {max_count:.0f}")
Filtering Cells
Applying thresholds to remove low-quality and anomalous cells
# ── Always save original cell count first ───────────────────
n_cells_before = adata.n_obs
print(f"Cells before QC: {n_cells_before:,}")
# ── Method 1: Scanpy built-in (recommended for min/max) ─────
# Filter by minimum genes (removes empty droplets)
# NOTE: Methods 1 and 2 are alternatives; pick one. Running Method 1
# first would zero out the low-gene counts in Method 2's breakdown.
sc.pp.filter_cells(adata, min_genes=200)
print(f"After min_genes filter: {adata.n_obs:,} cells")
# ── Method 2: Boolean mask filtering (full control) ─────────
# Build a combined mask — True = cell to KEEP
cell_filter = (
(adata.obs['n_genes_by_counts'] >= 200 ) & # min genes
(adata.obs['n_genes_by_counts'] <= 6000 ) & # max genes (doublet guard)
(adata.obs['total_counts'] >= 500 ) & # min UMIs
(adata.obs['total_counts'] <= 40000 ) & # max UMIs
(adata.obs['pct_counts_mt'] <= 20 ) # max mitochondrial %
)
# How many cells are removed by each filter?
print("\n── Filter breakdown ──")
print(f"Low genes (<200): {(adata.obs['n_genes_by_counts']<200).sum():>5} cells removed")
print(f"High genes (>6000): {(adata.obs['n_genes_by_counts']>6000).sum():>5} cells removed")
print(f"Low counts (<500): {(adata.obs['total_counts']<500).sum():>5} cells removed")
print(f"High counts (>40k): {(adata.obs['total_counts']>40000).sum():>5} cells removed")
print(f"High MT% (>20%): {(adata.obs['pct_counts_mt']>20).sum():>5} cells removed")
# Apply the filter
adata = adata[cell_filter].copy() # .copy() avoids view warnings
print(f"\nCells after QC: {adata.n_obs:,}")
print(f"Cells removed: {n_cells_before - adata.n_obs:,} ({(1 - adata.n_obs/n_cells_before)*100:.1f}%)")
# ── Optional: Doublet detection with Scrublet ───────────────
import scrublet as scr
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets(
min_counts=2, min_cells=3,
min_gene_variability_pctl=85,
n_prin_comps=30
)
# Add results to adata
adata.obs['doublet_score'] = doublet_scores
adata.obs['predicted_doublet'] = predicted_doublets
print(f"Predicted doublets: {predicted_doublets.sum()} ({predicted_doublets.mean()*100:.1f}%)")
# Remove predicted doublets
adata = adata[~adata.obs['predicted_doublet']].copy()
print(f"Cells after doublet removal: {adata.n_obs:,}")
── Filter breakdown ──
Low genes (<200): 628 cells removed
High genes (>6000): 244 cells removed
Low counts (<500): 541 cells removed
High counts (>40k): 89 cells removed
High MT% (>20%): 1,142 cells removed
Cells after QC: 10,203
Cells removed: 2,644 (20.6%)
Predicted doublets: 342 (3.4%)
Cells after doublet removal: 9,861
Filtering Genes
Removing genes detected in too few cells to be meaningful
Genes expressed in only 1–2 cells are biologically uninformative and add noise to downstream analysis (clustering, differential expression). We filter them out to reduce memory usage and improve algorithm performance.
n_genes_before = adata.n_vars
print(f"Genes before filtering: {n_genes_before:,}")
# ── Filter: keep genes expressed in at least 10 cells ───────
# This removes very rare genes while preserving biology
sc.pp.filter_genes(adata, min_cells=10)
print(f"Genes after filtering: {adata.n_vars:,}")
print(f"Genes removed: {n_genes_before - adata.n_vars:,}")
# ── Optionally: remove MT and ribosomal genes from analysis ─
# Some analyses benefit from removing these to focus on gene expression
# variation unrelated to cell stress or translational activity
# Remove mitochondrial genes (optional — dataset-dependent)
non_mt = ~adata.var['mt']
# adata = adata[:, non_mt].copy() # uncomment to remove MT genes
# ── Check gene detection statistics ─────────────────────────
print("\nGene-level QC stats (adata.var):")
print(adata.var[['n_cells_by_counts', 'mean_counts',
'pct_dropout_by_counts']].describe().round(2))
# Plot: how many cells express each gene?
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(adata.var['n_cells_by_counts'], bins=80,
color='#c07b2a', edgecolor='white', linewidth=0.3)
ax.set_xlabel('Number of cells expressing gene')
ax.set_ylabel('Number of genes')
ax.set_yscale('log')
ax.set_title('Gene Detection Histogram')
plt.show()
Genes after filtering: 19,472
Genes removed: 14,066 (42% — mostly unexpressed)
Normalization After QC
Making cells comparable by correcting for sequencing depth differences
After filtering, cells still have different total UMI counts (sequencing depth). A cell with 10,000 UMIs and a cell with 2,000 UMIs can't be compared directly. We normalize to make them comparable.
# ── Save raw counts before normalization! ───────────────────
# Always preserve raw counts — many downstream tools need them
adata.layers['counts'] = adata.X.copy() # backup raw counts
adata.raw = adata # also set .raw for DGE tools
# ── Step 1: Library size normalization ──────────────────────
# Scale each cell to 10,000 total counts (median normalization)
sc.pp.normalize_total(
adata,
target_sum = 1e4, # normalize to 10,000 counts per cell
inplace = True
)
# ── Step 2: Log1p transformation ────────────────────────────
# log(x+1) compresses the dynamic range, makes data more normal
sc.pp.log1p(adata)
# ── Step 3: Identify highly variable genes (HVGs) ───────────
# HVGs carry the most biological signal. Speeds up PCA & clustering.
sc.pp.highly_variable_genes(
    adata,
    n_top_genes = 2000,  # keep the 2,000 most variable genes
    subset = False       # False = annotate, don't remove yet
)
# Alternative: cutoff-based selection. Scanpy ignores the cutoffs
# whenever n_top_genes is set, so use one approach OR the other:
#   sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
print(f"Highly variable genes: {adata.var.highly_variable.sum()}")
# Plot the mean-dispersion relationship
sc.pl.highly_variable_genes(adata)
# ── Step 4: Scale to unit variance (zero mean, unit std) ────
# Required for PCA. This scales every gene; for speed, subset to HVGs
# first with: adata = adata[:, adata.var.highly_variable].copy()
sc.pp.scale(adata, max_value=10) # clip extreme values to 10
print("\nNormalization complete!")
print(f"adata.X now contains: log-normalized, scaled expression")
print(f"adata.layers['counts'] contains: original raw counts")
Don't forget to save raw counts! Always run adata.layers['counts'] = adata.X.copy() before normalizing. Differential expression tools like DESeq2/edgeR need raw integer counts, not normalized values.
Saving Your Filtered Data
Persisting the QC-filtered AnnData object for downstream analysis
# ── Save full AnnData to .h5ad ───────────────────────────────
output_path = "results/sample1_qc_filtered.h5ad"
adata.write_h5ad(output_path, compression='gzip')
print(f"Saved to: {output_path}")
# ── Save QC report as CSV ────────────────────────────────────
qc_report = adata.obs[[
'n_genes_by_counts', 'total_counts',
'pct_counts_mt', 'pct_counts_ribo',
'doublet_score'
]]
qc_report.to_csv("results/qc_metrics_per_cell.csv")
# ── Print final summary ──────────────────────────────────────
print("\n" + "="*50)
print(" QC FILTERING SUMMARY")
print("="*50)
print(f" Cells retained: {adata.n_obs:>8,}")
print(f" Genes retained: {adata.n_vars:>8,}")
print(f" HVGs selected: {adata.var.highly_variable.sum():>8,}")
print(f" Median genes/cell: {adata.obs.n_genes_by_counts.median():6.0f}")
print(f" Median UMI/cell: {adata.obs.total_counts.median():6.0f}")
print(f" Median MT%: {adata.obs.pct_counts_mt.median():6.2f}%")
print("="*50)
print(" ✓ Ready for: PCA → UMAP → Clustering")
print("="*50)
QC FILTERING SUMMARY
==================================================
Cells retained: 9,861
Genes retained: 19,472
HVGs selected: 2,000
Median genes/cell: 2,102
Median UMI/cell: 5,841
Median MT%: 4.21%
==================================================
Ready for: PCA → UMAP → Clustering
==================================================
Best Practices & Common Mistakes
Lessons from real-world scRNA-seq analysis
Always visualize before filtering
Never apply thresholds blindly. Plot violin plots and scatter plots to understand your data's distribution before choosing cutoffs.
Save raw counts as a layer
Store adata.layers['counts'] before any normalization. DESeq2, edgeR, and other DGE tools require raw integer counts.
Run Scrublet for doublets
Computational doublet detection (Scrublet, DoubletFinder) catches doublets the gene-count threshold misses. Use both approaches.
Report how many cells you removed
Always document: cells before QC, cells after each filter, and the final retained count. This is required for Methods sections.
Don't use identical thresholds for all tissues
Heart tissue, neurons, and PBMCs have very different MT% baselines. A 20% MT cutoff is perfect for blood, too strict for brain.
Don't filter too aggressively
Losing 40%+ of cells often indicates over-filtering. A typical QC pass removes 10–25% of cells. Start lenient and tighten if needed.
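The last two practices pair naturally: log every step and let the log warn you about over-filtering. A sketch (`qc_log` is an invented helper, seeded here with the counts from this tutorial's run):

```python
def qc_log(steps, max_loss=0.25):
    """Print a per-step retention table and return the total loss fraction.
    steps: [(label, n_cells), ...] in pipeline order."""
    start = steps[0][1]
    for label, n in steps:
        print(f"{label:<24}{n:>8,} cells ({n / start * 100:5.1f}%)")
    loss = 1 - steps[-1][1] / start
    if loss > max_loss:
        print(f"WARNING: {loss * 100:.0f}% of cells removed;"
              " thresholds may be too aggressive.")
    return loss

loss = qc_log([
    ("Raw barcodes", 12847),
    ("After cell filters", 10203),
    ("After doublet removal", 9861),
])
```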
What Comes Next
Part 1 — Data Loading & Preprocessing
CellRanger output → AnnData object
Part 2 — Quality Control & Cell Filtering (you are here)
QC metrics → filtering → normalization → HVGs
Part 3 — Dimensionality Reduction
PCA → Neighbors graph → UMAP embedding
Part 4 — Clustering & Cell Type Annotation
Leiden clustering → marker genes → cell type labels
Part 5 — Differential Expression
Wilcoxon / DESeq2 → volcano plots → pathway analysis
You're QC ready! Your dataset now has high-quality cells only. The full pipeline was: compute QC metrics → visualize distributions → set thresholds → filter cells → filter genes → normalize → select HVGs → save. Your .h5ad file is ready for Part 3: PCA and UMAP.