Part 2 — Complete Beginner's Guide

Single-Cell RNA-seq Quality Control & Cell Filtering

Learn how to identify and remove low-quality cells, doublets, and dying cells from your scRNA-seq dataset — step by step using Python and Scanpy.

Python 3.9+ · Scanpy 1.9+ · AnnData · 10x Genomics · ~50 min read

Why Quality Control is Non-Negotiable

Garbage in, garbage out — why bad cells ruin good biology

When you sequence single cells, not every "cell" in your data is actually a healthy, intact cell. Your droplet-based protocol (10x Chromium, Drop-seq, etc.) inevitably captures empty droplets, dying cells, doublets (two cells in one droplet), and debris. If you analyze these alongside real cells, you'll see:

Dying Cells (HIGH MT%): High mitochondrial gene expression. When the membrane ruptures, cytoplasmic RNA leaks out while mitochondria-bound transcripts remain, inflating the MT fraction.
Empty Droplets (LOW COUNTS): Droplets with no real cell, just ambient RNA from the solution. Very few genes, very few counts.
Doublets (HIGH GENES): Two cells captured together. They appear as one "super cell" with abnormally high gene and UMI counts.
Low-quality Cells (OUTLIER): Stressed or damaged cells with unusual gene expression profiles that don't represent real biology.

Without QC: Clustering can be driven entirely by technical artifacts. You might report a "novel cell type" that is actually just dying cells or doublets. QC is the most important step in the entire pipeline.
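The four failure modes above map onto simple numeric signatures. A minimal, illustrative classifier sketch (the thresholds are placeholders echoing the filters used later in this guide, not recommendations):

```python
def flag_cell(n_genes, total_counts, pct_mt,
              min_genes=200, min_counts=500, max_genes=6000, max_mt=20):
    """Classify a barcode by the QC signatures described above.
    All thresholds are illustrative placeholders, not recommendations."""
    if pct_mt > max_mt:
        return "likely dying (high MT%)"
    if n_genes < min_genes or total_counts < min_counts:
        return "likely empty droplet / debris (low counts)"
    if n_genes > max_genes:
        return "possible doublet (high gene count)"
    return "pass"

print(flag_cell(n_genes=150,  total_counts=300,   pct_mt=2))
print(flag_cell(n_genes=8900, total_counts=60000, pct_mt=5))
print(flag_cell(n_genes=2400, total_counts=7000,  pct_mt=4))
```

In practice these rules are applied as vectorized masks on `adata.obs`, as shown later in this guide; the function form just makes the logic explicit.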

Environment Setup

Installing Python, Scanpy, and all dependencies

We'll use Scanpy — the most popular Python toolkit for single-cell analysis. It's built around the AnnData data structure and integrates seamlessly with the broader scverse ecosystem.

We recommend using conda or mamba for environment management, and running code in Jupyter Lab or VS Code with the Jupyter extension.

terminal bash
# Create a new conda environment
conda create -n scrna python=3.10 -y
conda activate scrna

# Install core packages
pip install scanpy anndata pandas numpy matplotlib seaborn
pip install scipy scikit-learn umap-learn leidenalg
pip install scrublet doubletdetection  # doublet detection tools

# Optional but recommended
pip install jupyterlab ipywidgets
pip install cellxgene  # interactive exploration
00_setup.py python
import scanpy as sc
import anndata as ad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ── Scanpy settings ─────────────────────────────────────────
sc.settings.verbosity = 3             # 0=error, 1=warn, 2=info, 3=hint
sc.settings.set_figure_params(
    dpi       = 100,
    facecolor = 'white',
    figsize   = (6, 4)
)

# Set random seed for reproducibility
np.random.seed(42)

# Verify versions
print(f"Scanpy version: {sc.__version__}")
print(f"AnnData version: {ad.__version__}")

Loading Your scRNA-seq Data

Reading 10x CellRanger output and other common formats

Scanpy can load data from virtually any format. The most common source is 10x Genomics CellRanger output, which produces a folder containing three files: matrix.mtx.gz, barcodes.tsv.gz, and features.tsv.gz.

01_load_data.py python
# ── Option A: Load 10x CellRanger output (most common) ──────
# Path to the folder containing matrix.mtx.gz, barcodes.tsv.gz, features.tsv.gz
adata = sc.read_10x_mtx(
    path       = "data/sample1/filtered_feature_bc_matrix/",
    var_names  = 'gene_symbols',   # use gene symbols (e.g. "ACTB") not Ensembl IDs
    cache      = True               # cache for faster re-loading
)

# ── Option B: Load 10x HDF5 format (.h5) ────────────────────
adata = sc.read_10x_h5("data/sample1/filtered_feature_bc_matrix.h5")

# ── Option C: Load pre-existing AnnData (.h5ad) ─────────────
adata = sc.read_h5ad("data/my_dataset.h5ad")

# ── Option D: Load from CSV matrix ──────────────────────────
counts_df = pd.read_csv("counts_matrix.csv", index_col=0)
adata = ad.AnnData(X = counts_df.values.T)  # transpose: cells × genes
adata.obs_names = counts_df.columns.tolist()
adata.var_names = counts_df.index.tolist()

# ── Inspect the AnnData object ───────────────────────────────
print(adata)
# AnnData object with n_obs × n_vars = 12847 × 33538
#     obs: 'barcodes'
#     var: 'gene_ids', 'feature_types'

print(f"Cells: {adata.n_obs:,}")    # rows = cells
print(f"Genes: {adata.n_vars:,}")    # columns = genes

# Make gene names unique (10x can have duplicate names)
adata.var_names_make_unique()
print(adata.var_names[:5])  # preview first 5 gene names
output
AnnData object with n_obs × n_vars = 12847 × 33538
obs: 'barcodes'
var: 'gene_ids', 'feature_types', 'genome'
Cells: 12,847
Genes: 33,538
Index(['MIR1302-2HG', 'FAM138A', 'OR4F5', 'AL627309.1', 'AL627309.3'], dtype='object')

AnnData anatomy: adata.X = count matrix (cells × genes). adata.obs = cell metadata (DataFrame). adata.var = gene metadata (DataFrame). adata.obsm = dimensionality reductions (PCA, UMAP). Everything lives in one object.

The Three Core QC Metrics

What to measure, why it matters, and what thresholds to expect

Every scRNA-seq QC pipeline centers on three fundamental metrics. Understanding what they tell you biologically is crucial before you start filtering.

Metric            | What it measures                                   | Low value means             | High value means            | Typical range
n_genes_by_counts | Number of unique genes detected per cell           | Empty droplet or dying cell | Doublet (two cells)         | 200 – 6,000
total_counts      | Total UMIs (unique molecular identifiers) per cell | Low-quality / empty         | Doublet or very active cell | 500 – 30,000
pct_counts_mt     | % of UMIs from mitochondrial genes                 | Healthy cell (good!)        | Dying/stressed cell         | < 5–25% (tissue-dependent)

Mitochondrial thresholds are tissue-specific! Neurons and cardiomyocytes naturally have high mitochondrial activity. A 25% MT cutoff appropriate for PBMCs may discard most real neurons. Always contextualize your thresholds.
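Because MT% baselines differ by tissue, it can help to keep the ceiling in a small lookup rather than hard-coding one number. A sketch with illustrative ballpark values (placeholders, not recommendations; always check your own distributions):

```python
# Illustrative MT% ceilings by tissue — placeholder values, verify on your data
MT_THRESHOLDS = {
    'pbmc':  20,   # blood: low MT% baseline
    'brain': 40,   # neurons naturally run higher
    'heart': 50,   # cardiomyocytes are mitochondria-rich
}

def mt_cutoff(tissue, default=20):
    """Look up a tissue-aware MT% ceiling, falling back to a default."""
    return MT_THRESHOLDS.get(tissue.lower(), default)

print(mt_cutoff('PBMC'))    # blood threshold
print(mt_cutoff('Brain'))   # neuron-friendly threshold
print(mt_cutoff('liver'))   # not in the table: falls back to default
```

The cutoff would then feed into the filtering mask, e.g. `adata.obs['pct_counts_mt'] <= mt_cutoff(tissue)`.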

Computing QC Metrics with Scanpy

One function to calculate everything you need

02_compute_qc.py python
# ── Step 1: Annotate mitochondrial genes ────────────────────
# Human MT genes start with "MT-", mouse with "mt-"
adata.var['mt'] = adata.var_names.str.startswith('MT-')

# Also annotate ribosomal genes (optional but useful)
adata.var['ribo'] = adata.var_names.str.startswith(('RPS', 'RPL'))

# How many MT genes were found?
print(f"MT genes found: {adata.var['mt'].sum()}")

# ── Step 2: Calculate QC metrics ────────────────────────────
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars   = ['mt', 'ribo'],  # compute % of MT and ribosomal genes
    percent_top = [20, 50, 100, 200],  # % of counts in top N genes
    log1p     = True,              # also compute log1p versions
    inplace   = True               # add to adata.obs and adata.var
)

# ── Step 3: Inspect what was computed ───────────────────────
print("\nCell-level QC columns (adata.obs):")
print(adata.obs.columns.tolist())
# ['n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts',
#  'log1p_total_counts', 'pct_counts_in_top_20_genes',
#  'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt',
#  'total_counts_ribo', 'pct_counts_ribo']

# Summary statistics per metric
qc_cols = ['n_genes_by_counts', 'total_counts', 'pct_counts_mt']
print(adata.obs[qc_cols].describe().round(2))
output
MT genes found: 37

       n_genes_by_counts  total_counts  pct_counts_mt
count           12847.00      12847.00       12847.00
mean             2145.32       6823.44           8.31
std              1204.57       5221.88           9.47
min                12.00         23.00           0.00
25%              1321.00       3104.00           2.14
50%              1987.00       5410.00           5.82
75%              2834.00       9221.00          12.45
max              9821.00      84320.00          97.23

Visualizing QC Distributions

Always look before you filter — violin plots, scatter plots, and histograms

Never apply thresholds blind. The distribution shapes will tell you a lot about your data quality and help you pick appropriate cutoffs. Use violin plots to see the spread, and scatter plots to identify doublets and dying cells.

03_visualize_qc.py python
# ── 1. Scanpy built-in violin plots ─────────────────────────
sc.pl.violin(
    adata,
    keys    = ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
    jitter  = 0.4,    # show individual cells as dots
    multi_panel = True
)

# ── 2. Scatter: total_counts vs n_genes (doublet detection) ─
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts',
               color='pct_counts_mt')

# ── 3. Custom matplotlib violin plot (more control) ─────────
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('QC Metric Distributions', fontsize=14, fontweight='bold')

metrics = [
    ('n_genes_by_counts', 'Genes per Cell',   '#e85d3a'),
    ('total_counts',      'Total UMIs',        '#1a8a7a'),
    ('pct_counts_mt',     '% Mitochondrial',   '#c07b2a'),
]

for ax, (col, title, color) in zip(axes, metrics):
    data = adata.obs[col]
    ax.violinplot([data], positions=[1],
                  showmedians=True, showextrema=True)
    ax.scatter(
        np.random.normal(1, 0.03, size=min(2000, len(data))),
        np.random.choice(data, min(2000, len(data)), replace=False),
        alpha=0.2, s=2, color=color
    )
    ax.set_title(title, fontweight='bold')
    ax.set_xticks([])

plt.tight_layout()
plt.savefig('qc_violins.pdf', dpi=150, bbox_inches='tight')
plt.show()

# ── 4. Log-scale histogram (great for seeing tail cells) ────
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(adata.obs['total_counts'], bins=100,
        color='#1a8a7a', edgecolor='white', linewidth=0.3)
ax.set_xlabel('Total UMI Counts')
ax.set_ylabel('Number of Cells')
ax.set_yscale('log')
ax.axvline(500, color='red', linestyle='--', label='min threshold = 500')
ax.legend()
plt.show()

What to look for: In the total_counts vs n_genes scatter plot, real cells form a diagonal cloud. Cells far below the diagonal (high counts but few genes) are low-complexity barcodes: likely debris or cells dominated by a handful of transcripts. Barcodes at the bottom-left (few counts, few genes) are likely empty droplets. Cells at the very top-right are likely doublets.
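One way to quantify "distance from the diagonal" is library complexity: genes detected per UMI. A sketch in plain NumPy (the 0.1 cutoff is an illustrative placeholder, not a standard value):

```python
import numpy as np

def low_complexity_mask(n_genes, total_counts, min_ratio=0.1):
    """Flag low-complexity barcodes: many UMIs but few distinct genes.
    min_ratio is an illustrative placeholder — tune it on your own data."""
    n_genes      = np.asarray(n_genes, dtype=float)
    total_counts = np.asarray(total_counts, dtype=float)
    ratio = n_genes / np.maximum(total_counts, 1)   # guard against zero counts
    return ratio < min_ratio

# Toy example: 3 barcodes — the middle one has 9,000 UMIs from only 300 genes
genes  = [2400,  300, 1800]
counts = [7000, 9000, 6000]
print(low_complexity_mask(genes, counts))  # [False  True False]
```

In Scanpy terms you would pass `adata.obs['n_genes_by_counts']` and `adata.obs['total_counts']` and use the returned mask alongside the other filters.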

Example QC distribution summary — 12,847 cells before filtering:
Median genes/cell: 1,987
Median UMIs/cell: 5,410
Cells with MT% > 25%: 1,142
Suspected doublets: 387
Very low-count cells: 628

Setting Filtering Thresholds

How to choose sensible cutoffs for your specific dataset

There is no single "correct" threshold — they depend on your tissue, cell type composition, protocol, and sequencing depth. Here are three evidence-based strategies:

04_set_thresholds.py python
# ── Strategy 1: Manual (literature-based, most common) ──────
MIN_GENES  = 200     # discard cells with fewer than 200 genes
MAX_GENES  = 6000    # discard likely doublets (adjust to your data!)
MIN_COUNTS = 500     # discard very low-count cells
MAX_COUNTS = 40000   # discard outlier high-count cells
MAX_MT_PCT = 20      # discard cells with >20% mitochondrial reads

# ── Strategy 2: MAD-based (data-driven, more principled) ────
# A common rule: filter cells > 3 MADs below/above the median
def mad_filter(data, nmads=3, direction='both'):
    """Return outlier mask using Median Absolute Deviation."""
    median = np.median(data)
    mad    = np.median(np.abs(data - median))
    if direction == 'lower':
        return data < (median - nmads * mad)
    elif direction == 'upper':
        return data > (median + nmads * mad)
    else:
        return (data < (median - nmads * mad)) | (data > (median + nmads * mad))

# Compute outliers
low_genes  = mad_filter(adata.obs['n_genes_by_counts'],  direction='lower')
high_genes = mad_filter(adata.obs['n_genes_by_counts'],  direction='upper')
high_mt    = mad_filter(adata.obs['pct_counts_mt'],       direction='upper')

print(f"Low-gene outliers:  {low_genes.sum()} cells")
print(f"High-gene outliers: {high_genes.sum()} cells (possible doublets)")
print(f"High-MT outliers:   {high_mt.sum()} cells (dying cells)")

# ── Strategy 3: Fixed + adaptive hybrid ─────────────────────
# Use fixed floor (removes debris) + MAD ceiling (adaptive)
min_count = 500    # hard floor
max_count = adata.obs['total_counts'].median() + \
             5 * np.median(np.abs(adata.obs['total_counts'] -
                               adata.obs['total_counts'].median()))
print(f"\nAdaptive max UMI threshold: {max_count:.0f}")
Example thresholds — which cells pass/fail each filter:
n_genes = 180 → ✗ FAIL (below min genes)
n_genes = 2,400 → ✓ PASS
pct_mt = 35% → ✗ FAIL (above MT ceiling)
pct_mt = 4% → ✓ PASS
n_genes = 8,900 → ✗ DOUBLET (above max genes)

Filtering Cells

Applying thresholds to remove low-quality and anomalous cells

05_filter_cells.py python
# ── Always save original cell count first ───────────────────
n_cells_before = adata.n_obs
print(f"Cells before QC: {n_cells_before:,}")

# ── Method 1: Scanpy built-in (simple min/max filters) ─────
# sc.pp.filter_cells(adata, min_genes=200)   # removes empty droplets
# NOTE: Methods 1 and 2 are alternatives; run one, not both. We use
# the boolean mask below so each filter's breakdown is counted on
# the unfiltered data.

# ── Method 2: Boolean mask filtering (full control) ─────────
# Build a combined mask — True = cell to KEEP
cell_filter = (
    (adata.obs['n_genes_by_counts']  >= 200   ) &   # min genes
    (adata.obs['n_genes_by_counts']  <= 6000  ) &   # max genes (doublet guard)
    (adata.obs['total_counts']        >= 500   ) &   # min UMIs
    (adata.obs['total_counts']        <= 40000 ) &   # max UMIs
    (adata.obs['pct_counts_mt']       <= 20    )     # max mitochondrial %
)

# How many cells are removed by each filter?
print("\n── Filter breakdown ──")
print(f"Low genes (<200):   {(adata.obs['n_genes_by_counts']<200).sum():>5} cells removed")
print(f"High genes (>6000): {(adata.obs['n_genes_by_counts']>6000).sum():>5} cells removed")
print(f"Low counts (<500):  {(adata.obs['total_counts']<500).sum():>5} cells removed")
print(f"High counts (>40k): {(adata.obs['total_counts']>40000).sum():>5} cells removed")
print(f"High MT% (>20%):    {(adata.obs['pct_counts_mt']>20).sum():>5} cells removed")

# Apply the filter
adata = adata[cell_filter].copy()  # .copy() avoids view warnings

print(f"\nCells after QC:  {adata.n_obs:,}")
print(f"Cells removed:   {n_cells_before - adata.n_obs:,} ({(1 - adata.n_obs/n_cells_before)*100:.1f}%)")

# ── Optional: Doublet detection with Scrublet ───────────────
import scrublet as scr

scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets(
    min_counts=2, min_cells=3,
    min_gene_variability_pctl=85,
    n_prin_comps=30
)

# Add results to adata
adata.obs['doublet_score']     = doublet_scores
adata.obs['predicted_doublet'] = predicted_doublets

print(f"Predicted doublets: {predicted_doublets.sum()} ({predicted_doublets.mean()*100:.1f}%)")

# Remove predicted doublets
adata = adata[~adata.obs['predicted_doublet']].copy()
print(f"Cells after doublet removal: {adata.n_obs:,}")
output
Cells before QC: 12,847
── Filter breakdown ──
Low genes (<200): 628 cells removed
High genes (>6000): 244 cells removed
Low counts (<500): 541 cells removed
High counts (>40k): 89 cells removed
High MT% (>20%): 1,142 cells removed

Cells after QC: 10,203
Cells removed: 2,644 (20.6%)
Predicted doublets: 342 (3.4%)
Cells after doublet removal: 9,861

Filtering Genes

Removing genes detected in too few cells to be meaningful

Genes expressed in only 1–2 cells are biologically uninformative and add noise to downstream analysis (clustering, differential expression). We filter them out to reduce memory usage and improve algorithm performance.

06_filter_genes.py python
n_genes_before = adata.n_vars
print(f"Genes before filtering: {n_genes_before:,}")

# ── Filter: keep genes expressed in at least 10 cells ───────
# This removes very rare genes while preserving biology
sc.pp.filter_genes(adata, min_cells=10)
print(f"Genes after filtering: {adata.n_vars:,}")
print(f"Genes removed:         {n_genes_before - adata.n_vars:,} "
      f"({(n_genes_before - adata.n_vars) / n_genes_before:.0%}, mostly unexpressed)")

# ── Optionally: remove MT and ribosomal genes from analysis ─
# Some analyses benefit from removing these to focus on gene expression
# variation unrelated to cell stress or translational activity

# Remove mitochondrial genes (optional — dataset-dependent)
non_mt = ~adata.var['mt']
# adata = adata[:, non_mt].copy()  # uncomment to remove MT genes

# ── Check gene detection statistics ─────────────────────────
print("\nGene-level QC stats (adata.var):")
print(adata.var[['n_cells_by_counts', 'mean_counts',
                  'pct_dropout_by_counts']].describe().round(2))

# Plot: how many cells express each gene?
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(adata.var['n_cells_by_counts'], bins=80,
        color='#c07b2a', edgecolor='white', linewidth=0.3)
ax.set_xlabel('Number of cells expressing gene')
ax.set_ylabel('Number of genes')
ax.set_yscale('log')
ax.set_title('Gene Detection Histogram')
plt.show()
output
Genes before filtering: 33,538
Genes after filtering: 19,472
Genes removed: 14,066 (42%, mostly unexpressed)

Normalization After QC

Making cells comparable by correcting for sequencing depth differences

After filtering, cells still have different total UMI counts (sequencing depth). A cell with 10,000 UMIs and a cell with 2,000 UMIs can't be compared directly. We normalize to make them comparable.
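The arithmetic is simple enough to verify by hand. A toy example in plain NumPy, mimicking what sc.pp.normalize_total does when target_sum is fixed (toy numbers, not real data):

```python
import numpy as np

# Two toy cells with the same expression proportions but different depth
deep    = np.array([5000., 3000., 2000.])   # 10,000 UMIs total
shallow = np.array([1000.,  600.,  400.])   #  2,000 UMIs total

def normalize_total(counts, target_sum=1e4):
    """Scale one cell's counts so they sum to target_sum."""
    return counts / counts.sum() * target_sum

print(normalize_total(deep))     # [5000. 3000. 2000.]
print(normalize_total(shallow))  # [5000. 3000. 2000.] — now directly comparable
```

After normalization both cells have identical profiles, which is exactly what we want: the 5× depth difference was purely technical.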

07_normalize.py python
# ── Save raw counts before normalization! ───────────────────
# Always preserve raw counts — many downstream tools need them
adata.layers['counts'] = adata.X.copy()   # backup raw counts
adata.raw = adata                           # also set .raw for DGE tools

# ── Step 1: Library size normalization ──────────────────────
# Scale each cell to 10,000 total counts (median normalization)
sc.pp.normalize_total(
    adata,
    target_sum = 1e4,      # normalize to 10,000 counts per cell
    inplace    = True
)

# ── Step 2: Log1p transformation ────────────────────────────
# log(x+1) compresses the dynamic range, makes data more normal
sc.pp.log1p(adata)

# ── Step 3: Identify highly variable genes (HVGs) ───────────
# HVGs carry the most biological signal. Speeds up PCA & clustering.
sc.pp.highly_variable_genes(
    adata,
    min_mean     = 0.0125,  # minimum average expression
    max_mean     = 3,        # maximum average expression
    min_disp     = 0.5,      # minimum normalized dispersion
    n_top_genes  = 2000,     # NOTE: if set, this overrides the cutoffs above
    subset       = False     # False = annotate only, don't remove yet
)

print(f"Highly variable genes: {adata.var.highly_variable.sum()}")

# Plot the mean-dispersion relationship
sc.pl.highly_variable_genes(adata)

# ── Step 4: Scale to unit variance (zero mean, unit std) ────
# Required for PCA. (To scale only the HVGs, subset to them first.)
sc.pp.scale(adata, max_value=10)    # clip extreme values to 10

print("\nNormalization complete!")
print("adata.X now contains: log-normalized, scaled expression")
print("adata.layers['counts'] contains: original raw counts")

Don't forget to save raw counts! Always run adata.layers['counts'] = adata.X.copy() before normalizing. Differential expression tools like DESeq2/edgeR need raw integer counts, not normalized values.

Saving Your Filtered Data

Persisting the QC-filtered AnnData object for downstream analysis

08_save_results.py python
# ── Save full AnnData to .h5ad ───────────────────────────────
output_path = "results/sample1_qc_filtered.h5ad"
adata.write_h5ad(output_path, compression='gzip')
print(f"Saved to: {output_path}")

# ── Save QC report as CSV ────────────────────────────────────
qc_report = adata.obs[[
    'n_genes_by_counts', 'total_counts',
    'pct_counts_mt',     'pct_counts_ribo',
    'doublet_score'
]]
qc_report.to_csv("results/qc_metrics_per_cell.csv")

# ── Print final summary ──────────────────────────────────────
print("\n" + "="*50)
print("  QC FILTERING SUMMARY")
print("="*50)
print(f"  Cells retained:  {adata.n_obs:>8,}")
print(f"  Genes retained:  {adata.n_vars:>8,}")
print(f"  HVGs selected:   {adata.var.highly_variable.sum():>8,}")
print(f"  Median genes/cell: {adata.obs.n_genes_by_counts.median():6.0f}")
print(f"  Median UMI/cell:   {adata.obs.total_counts.median():6.0f}")
print(f"  Median MT%:        {adata.obs.pct_counts_mt.median():6.2f}%")
print("="*50)
print("  ✓ Ready for: PCA → UMAP → Clustering")
print("="*50)
output
==================================================
QC FILTERING SUMMARY
==================================================
Cells retained: 9,861
Genes retained: 19,472
HVGs selected: 2,000
Median genes/cell: 2,102
Median UMI/cell: 5,841
Median MT%: 4.21%
==================================================
Ready for: PCA → UMAP → Clustering
==================================================

Best Practices & Common Mistakes

Lessons from real-world scRNA-seq analysis

DO 01

Always visualize before filtering

Never apply thresholds blindly. Plot violin plots and scatter plots to understand your data's distribution before choosing cutoffs.

DO 02

Save raw counts as a layer

Store adata.layers['counts'] before any normalization. DESeq2, edgeR, and other DGE tools require raw integer counts.

DO 03

Run Scrublet for doublets

Computational doublet detection (Scrublet, DoubletFinder) catches doublets the gene-count threshold misses. Use both approaches.

DO 04

Report how many cells you removed

Always document: cells before QC, cells after each filter, and the final retained count. This is required for Methods sections.
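A lightweight way to keep that record is to log the cell count after each step into a small table. A sketch using the counts from this tutorial's example run (substitute your own adata.n_obs values at each stage):

```python
import pandas as pd

# Per-step cell counts — these mirror the example run in this guide;
# replace them with your own adata.n_obs after each filtering step
steps = [
    ('raw barcodes',          12847),
    ('after gene/UMI/MT gates', 10203),
    ('after doublet removal',   9861),
]

report = pd.DataFrame(steps, columns=['step', 'cells'])
report['removed']    = report['cells'].shift(1) - report['cells']
report['retained %'] = (report['cells'] / report['cells'].iloc[0] * 100).round(1)
print(report.to_string(index=False))
```

The resulting table drops straight into a Methods section or supplementary file.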

AVOID 01

Don't use identical thresholds for all tissues

Heart tissue, neurons, and PBMCs have very different MT% baselines. A 20% MT cutoff is perfect for blood, too strict for brain.

AVOID 02

Don't filter too aggressively

Losing 40%+ of cells often indicates over-filtering. A good QC should remove 10–25% of cells. Start lenient and tighten if needed.
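That rule of thumb is easy to turn into an automated sanity check after filtering (the 25% ceiling below is just the guideline above, not a hard rule):

```python
def check_retention(n_before, n_after, max_removed_frac=0.25):
    """Warn when QC removes more cells than the rule-of-thumb ceiling."""
    removed = 1 - n_after / n_before
    if removed > max_removed_frac:
        return f"WARNING: removed {removed:.0%} of cells — check your thresholds"
    return f"OK: removed {removed:.0%} of cells"

print(check_retention(12847, 9861))   # within the 10–25% guideline
print(check_retention(12847, 6000))   # over-filtered: triggers the warning
```

Calling this with `n_cells_before` and `adata.n_obs` right after filtering makes over-aggressive thresholds visible immediately.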

What Comes Next

Part 1 — Data Loading & Preprocessing

CellRanger output → AnnData object


Part 2 — Quality Control & Cell Filtering (you are here)

QC metrics → filtering → normalization → HVGs

Part 3 — Dimensionality Reduction

PCA → Neighbors graph → UMAP embedding

Part 4 — Clustering & Cell Type Annotation

Leiden clustering → marker genes → cell type labels

Part 5 — Differential Expression

Wilcoxon / DESeq2 → volcano plots → pathway analysis

You're QC ready! Your dataset now has high-quality cells only. The full pipeline was: compute QC metrics → visualize distributions → set thresholds → filter cells → filter genes → normalize → select HVGs → save. Your .h5ad file is ready for Part 3: PCA and UMAP.