Complete Beginner's Guide

Build Gene Regulatory Networks
from RNA-seq Using GENIE3

A step-by-step walkthrough for absolute beginners — from raw count matrices to inferred regulatory network graphs

R & Python
GENIE3 Algorithm
Bioconductor
~45 min read

Table of Contents

01

What Are Gene Regulatory Networks?

The biological foundation you need before writing any code

A Gene Regulatory Network (GRN) is a map that shows how genes control each other's activity inside a cell. Think of it like a city road map — some genes are "highways" (called transcription factors) that regulate many others, and some are quiet side streets that only affect one or two genes.

Transcription Factor (TF)

A protein that binds to DNA and turns other genes ON or OFF. It's the regulator.

Target Gene

A gene whose expression is controlled by one or more transcription factors.

RNA-seq Expression

Quantitative measurement of how much each gene is expressed across many samples.

Regulatory Edge

A predicted link: "Gene A influences Gene B". Weighted by confidence score.

Why RNA-seq? We can't directly observe gene regulation easily, but we CAN measure gene expression across many samples. GENIE3 reverse-engineers the regulatory network by asking: "Whose expression best predicts this gene's expression?"

From RNA-seq data, GENIE3 infers a ranked list of regulatory links between transcription factors and their target genes. Higher-ranked links are more likely to be real regulatory relationships.

02

How GENIE3 Works

The Random Forest algorithm under the hood

GENIE3 stands for GEne Network Inference with Ensemble of trees, 3rd version. It was published in 2010 by Huynh-Thu et al. and consistently ranks among the top GRN inference methods in benchmark studies (DREAM challenges).

Step 1 — Input: Expression Matrix

Rows = genes, Columns = samples. Each cell = expression value (TPM, CPM, or normalized counts).

Step 2 — For Each Target Gene

GENIE3 treats the target gene as the "response variable" and all other genes (or TFs) as "predictors".

Step 3 — Train a Random Forest

A Random Forest (or Extra-Trees) regression model is trained. The goal: predict the target's expression from all TF expressions.

Step 4 — Extract Feature Importance

The feature importances of the Random Forest become the regulatory weights. High importance = strong predicted regulatory link.

Step 5 — Ranked Network Output

All (regulator → target) pairs are ranked by their combined importance scores. Top links = predicted regulatory relationships.

Key insight: GENIE3 makes no assumption about the direction or form of regulation. It's purely data-driven, which makes it flexible but also means it requires good-quality, diverse expression data.
03

Environment Setup

Installing R, Bioconductor, and the GENIE3 package

GENIE3 is best run in R via the Bioconductor package. Here's how to set everything up from scratch:

You need R ≥ 4.1 and Bioconductor ≥ 3.14. Download R from cran.r-project.org and we recommend using RStudio as your IDE.

1. Install Bioconductor and GENIE3

R
# Install BiocManager if you don't have it
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install GENIE3 from Bioconductor
BiocManager::install("GENIE3")

# Install additional packages we'll use
install.packages(c("igraph", "ggplot2", "dplyr", "reshape2"))

# For better network visualization
BiocManager::install("RCy3")  # optional: Cytoscape integration

2. Verify installation

R
library(GENIE3)
packageVersion("GENIE3")
# Expected output: [1] '1.x.x'
04

Preparing Your RNA-seq Data

Format requirements, normalization, and quality checks

GENIE3 requires a normalized gene expression matrix as input. Raw counts are not appropriate you must normalize first.

Required Format

PropertyRequirementNotes
Matrix shapegenes × samplesRows = genes, Columns = samples
ValuesNormalized expressionTPM, RPKM, CPM, or log-normalized counts
Row namesGene symbols or Ensembl IDsMust be consistent with TF list
Missing valuesNot allowedRemove genes with NAs before running
Minimum samples~20–30+ recommendedMore samples = better inference

Step-by-step data preparation:

R
library(GENIE3)
library(dplyr)

# ── Load your count matrix ──────────────────────────────────
# Replace with your actual file path
counts <- read.csv("raw_counts.csv", row.names = 1)

# Inspect the data
dim(counts)       # rows = genes, cols = samples
head(counts[, 1:5]) # preview first 5 samples

# ── Normalize: CPM (Counts Per Million) ─────────────────────
cpm_normalize <- function(mat) {
  lib_sizes <- colSums(mat)
  sweep(mat, 2, lib_sizes, "/") * 1e6
}

expr_cpm <- cpm_normalize(counts)

# ── Log-transform (recommended for GENIE3) ──────────────────
# Adding 1 prevents log(0) issues
expr_log <- log2(expr_cpm + 1)

# ── Filter lowly expressed genes ────────────────────────────
# Keep genes expressed in at least 10% of samples with CPM > 1
min_samples <- ceiling(0.1 * ncol(expr_log))
keep <- rowSums(expr_cpm > 1) >= min_samples
expr_filtered <- expr_log[keep, ]

cat("Genes before filtering:", nrow(expr_log), "\n")
cat("Genes after filtering: ", nrow(expr_filtered), "\n")

# ── Convert to matrix (GENIE3 requires matrix, not data.frame)
expr_matrix <- as.matrix(expr_filtered)
Using DESeq2? If you already have a DESeqDataSet, extract variance-stabilized values with vst(dds) or rlog(dds) — these are ideal for GENIE3 input.

Optional: Define a list of known Transcription Factors

R
# Providing a TF list dramatically speeds up GENIE3
# and reduces false positives. Highly recommended!

# Option A: Load from a known TF database (e.g., AnimalTFDB, TRRUST)
tf_list <- read.table("human_TFs.txt", header = FALSE)$V1

# Option B: Quick example with a few known TFs
tf_list <- c("TP53", "MYC", "FOXO3", "STAT3", "NF1", "SP1")

# Keep only TFs present in our expression matrix
tf_list <- intersect(tf_list, rownames(expr_matrix))
cat("TFs matched in data:", length(tf_list), "\n")
05

Running GENIE3

The core function call and key parameters explained

With your expression matrix ready, running GENIE3 is a single function call. But understanding the parameters helps you get better results.

R — Core GENIE3 Run
library(GENIE3)

# ── Basic run (no TF list, all genes as regulators) ─────────
weight_matrix <- GENIE3(expr_matrix)

# ── Recommended run with TF list ────────────────────────────
weight_matrix <- GENIE3(
  expr_matrix,
  regulators    = tf_list,     # only use TFs as potential regulators
  targets       = NULL,        # NULL = all genes are potential targets
  treeMethod    = "RF",        # "RF" (Random Forest) or "ET" (Extra-Trees)
  K             = "sqrt",      # features per split: "sqrt", "all", or integer
  nTrees        = 1000,        # number of trees (higher = more stable, slower)
  nCores        = 4,           # parallel cores (use parallel::detectCores()-1)
  verbose       = TRUE         # print progress
)

# weight_matrix is a genes × TFs matrix of importance scores
dim(weight_matrix)   # should be [nGenes × nTFs]
head(weight_matrix[, 1:3])

Parameter Reference

ParameterDefaultWhat it doesRecommendation
regulatorsNULLWhich genes act as potential regulatorsAlways provide a TF list
treeMethod"RF"RF = Random Forest; ET = Extra-Trees (faster)"ET" for large datasets
K"sqrt"Number of candidate features per tree split"sqrt" is usually best
nTrees1000Number of trees in the ensemble500–1000 is sufficient
nCores1Parallel processing coresUse all available − 1
Performance note: With 5,000 genes and 100 samples, GENIE3 can take 30–60 minutes on 1 core. Use nCores = parallel::detectCores() - 1 and treeMethod = "ET" to dramatically speed things up.

Convert the weight matrix to a ranked edge list:

R
# Convert weight matrix to a ranked list of regulatory links
link_list <- getLinkList(weight_matrix)

# Returns data frame with columns: regulatoryGene, targetGene, weight
head(link_list, 10)
#   regulatoryGene  targetGene    weight
# 1          TP53        MYC    0.08432
# 2          MYC         CCND1  0.07891
# 3          STAT3       BCL2   0.07203
# ...

# Save to file
write.csv(link_list, "grn_predictions.csv", row.names = FALSE)
06

Interpreting the Output

Understanding weights, thresholds, and what the numbers mean

GENIE3 outputs a ranked list of regulatory links with a weight score for each (Regulator → Target) pair. Higher weight = stronger predicted regulatory relationship.

Important: GENIE3 weights are relative importance scores, NOT probabilities. They are NOT directly comparable across different datasets. Always interpret them as a ranking, not absolute values.

Choosing a weight threshold — how many edges to keep:

R
library(ggplot2)

# ── Explore weight distribution ─────────────────────────────
summary(link_list$weight)

# Plot weight distribution (look for an elbow)
ggplot(head(link_list, 5000), aes(x = seq_along(weight), y = weight)) +
  geom_line(color = "#00e5a0") +
  geom_hline(yintercept = 0.01, linetype = "dashed", color = "#ff6b6b") +
  labs(title = "GENIE3 Weight Score Distribution",
       x = "Link rank", y = "Importance weight") +
  theme_dark()

# ── Apply a threshold ───────────────────────────────────────
# Option A: Keep top N links
top_links <- head(link_list, 1000)

# Option B: Keep links above a weight threshold
top_links <- link_list[link_list$weight > 0.01, ]

# Option C: Keep top X% by percentile
threshold <- quantile(link_list$weight, 0.95)  # top 5%
top_links  <- link_list[link_list$weight >= threshold, ]

cat("Edges in filtered network:", nrow(top_links), "\n")

Example output preview (top 5 predicted links):

PREDICTED REGULATORY LINKS — sorted by importance weight
TP53 → CDKN1A
0.0843
MYC → CCND1
0.0721
STAT3 → BCL2
0.0612
FOXO3 → BNIP3
0.0488
SP1 → VEGFA
0.0375
07

Visualizing the Network

Using igraph in R to plot your GRN

A ranked edge list is data — a network graph is insight. Here's how to visualize your GRN using igraph in R.

R — Network Visualization
library(igraph)

# ── Build igraph object from top links ──────────────────────
grn_graph <- graph_from_data_frame(
  d        = top_links,
  directed = TRUE,             # GRNs are directed: TF → target
  vertices = NULL
)

# ── Add node metadata ────────────────────────────────────────
# Mark nodes that are transcription factors
V(grn_graph)$is_tf <- V(grn_graph)$name %in% tf_list
V(grn_graph)$color  <- ifelse(V(grn_graph)$is_tf, "#00e5a0", "#00aaff")
V(grn_graph)$size   <- ifelse(V(grn_graph)$is_tf, 12, 6)

# ── Edge weight → width ─────────────────────────────────────
E(grn_graph)$width  <- E(grn_graph)$weight * 50

# ── Plot ─────────────────────────────────────────────────────
plot(
  grn_graph,
  layout         = layout_with_fr(grn_graph),  # Fruchterman-Reingold
  vertex.label   = V(grn_graph)$name,
  vertex.label.cex = 0.7,
  vertex.label.color = "white",
  vertex.frame.color = "transparent",
  edge.arrow.size  = 0.3,
  edge.color       = "#1a3048",
  main             = "Inferred Gene Regulatory Network (GENIE3)",
  bg               = "#060c14"
)

legend("bottomleft",
  legend = c("Transcription Factor", "Target Gene"),
  col    = c("#00e5a0", "#00aaff"),
  pch    = 20, bty = "n"
)

Sample network structure (representative visualization):

TP53
MYC
STAT3
FOXO3
BCL2
CDKN1A
CCND1
VEGFA
BAX
🔬
For publication-quality figures: Export your igraph to Cytoscape using the RCy3 package, or export as GraphML and open in Gephi — both offer far more styling options for polished network figures.

Compute basic network statistics:

R — Network Stats
# ── Key network statistics ───────────────────────────────────
cat("Nodes (genes):", vcount(grn_graph), "\n")
cat("Edges (links):", ecount(grn_graph), "\n")

# Hub genes (highest out-degree = regulate most genes)
out_degree <- degree(grn_graph, mode = "out")
head(sort(out_degree, decreasing = TRUE), 10)

# Most regulated genes (highest in-degree)
in_degree <- degree(grn_graph, mode = "in")
head(sort(in_degree, decreasing = TRUE), 10)

# Network density (0 = sparse, 1 = fully connected)
edge_density(grn_graph)

# Find communities / gene modules
modules <- cluster_louvain(as.undirected(grn_graph))
membership(modules)   # which module each gene belongs to
08

Python Alternative (arboreto)

Running GENIE3/GRNBoost2 in Python using arboreto + scanpy

If you prefer Python, the arboreto library implements GENIE3 and the faster GRNBoost2 algorithm, which is particularly popular for single-cell RNA-seq analysis with scanpy.

bash — Installation
pip install arboreto scanpy anndata pandas numpy networkx
Python — GENIE3 / GRNBoost2
import pandas as pd
import numpy as np
from arboreto.algo import genie3, grnboost2
from arboreto.utils import load_tf_names

# ── Load your expression matrix ─────────────────────────────
# Rows = samples, Columns = genes (transpose of R convention!)
expr_df = pd.read_csv("expression_matrix.csv", index_col=0).T

# Gene names
gene_names = list(expr_df.columns)

# Optional: TF list
tf_names = ["TP53", "MYC", "STAT3", "FOXO3", "SP1"]
tf_names = [tf for tf in tf_names if tf in gene_names]

# ── Run GENIE3 ──────────────────────────────────────────────
network_genie3 = genie3(
    expression_data = expr_df.values,
    gene_names      = gene_names,
    tf_names        = tf_names    # optional but recommended
)

# ── OR run GRNBoost2 (faster, similar accuracy) ─────────────
network_grn = grnboost2(
    expression_data = expr_df.values,
    gene_names      = gene_names,
    tf_names        = tf_names,
    seed            = 42
)

# Returns DataFrame with columns: TF, target, importance
print(network_grn.head(10))
network_grn.to_csv("grn_grnboost2.csv", index=False)
Python — Visualize with NetworkX
import networkx as nx
import matplotlib.pyplot as plt

# Keep top 500 edges
top_edges = network_grn.nlargest(500, 'importance')

# Build directed graph
G = nx.DiGraph()
for _, row in top_edges.iterrows():
    G.add_edge(row['TF'], row['target'], weight=row['importance'])

# Color TFs vs targets
node_colors = ['#00e5a0' if n in tf_names else '#00aaff'
                for n in G.nodes()]

pos = nx.spring_layout(G, seed=42, k=2)

plt.figure(figsize=(14, 10), facecolor='#060c14')
nx.draw_networkx(G, pos,
    node_color   = node_colors,
    node_size    = [600 if n in tf_names else 200 for n in G.nodes()],
    font_size    = 8,
    font_color   = 'white',
    edge_color   = '#1a3048',
    arrows       = True,
    arrowsize    = 15)
plt.title("Inferred GRN — GRNBoost2", color='white')
plt.savefig("grn_network.png", dpi=150, bbox_inches='tight')
09

Tips, Pitfalls & Best Practices

What separates a good GRN analysis from a great one

TIP 01

Always provide a TF list

Using all genes as regulators creates noise. Restricting to known TFs improves precision enormously. Use databases like AnimalTFDB, TRRUST, or JASPAR.

TIP 02

More samples = better inference

GENIE3 needs variation to learn from. Aim for ≥50 samples. Combining multiple datasets (batch-corrected) can help greatly.

TIP 03

Validate with known interactions

Check your top-ranked edges against databases like STRING, RegNetwork, or TRRUST. High overlap = good sign.

TIP 04

GENIE3 predicts, not proves

Network edges are predictions. Always validate key regulatory relationships with ChIP-seq, knockdown experiments, or literature evidence.

TIP 05

Try GRNBoost2 for scRNA-seq

For single-cell data, GRNBoost2 is faster and more robust to the sparsity of scRNA-seq matrices.

TIP 06

Consider SCENIC for scRNA-seq

SCENIC builds on GENIE3/GRNBoost2 and adds TF motif analysis. It's the gold standard for single-cell GRN inference.

Common Errors and Solutions

ErrorCauseFix
Error in GENIE3: 'exprMatrix' must be a matrixPassed a data.frameAdd as.matrix() before calling GENIE3
No regulators found in matrixTF names don't match gene namesCheck spelling/case; use intersect()
Very slow executionSingle core, too many genesSet nCores, use treeMethod="ET", filter low-expressed genes
All weights are extremely smallData not normalizedLog-transform or CPM normalize first
Too many edges in networkThreshold too lowUse top 1000 or top 5% percentile cutoff
You did it! You now know how to go from a raw RNA-seq expression matrix all the way to an inferred Gene Regulatory Network with GENIE3. The full pipeline is: normalize → filter → GENIE3 → threshold → visualize → validate.

Next Steps

SCENIC

Single-Cell rEgulatory Network Inference and Clustering — built on GENIE3 with motif analysis.

Cytoscape

Desktop app for publication-quality network visualization. Import your edge list directly.

TRRUST & RegNetwork

Gold-standard databases of known TF–target interactions. Use for validation.

GRNBoost2

A faster GENIE3 variant optimized for large-scale and single-cell expression data.