Technical Documentation

BioAtlas Technical Details

A comprehensive technical overview of data integration, normalization pipelines, and the 99.6% dimensional reduction that powers BioAtlas.

Core Innovation: 99.6% Dimensional Reduction

After integration, the remaining problem is interpretation. A gene expression experiment measures 60,000+ genes, of which roughly 99% are unchanged or noise. BioAtlas reduces these measurements to a small set of biologically meaningful signals.

Network-Based Solution

Gene Expression: 60,710 dimensions (noisy)
Filter to network genes: ~6,000 relevant
Apply DoRothEA: 242 TF activities
Apply PROGENy: 14 pathway activities

Result: 60,710 → 256 features (99.6% reduction)
Method: ULM (Univariate Linear Model)
Formula: activity = Σ(expression × sign) / √n_targets
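
As a minimal sketch of the formula above (not the BioAtlas pipeline itself), the following Python computes one activity score from an expression vector and a signed target list. The gene names, the expression values, and the helper name activity_score are hypothetical and chosen only for illustration. The sketch follows the normalized signed sum exactly as written here; a full ULM fit, as implemented in tools such as decoupler, instead reports the t-statistic of a regression slope, so absolute values may differ.

```python
import math

def activity_score(expression, targets):
    """Signed sum of target expression, scaled by sqrt(number of targets used).

    expression: dict gene -> expression value (e.g. a z-score or log fold change)
    targets:    dict gene -> sign (+1 activation, -1 repression)
    Targets missing from `expression` are skipped (an assumption of this sketch).
    """
    used = [(expression[g], s) for g, s in targets.items() if g in expression]
    if not used:
        return float("nan")
    total = sum(value * sign for value, sign in used)
    return total / math.sqrt(len(used))

# Hypothetical regulator with four signed targets; GENE_D is not measured.
tf_targets = {"GENE_A": 1, "GENE_B": 1, "GENE_C": -1, "GENE_D": 1}
sample = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": -1.0, "GENE_E": 5.0}

score = activity_score(sample, tf_targets)
print(round(score, 3))  # (2.0 + 1.0 + 1.0) / sqrt(3)
```

Dividing by √n_targets keeps regulators with many targets from dominating purely because more terms enter the sum.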

Scope: The Breadth of Integration

Genetics (2.5B+ Variants)

  • GWAS Catalog: 443,634 trait studies
  • UK Biobank: Full summary statistics
  • 80M colocalization tests linking variants to causal genes
  • GTEx: 415K tissue-specific eQTLs
  • Open Targets: 1.14M gene-disease associations

Gene Expression (85M+ Measurements)

  • LINCS L1000: 57.6M gene measurements
  • CELLxGENE: 28.2M single-cell measurements
  • GTEx: 2M tissue expression measurements
  • Tahoe-100M: 1.5M activity scores from 100M cells

Drug Perturbations (204M+ Scores)

  • LINCS L1000: 202M TF/pathway activity scores
  • Tahoe-100M: 1.55M activity scores (DMSO-normalized)
  • SC CRISPR: 1.6M activities from 208K gene knockouts
  • Cross-platform harmonized using same TF/pathway networks

Drug Discovery (25K+ Interactions)

  • ChEMBL: 10K drugs, mechanisms, indications
  • BindingDB: 1.35M binding affinities → pChEMBL unified
  • Drug-Target: 25K interactions from 11 sources
  • Adverse Events: 1.45M records (FDA FAERS)

Normalization: The Cleaning We Did

1. DMSO Cascade Normalization (Tahoe-100M)

Problem: Plate effects, batch effects, and vehicle toxicity confound drug signals.

Cascade Matching Strategy:

Priority 1: Exact Match (~60% of scores)
  - Same plate + cell line + time + feature
  - normalized = drug_score - dmso_score (the most direct batch correction)

Priority 2: Plate Match (Additional contexts)
  - Same plate + feature (different cell/time)
  - drug_score - mean(plate_dmso)

Priority 3: Raw Score (~40% of scores)
  - No matching DMSO available
  - Raw score with NO_CONTROL flag

Quality Flags: Every score tagged with control availability. Users can filter to normalized scores for highest confidence.
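
The cascade above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production code: the row field names (plate, cell_line, time, feature, score) and the flag names EXACT_MATCH and PLATE_MATCH are hypothetical, and only NO_CONTROL comes from the description above. Multiple matching DMSO controls are averaged, which is also an assumption.

```python
from collections import defaultdict
from statistics import mean

def normalize_scores(drug_rows, dmso_rows):
    """Apply the three-tier cascade; returns rows with `normalized` and `flag`."""
    exact = defaultdict(list)  # (plate, cell_line, time, feature) -> DMSO scores
    plate = defaultdict(list)  # (plate, feature) -> DMSO scores
    for r in dmso_rows:
        exact[(r["plate"], r["cell_line"], r["time"], r["feature"])].append(r["score"])
        plate[(r["plate"], r["feature"])].append(r["score"])

    out = []
    for r in drug_rows:
        k_exact = (r["plate"], r["cell_line"], r["time"], r["feature"])
        k_plate = (r["plate"], r["feature"])
        if k_exact in exact:      # Priority 1: exact context match
            value, flag = r["score"] - mean(exact[k_exact]), "EXACT_MATCH"
        elif k_plate in plate:    # Priority 2: same plate and feature
            value, flag = r["score"] - mean(plate[k_plate]), "PLATE_MATCH"
        else:                     # Priority 3: no control, keep the raw score
            value, flag = r["score"], "NO_CONTROL"
        out.append({**r, "normalized": value, "flag": flag})
    return out

# Made-up rows for illustration only.
dmso = [
    {"plate": "P1", "cell_line": "A549", "time": 24, "feature": "TP53", "score": 0.5},
    {"plate": "P1", "cell_line": "HeLa", "time": 24, "feature": "TP53", "score": 1.5},
]
drugs = [
    {"plate": "P1", "cell_line": "A549", "time": 24, "feature": "TP53", "score": 3.0},
    {"plate": "P1", "cell_line": "MCF7", "time": 24, "feature": "TP53", "score": 3.0},
    {"plate": "P2", "cell_line": "A549", "time": 24, "feature": "TP53", "score": 3.0},
]
result = normalize_scores(drugs, dmso)
for row in result:
    print(row["cell_line"], row["plate"], row["flag"], row["normalized"])
```

Because every output row carries its flag, a downstream filter can keep only the tiers it trusts.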

2. pChEMBL Potency Unification

Problem: Binding data arrives in inconsistent forms (Ki in nM, Kd in μM, IC50 in various units), so raw values are not directly comparable.

All → pChEMBL = -log10(value in mol/L), e.g. 10 nM = 1e-8 M gives pChEMBL 8

Measurement Quality Ranking:
Ki (equilibrium) > Kd (dissociation) > IC50 (functional) > EC50 (functional)

Result: 1,353,181 directly comparable potencies
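
A minimal Python sketch of the conversion and ranking described above. The unit table, the record layout, and the function names are assumptions made for illustration. Breaking ties within one measurement type by higher potency is also an assumption, since the ranking above only orders the types.

```python
import math

# Conversion factors from common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

# Lower rank = preferred measurement type, following the ranking above.
TYPE_RANK = {"Ki": 0, "Kd": 1, "IC50": 2, "EC50": 3}

def pchembl(value, unit):
    """Return -log10 of the potency expressed in mol/L."""
    if value <= 0:
        raise ValueError("potency must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])

def best_measurement(measurements):
    """Pick the preferred record: best type rank first, then highest pChEMBL."""
    return min(
        measurements,
        key=lambda m: (TYPE_RANK[m["type"]], -pchembl(m["value"], m["unit"])),
    )

# Made-up measurements for one hypothetical drug-target pair.
records = [
    {"type": "IC50", "value": 50, "unit": "nM"},
    {"type": "Ki", "value": 0.2, "unit": "uM"},
]
chosen = best_measurement(records)
print(chosen["type"], round(pchembl(chosen["value"], chosen["unit"]), 2))
```

Here the Ki record is chosen even though the IC50 is numerically more potent, because measurement type takes precedence over potency.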

3. GWAS Harmonization to GRCh38

Problem: Genome builds, coordinates, and allele conventions differ across 443K studies.

1. Coordinate liftover to GRCh38
2. Variant ID standardization: chr:pos:ref:alt
3. Allele harmonization (consistent effect alleles)
4. QC filters:
   - MAF > 0.01
   - INFO > 0.8
   - p < 5×10⁻⁸
5. Ready for colocalization

Result: Clean, queryable, colocalizable genetic data.
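
Steps 2 to 4 above can be sketched in Python as follows. This assumes liftover to GRCh38 has already been done (it needs a chain file and an external tool, so it is omitted). The record field names are hypothetical, orienting effects to the ALT allele is one possible convention chosen for this sketch, and strand flips are not handled.

```python
def variant_id(chrom, pos, ref, alt):
    """Build a chr:pos:ref:alt identifier, e.g. 'chr1:12345:A:G'."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return f"chr{chrom}:{int(pos)}:{ref.upper()}:{alt.upper()}"

def harmonize(record):
    """Orient the effect to the ALT allele; return None if alleles do not match."""
    ref, alt = record["ref"].upper(), record["alt"].upper()
    effect = record["effect_allele"].upper()
    other = record["other_allele"].upper()
    beta, eaf = record["beta"], record["eaf"]
    if (effect, other) == (ref, alt):
        beta, eaf = -beta, 1.0 - eaf  # flip so the effect allele becomes ALT
    elif (effect, other) != (alt, ref):
        return None                   # mismatched alleles: drop the record
    return {
        "variant_id": variant_id(record["chrom"], record["pos"], ref, alt),
        "beta": beta, "eaf": eaf, "info": record["info"], "p": record["p"],
    }

def passes_qc(row, maf_min=0.01, info_min=0.8, p_max=5e-8):
    """Apply the MAF, INFO, and p-value thresholds listed above."""
    maf = min(row["eaf"], 1.0 - row["eaf"])
    return maf > maf_min and row["info"] > info_min and row["p"] < p_max

# Made-up summary-statistics record whose effect allele is the reference allele.
raw = {"chrom": "1", "pos": 12345, "ref": "a", "alt": "G",
       "effect_allele": "A", "other_allele": "G",
       "beta": 0.12, "eaf": 0.7, "info": 0.95, "p": 1e-9}
row = harmonize(raw)
print(row["variant_id"], row["beta"], passes_qc(row))
```

Flipping the sign of beta together with the effect-allele frequency keeps each effect size tied to the same allele across studies, which colocalization requires.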

4. Cross-Platform Activity Normalization

Problem: LINCS measures 978 genes, Tahoe about 62K, and CRISPR screens a variable number, so gene-level values cannot be compared directly.

Solution: Apply same networks (DoRothEA + PROGENy) to all platforms.

Platform    | Input Genes | TF Activities | Comparability
LINCS L1000 | 978         | 202M scores   | ✓ Same TFs/pathways
Tahoe-100M  | 62,710      | 1.2M scores   | ✓ Same TFs/pathways
SC CRISPR   | 8-23K       | 1.6M scores   | ✓ Same TFs/pathways

Result: TP53 activity in LINCS is directly comparable to TP53 activity in Tahoe, which enables cross-platform validation.
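
Since the platforms measure very different numbers of genes, a shared network covers each regulator's targets unevenly. The sketch below shows one way to check that coverage before comparing scores. It is an illustrative assumption, not a documented BioAtlas step: the network, the gene lists, the function names, and the min_targets threshold are all hypothetical.

```python
def measured_targets(network, measured_genes):
    """Restrict each regulator's signed targets to genes the platform measures."""
    measured = set(measured_genes)
    return {
        regulator: {g: s for g, s in targets.items() if g in measured}
        for regulator, targets in network.items()
    }

def shared_regulators(network, platforms, min_targets=2):
    """Regulators with at least `min_targets` measured targets on every platform."""
    per_platform = {
        name: measured_targets(network, genes) for name, genes in platforms.items()
    }
    return sorted(
        regulator for regulator in network
        if all(len(per_platform[name][regulator]) >= min_targets for name in platforms)
    )

# Hypothetical network: regulator -> {target gene: sign}.
network = {
    "TF_1": {"G1": 1, "G2": -1, "G3": 1},
    "TF_2": {"G4": 1, "G5": 1},
}
# Hypothetical platforms with different gene coverage.
platforms = {
    "small_panel": ["G1", "G2", "G4"],
    "whole_transcriptome": ["G1", "G2", "G3", "G4", "G5"],
}
shared = shared_regulators(network, platforms)
print(shared)
```

In this toy case TF_2 is dropped because the small panel measures only one of its targets.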

What Only BioAtlas Has

80 Million Colocalization Tests

GWAS identifies variant-disease associations, but which gene is causal? BioAtlas provides 79,926,756 Bayesian colocalization tests (GWAS × eQTL).

  • 63,332,122 with strong evidence (H4 > 0.8)
  • 70% have very strong evidence (H4 > 0.9)
  • Precomputed (weeks of analysis saved)

Multi-Source Merges

Not separate downloads — unified, deduplicated, evidence-weighted:

  • Adverse Events: FDA_FAERS = 1.45M
  • Ligand-Receptor: CellPhoneDB + CellChatDB = 20.9M
  • Drug-Target: ChEMBL + 10 sources = 25K

Cross-Platform Harmonization

Same TF/pathway framework across all perturbation datasets:

  • Compare LINCS drug → Tahoe drug → CRISPR KO
  • Validate mechanisms across platforms
  • 204.5M scores using consistent networks

Complete Local SQL Access

No APIs, no rate limits, no internet required:

  • Download once, query forever
  • Full PostgreSQL power
  • Multi-hop joins across all 40+ sources

Usage Examples

Find Drugs Targeting Genetically Validated Genes

-- Find FDA-approved drugs targeting genes with strong 
-- genetic evidence for Alzheimer's disease
SELECT DISTINCT
    d.drug_name,
    d.max_phase,
    g.hgnc_symbol,
    dg.association_score,
    dt.action_type
FROM drug d
JOIN drug_target dt ON d.molecule_chembl_id = dt.molecule_chembl_id
JOIN gene g ON dt.ensembl_gene_id = g.ensembl_gene_id
JOIN disease_gene dg ON g.ensembl_gene_id = dg.ensembl_gene_id
JOIN disease dis ON dg.mondo_id = dis.mondo_id
WHERE dis.disease_label ILIKE '%Alzheimer%'
  AND d.fda_approved_us = TRUE
  AND dg.association_score > 0.5
ORDER BY dg.association_score DESC;

Drug Mechanism-of-Action Analysis

-- Which pathways does Paclitaxel strongly activate or repress across cell lines?
-- (the HAVING clause keeps mean activities with absolute value above 2.0)
SELECT 
    tc.cell_line_name,
    ta.feature_id as pathway,
    AVG(ta.score) as mean_activity,
    COUNT(*) as n_measurements
FROM tahoe_context tc
JOIN tahoe_activity ta ON tc.context_id = ta.context_id
WHERE tc.drug_name = 'Paclitaxel'
  AND ta.provider = 'PROGENy'
  AND ta.feature_type = 'pathway'
GROUP BY tc.cell_line_name, ta.feature_id
HAVING ABS(AVG(ta.score)) > 2.0
ORDER BY ABS(AVG(ta.score)) DESC;

Colocalization-Based Drug Repurposing

-- Find drugs targeting genes with strong colocalization 
-- evidence for inflammatory bowel disease
SELECT DISTINCT
    d.drug_name,
    g.hgnc_symbol,
    c.h4 as colocalization_prob,
    dt.action_type
FROM coloc_bayesian c
JOIN gene g ON c."rightStudyLocusId" LIKE '%' || g.ensembl_gene_id || '%'
JOIN drug_target dt ON g.ensembl_gene_id = dt.ensembl_gene_id
JOIN drug d ON dt.molecule_chembl_id = d.molecule_chembl_id
WHERE c."leftStudyLocusId" LIKE '%inflammatory_bowel%'
  AND c.h4 > 0.9
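  AND d.max_phase >= 2
-- Order before LIMIT so the 20 rows returned are the strongest, not arbitrary
ORDER BY colocalization_prob DESC
LIMIT 20;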

Ready to Query Across 490M+ Rows?

Download BioAtlas and start discovering drug-disease-mechanism relationships in seconds.