BioAtlas Technical Details
A technical overview of the data integration, the normalization pipelines, and the 99.6% dimensionality reduction that powers BioAtlas.
Core Innovation: 99.6% Dimensionality Reduction
After integration comes the interpretation problem. A gene expression experiment measures 60,000+ genes, of which about 99% are unchanged or noise. BioAtlas reduces these measurements to a small set of biologically meaningful signals.
Network-Based Solution
Result: 60,710 → 256 features (99.6% reduction)
Method: ULM (Univariate Linear Model)
Formula: activity = Σ(expression × sign) / √n_targets
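The formula above can be sketched in a few lines of Python. This is a minimal illustration of the stated formula only, not the BioAtlas pipeline: the function name `activity_score`, the dictionary-based network format, the choice to count only measured targets in n_targets, and the toy values are all assumptions made for the example.

```python
import math

def activity_score(expression, targets):
    """Signed sum of target expression, scaled by sqrt(number of targets used).

    expression: dict mapping gene symbol -> expression value (e.g. a z-score)
    targets:    dict mapping gene symbol -> sign (+1 activated, -1 repressed)
                for a single TF or pathway (hypothetical network format)
    """
    # Assumption: only targets that were actually measured contribute,
    # and n_targets is the number of measured targets.
    measured = [gene for gene in targets if gene in expression]
    if not measured:
        return None  # no overlap between network and measured genes
    signed_sum = sum(expression[gene] * targets[gene] for gene in measured)
    return signed_sum / math.sqrt(len(measured))

# Toy example with made-up values (not real data).
expression = {"CDKN1A": 2.0, "MDM2": 1.5, "BAX": 1.0, "CCNB1": -1.5}
tp53_targets = {"CDKN1A": 1, "MDM2": 1, "BAX": 1, "CCNB1": -1}
score = activity_score(expression, tp53_targets)
```

In the toy example the signed sum is 6.0 over 4 targets, so the score is 6.0 / 2 = 3.0. Dividing by √n_targets keeps scores from growing simply because a network has more targets.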
Scope: The Breadth of Integration
Genetics (2.5B+ Variants)
- GWAS Catalog: 443,634 trait studies
- UK Biobank: Full summary statistics
- 80M colocalization tests linking variants to causal genes
- GTEx: 415K tissue-specific eQTLs
- Open Targets: 1.14M gene-disease associations
Gene Expression (85M+ Measurements)
- LINCS L1000: 57.6M gene measurements
- CELLxGENE: 28.2M single-cell measurements
- GTEx: 2M tissue expression measurements
- Tahoe-100M: 1.5M activity scores from 100M cells
Drug Perturbations (204M+ Scores)
- LINCS L1000: 202M TF/pathway activity scores
- Tahoe-100M: 1.55M activity scores (DMSO-normalized)
- SC CRISPR: 1.6M activities from 208K gene knockouts
- Cross-platform harmonized using same TF/pathway networks
Drug Discovery (25K+ Interactions)
- ChEMBL: 10K drugs, mechanisms, indications
- BindingDB: 1.35M binding affinities → pChEMBL unified
- Drug-Target: 25K interactions from 11 sources
- Adverse Events: 1.45M records (FDA FAERS)
Normalization: The Cleaning We Did
1. DMSO Cascade Normalization (Tahoe-100M)
Problem: Plate effects, batch effects, and vehicle toxicity confound drug signals.
Cascade Matching Strategy:
- Priority 1: Exact Match (~60% of scores)
  - Same plate + cell line + time + feature
  - drug_score - dmso_score = perfect batch correction
- Priority 2: Plate Match (additional contexts)
  - Same plate + feature (different cell/time)
  - drug_score - mean(plate_dmso)
- Priority 3: Raw Score (~40% of scores)
  - No matching DMSO available
  - Raw score with NO_CONTROL flag
Quality Flags: Every score tagged with control availability. Users can filter to normalized scores for highest confidence.
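The cascade can be sketched as a single function that tries each priority in turn and returns both the score and its quality flag. This is an illustrative sketch, not the BioAtlas code: the record layout, the function name `normalize_score`, and the flag names `EXACT_MATCH` and `PLATE_MATCH` are assumptions (only `NO_CONTROL` appears above), as is averaging when several exact-match controls exist.

```python
from statistics import mean

EXACT_KEYS = ("plate", "cell_line", "time", "feature")

def normalize_score(drug, dmso_controls):
    """Return (score, flag) for one drug score using the three-level cascade.

    drug:          dict with keys plate, cell_line, time, feature, score
    dmso_controls: list of dicts with the same keys, measured under DMSO
    """
    # Priority 1: same plate + cell line + time + feature.
    exact = [c["score"] for c in dmso_controls
             if all(c[k] == drug[k] for k in EXACT_KEYS)]
    if exact:
        return drug["score"] - mean(exact), "EXACT_MATCH"

    # Priority 2: same plate + feature, any cell line or time point.
    plate = [c["score"] for c in dmso_controls
             if c["plate"] == drug["plate"] and c["feature"] == drug["feature"]]
    if plate:
        return drug["score"] - mean(plate), "PLATE_MATCH"

    # Priority 3: no matching DMSO control, keep the raw score and flag it.
    return drug["score"], "NO_CONTROL"

# Toy controls with made-up values.
controls = [
    {"plate": "P1", "cell_line": "A549", "time": 24, "feature": "TP53", "score": 1.0},
    {"plate": "P1", "cell_line": "HeLa", "time": 24, "feature": "TP53", "score": 3.0},
]
drug = {"plate": "P1", "cell_line": "A549", "time": 24, "feature": "TP53", "score": 4.0}
result = normalize_score(drug, controls)
```

Returning the flag alongside the score is what lets a downstream query keep only normalized scores, as described above.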
2. pChEMBL Potency Unification
Problem: Binding data arrive in incompatible forms: Ki in nM, Kd in μM, IC50 in various units. The raw values are not directly comparable.
All → pChEMBL = -log10(molar)
Measurement Quality Ranking: Ki (equilibrium) > Kd (dissociation) > IC50 (functional) > EC50 (functional)
Result: 1,353,181 directly comparable potencies
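As a worked illustration of the conversion and the ranking, the sketch below converts a value and unit to molar, takes the negative base-10 logarithm, and picks the preferred measurement type when several exist for the same drug-target pair. The unit table, the function names, and the record format are assumptions for the example, not the BioAtlas schema.

```python
import math

# Conversion factors from common concentration units to molar (mol/L).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

# Lower rank = preferred measurement type, following the ranking above.
TYPE_RANK = {"Ki": 0, "Kd": 1, "IC50": 2, "EC50": 3}

def pchembl(value, unit):
    """Convert a potency value in the given unit to -log10(molar)."""
    if value <= 0:
        raise ValueError("potency value must be positive")
    molar = value * UNIT_TO_MOLAR[unit]
    return -math.log10(molar)

def best_measurement(measurements):
    """Pick the measurement with the highest-ranked type (Ki > Kd > IC50 > EC50)."""
    return min(measurements, key=lambda m: TYPE_RANK[m["type"]])

# Toy records with made-up values for one drug-target pair.
records = [
    {"type": "IC50", "value": 1.0, "unit": "uM"},
    {"type": "Ki", "value": 10.0, "unit": "nM"},
]
chosen = best_measurement(records)
potency = pchembl(chosen["value"], chosen["unit"])
```

Here the Ki record is chosen over the IC50 record, and 10 nM corresponds to a pChEMBL of approximately 8. Higher values mean more potent binding, and one pChEMBL unit is a tenfold change in concentration.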
3. GWAS Harmonization to GRCh38
Problem: Different builds, coordinates, alleles across 443K studies.
1. Coordinate liftover to GRCh38
2. Variant ID standardization: chr:pos:ref:alt
3. Allele harmonization (consistent effect alleles)
4. QC filters:
   - MAF > 0.01
   - INFO > 0.8
   - p < 5×10⁻⁸
5. Ready for colocalization
Result: Clean, queryable, colocalizable genetic data.
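Steps 2 to 4 can be sketched with three small helpers. This is a simplified illustration under stated assumptions: the function and field names are hypothetical, the ID format assumes the chromosome is written without a "chr" prefix, and allele harmonization here only flips the effect sign when the effect allele is the reference allele (strand flips and palindromic variants are not handled). Liftover is omitted because it requires chain files and an external tool.

```python
def variant_id(chrom, pos, ref, alt):
    """Build a chr:pos:ref:alt identifier (assumed format: no 'chr' prefix)."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}:{pos}:{ref.upper()}:{alt.upper()}"

def harmonize_effect(beta, effect_allele, ref, alt):
    """Express the effect size relative to the alt allele.

    Returns None when the effect allele matches neither ref nor alt.
    """
    effect_allele, ref, alt = effect_allele.upper(), ref.upper(), alt.upper()
    if effect_allele == alt:
        return beta
    if effect_allele == ref:
        return -beta  # effect was reported for the other allele
    return None

def passes_qc(record, min_maf=0.01, min_info=0.8, max_p=5e-8):
    """Apply the three QC thresholds listed above to one association record."""
    return (record["maf"] > min_maf
            and record["info"] > min_info
            and record["p"] < max_p)

# Toy record with made-up statistics.
vid = variant_id("chr19", 44908684, "t", "c")
record = {"maf": 0.05, "info": 0.95, "p": 1e-9}
keep = passes_qc(record)
```

The thresholds are exposed as keyword arguments so the same filter can be rerun with stricter or looser cutoffs.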
4. Cross-Platform Activity Normalization
Problem: LINCS measures 978 genes, Tahoe 62K genes, and CRISPR screens a variable number. How can activities be compared across platforms?
Solution: Apply same networks (DoRothEA + PROGENy) to all platforms.
| Platform | Input Genes | TF Activities | Comparability |
|---|---|---|---|
| LINCS L1000 | 978 | 202M scores | ✓ Same TFs/pathways |
| Tahoe-100M | 62,710 | 1.2M scores | ✓ Same TFs/pathways |
| SC CRISPR | 8-23K | 1.6M scores | ✓ Same TFs/pathways |
Result: TP53 activity in LINCS is comparable to TP53 activity in Tahoe, which enables cross-platform validation.
What Only BioAtlas Has
80 Million Colocalization Tests
GWAS identifies variant-disease associations, but which gene is causal? BioAtlas provides 79,926,756 Bayesian colocalization tests (GWAS × eQTL).
- 63,332,122 with strong evidence (H4 > 0.8)
- 70% have very strong evidence (H4 > 0.9)
- Precomputed (weeks of analysis saved)
Multi-Source Merges
These are not separate downloads but unified, deduplicated, evidence-weighted merges:
- Adverse Events: FDA_FAERS = 1.45M
- Ligand-Receptor: CellPhoneDB + CellChatDB = 20.9M
- Drug-Target: ChEMBL + 10 sources = 25K
Cross-Platform Harmonization
Same TF/pathway framework across all perturbation datasets:
- Compare LINCS drug → Tahoe drug → CRISPR KO
- Validate mechanisms across platforms
- 204.5M scores using consistent networks
Complete Local SQL Access
No APIs, no rate limits, no internet required:
- Download once, query forever
- Full PostgreSQL power
- Multi-hop joins across all 40+ sources
Usage Examples
Find Drugs Targeting Genetically Validated Genes
-- Find FDA-approved drugs targeting genes with strong
-- genetic evidence for Alzheimer's disease
SELECT DISTINCT
d.drug_name,
d.max_phase,
g.hgnc_symbol,
dg.association_score,
dt.action_type
FROM drug d
JOIN drug_target dt ON d.molecule_chembl_id = dt.molecule_chembl_id
JOIN gene g ON dt.ensembl_gene_id = g.ensembl_gene_id
JOIN disease_gene dg ON g.ensembl_gene_id = dg.ensembl_gene_id
JOIN disease dis ON dg.mondo_id = dis.mondo_id
WHERE dis.disease_label ILIKE '%Alzheimer%'
AND d.fda_approved_us = TRUE
AND dg.association_score > 0.5
ORDER BY dg.association_score DESC;
Drug Mechanism-of-Action Analysis
-- Which pathways does Paclitaxel most strongly activate or repress across cell lines?
SELECT
tc.cell_line_name,
ta.feature_id as pathway,
AVG(ta.score) as mean_activity,
COUNT(*) as n_measurements
FROM tahoe_context tc
JOIN tahoe_activity ta ON tc.context_id = ta.context_id
WHERE tc.drug_name = 'Paclitaxel'
AND ta.provider = 'PROGENy'
AND ta.feature_type = 'pathway'
GROUP BY tc.cell_line_name, ta.feature_id
HAVING ABS(AVG(ta.score)) > 2.0
ORDER BY ABS(AVG(ta.score)) DESC;
Colocalization-Based Drug Repurposing
-- Find drugs targeting genes with strong colocalization
-- evidence for inflammatory bowel disease
SELECT DISTINCT
d.drug_name,
g.hgnc_symbol,
c.h4 as colocalization_prob,
dt.action_type
FROM coloc_bayesian c
JOIN gene g ON c."rightStudyLocusId" LIKE '%' || g.ensembl_gene_id || '%'
JOIN drug_target dt ON g.ensembl_gene_id = dt.ensembl_gene_id
JOIN drug d ON dt.molecule_chembl_id = d.molecule_chembl_id
WHERE c."leftStudyLocusId" LIKE '%inflammatory_bowel%'
AND c.h4 > 0.9
AND d.max_phase >= 2
LIMIT 20;
Ready to Query Across 490M+ Rows?
Download BioAtlas and start discovering drug-disease-mechanism relationships in seconds.