Scoring Systems Guide
This guide explains how the Molecular AOP Builder calculates pathway suggestions and confidence scores. The system uses a multi-signal hybrid approach combining gene overlap, semantic embeddings, and ontology tag matching to identify and rank relevant pathways.
Three-Signal Hybrid Approach
The pathway suggestion engine combines three independent matching methods to provide robust, biologically meaningful recommendations:
         ┌─────────────────────┐
         │   Key Event Input   │
         │   (Title + Genes)   │
         └──────────┬──────────┘
    ┌───────────────┼───────────────┐
    │               │               │
    ▼               ▼               ▼
┌────────┐    ┌───────────┐   ┌───────────┐
│Gene 35%│    │Semantic50%│   │Ontology15%│
│Overlap │    │BioBERT    │   │Tag Match  │
└───┬────┘    └─────┬─────┘   └─────┬─────┘
    │               │               │
    │               │               │
    └───────────────┼───────────────┘
                    ▼
         ┌─────────────────────┐
         │   Hybrid Combiner   │
         │  + multi-evidence   │
         │     bonus (+5%)     │
         └──────────┬──────────┘
                    ▼
         ┌─────────────────────┐
         │ Ranked Suggestions  │
         │   with Confidence   │
         └─────────────────────┘
Signal Weights
| Method | Weight | Strength | Best For |
|---|---|---|---|
| Gene-Based | 35% | Biological validation | KEs with gene associations |
| Semantic/BioBERT | 50% | Conceptual similarity | Related but differently named items |
| Ontology Tags | 15% | Classification matching | Pathways with biological category tags |
- Semantic (50%): Consistently finds relevant pathways even when terminology differs
- Gene (35%): Provides mechanistic validation through shared gene associations
- Ontology (15%): Matches pathways by biological category (e.g. apoptosis, signaling)
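The weighting described above can be sketched as a plain weighted sum. This is a minimal illustration, not the project's actual combiner: the function name `combine_signals`, the rule that the +5% multi-evidence bonus applies when at least two signals are non-zero, and the cap at 1.0 are all assumptions. Only the three weights and the size of the bonus come from this guide.

```python
# Weights and bonus size are taken from this guide; everything else is illustrative.
WEIGHTS = {"gene": 0.35, "semantic": 0.50, "ontology": 0.15}
MULTI_EVIDENCE_BONUS = 0.05


def combine_signals(gene: float, semantic: float, ontology: float) -> float:
    """Combine three per-signal scores (each in [0, 1]) into one hybrid score."""
    scores = {"gene": gene, "semantic": semantic, "ontology": ontology}
    hybrid = sum(WEIGHTS[name] * value for name, value in scores.items())

    # Assumption: the bonus applies when two or more signals contribute evidence.
    if sum(1 for value in scores.values() if value > 0) >= 2:
        hybrid += MULTI_EVIDENCE_BONUS

    # Assumption: the combined score is capped at 1.0.
    return min(hybrid, 1.0)


if __name__ == "__main__":
    print(round(combine_signals(gene=0.6, semantic=0.8, ontology=0.3), 3))
```

Because the semantic signal carries half of the weight, a pathway with no gene or ontology evidence can still reach a hybrid score of 0.5 in this sketch, while agreement between signals is what pushes a suggestion toward the top of the ranking.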
Gene-Based Matching (35% Weight)
Gene-based matching retrieves the genes associated with a Key Event from AOP-Wiki, then finds the WikiPathways pathways that contain those genes.
How It Works
- Gene Retrieval: SPARQL query to AOP-Wiki fetches genes linked to the selected KE
- Pathway Search: WikiPathways SPARQL finds pathways containing those genes
- Confidence Calculation: Combines overlap ratio and pathway specificity
Confidence Formula
overlap_ratio = matching_genes / KE_total_genes
specificity = matching_genes / pathway_total_genes
confidence = (overlap_ratio × 0.4) + (specificity × 0.4) + base_boost
Scoring Examples
| KE Genes | Matching | Pathway Size | Confidence | Interpretation |
|---|---|---|---|---|
| 5 | 5/5 | 50 genes | 0.95 | Perfect overlap in focused pathway |
| 8 | 4/8 | 50 genes | 0.67 | Partial overlap |
| 1 | 1/1 | 100 genes | 0.47 | Single gene in large pathway |
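The Confidence Formula can be written out as a small function. This is a sketch under stated assumptions: the name `gene_confidence`, the zero-gene guard, and the cap at 1.0 are hypothetical, and the default `base_boost` of 0.15 is taken from the starting value shown under Common Tuning Scenarios. The sketch applies only the three terms of the formula, so it is not expected to reproduce the values in the Scoring Examples table, which may include adjustments not described in this guide.

```python
def gene_confidence(
    matching_genes: int,
    ke_total_genes: int,
    pathway_total_genes: int,
    base_boost: float = 0.15,  # Assumed default, from the tuning table's starting value.
) -> float:
    """Apply the three-term Confidence Formula to raw gene counts."""
    if ke_total_genes <= 0 or pathway_total_genes <= 0:
        return 0.0  # Assumption: no genes means no gene-based evidence.

    overlap_ratio = matching_genes / ke_total_genes
    specificity = matching_genes / pathway_total_genes
    confidence = overlap_ratio * 0.4 + specificity * 0.4 + base_boost

    return min(confidence, 1.0)  # Assumption: confidence is capped at 1.0.


if __name__ == "__main__":
    # 5 of 5 KE genes found in a 50-gene pathway.
    print(round(gene_confidence(5, 5, 50), 3))
```

The two ratios pull in different directions: `overlap_ratio` rewards pathways that cover most of the KE's genes, while `specificity` penalizes very large pathways where a few matching genes are likely to appear by chance.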
Ontology Tag Matching (15% Weight)
Ontology-based matching compares biological keywords extracted from the KE title against pathway classification tags from WikiPathways ontology annotations.
How It Works
- Keyword Extraction: Biological keywords are extracted from the KE title (e.g. "apoptosis", "CYP2E1", "signaling")
- Tag Comparison: Keywords are compared against WikiPathways ontology tags for each pathway
- Fuzzy Matching: Both exact and fuzzy string matching are used (fuzzy threshold: 0.85 similarity)
Scoring
| Match Type | Boost per Match | Example |
|---|---|---|
| Exact match | +0.30 | "apoptosis" in KE matches "apoptosis" tag exactly |
| Fuzzy match | +0.15 | "signaling" in KE fuzzy-matches "signalling" tag |
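The exact and fuzzy boosts above can be sketched with Python's standard `difflib` module. The guide does not say which string-similarity measure the tool uses, so `SequenceMatcher.ratio()` is a stand-in, and the function name `ontology_score`, the case-insensitive comparison, counting each keyword at most once, and the cap at 1.0 are all assumptions. The boost values and the 0.85 threshold come from this guide.

```python
from difflib import SequenceMatcher

EXACT_BOOST = 0.30
FUZZY_BOOST = 0.15
FUZZY_THRESHOLD = 0.85


def ontology_score(keywords: list[str], tags: list[str]) -> float:
    """Sum per-keyword boosts for KE keywords matched against pathway tags."""
    normalized_tags = [tag.lower() for tag in tags]
    score = 0.0

    for keyword in (k.lower() for k in keywords):
        if keyword in normalized_tags:
            # Exact match takes priority over any fuzzy match.
            score += EXACT_BOOST
        elif any(
            SequenceMatcher(None, keyword, tag).ratio() >= FUZZY_THRESHOLD
            for tag in normalized_tags
        ):
            score += FUZZY_BOOST

    return min(score, 1.0)  # Assumption: the total is capped at 1.0.


if __name__ == "__main__":
    print(round(ontology_score(["apoptosis", "signaling"], ["apoptosis", "signalling"]), 2))
```

With this similarity measure, "signaling" and "signalling" score about 0.95, comfortably above the 0.85 threshold, so spelling variants are matched while unrelated terms are not.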
Semantic/BioBERT Matching (50% Weight)
BioBERT provides neural network-based semantic understanding, capturing biological meaning beyond simple word matching.
What is BioBERT?
BioBERT is a pre-trained language model specifically designed for biomedical text. It was trained on:
- PubMed abstracts (4.5 billion words)
- PMC full-text articles (13.5 billion words)
How Embeddings Work
"Increase, CYP2E1 expression"
              │
              ▼
     ┌─────────────────┐
     │     BioBERT     │
     │ Neural Network  │
     └────────┬────────┘
              │
              ▼
     ┌─────────────────┐
     │ 768-dimensional │
     │ embedding vector│
     │ [0.23, -0.15,   │
     │  0.87, ...]     │
     └─────────────────┘
Pre-Computed Embeddings
For performance, embeddings are pre-computed and cached:
| Data | Count | File |
|---|---|---|
| Key Event embeddings | 1,561 | ke_embeddings.npy |
| Pathway title embeddings | 1,012 | pathway_title_embeddings.npy |
Similarity Calculation
cosine_similarity = dot(KE_embedding, pathway_embedding) / (norm(KE_embedding) × norm(pathway_embedding))
normalized_score = (cosine_similarity + 1) / 2
transformed_score = normalized_score ^ power_exponent
Score Transformation
Raw BioBERT similarity scores tend to cluster in a narrow, high range (roughly 0.8-0.95) because embeddings of biomedical texts rarely have negative cosine similarity. A power transformation spreads these scores apart so that strong and weak matches are easier to tell apart:
| Raw Cosine | Normalized | After Transform (^4.0) |
|---|---|---|
| 0.90 | 0.95 | 0.81 |
| 0.80 | 0.90 | 0.66 |
| 0.70 | 0.85 | 0.52 |
| 0.50 | 0.75 | 0.32 |
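The similarity calculation and power transformation can be reproduced with a few lines of standard-library Python. This is an illustrative sketch: the function names are hypothetical, plain lists stand in for the cached 768-dimensional embeddings, and returning 0.0 for a zero vector is an assumption. The formulas and the exponent of 4.0 are the ones shown in this guide.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Assumption: a zero vector has no measurable similarity.
    return dot / (norm_a * norm_b)


def transform_score(cosine: float, power_exponent: float = 4.0) -> float:
    """Map cosine from [-1, 1] to [0, 1], then apply the power transformation."""
    normalized = (cosine + 1) / 2
    return normalized ** power_exponent


if __name__ == "__main__":
    # Reproduces the "After Transform" column of the table above.
    for raw in (0.90, 0.80, 0.70, 0.50):
        print(raw, round(transform_score(raw), 2))
```

Raising `power_exponent` widens the gap between high and moderate similarities: at 4.0 a raw cosine of 0.90 scores 0.81 while 0.70 scores 0.52, even though their normalized values differ by only 0.10.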
Confidence Assessment Workflow
When you create a new KE-WP mapping, a 4-question workflow calculates your confidence level:
Question 1: Relationship Type
"What is the relationship between the pathway and Key Event?"
| Option | Meaning |
|---|---|
| Causative | Pathway activity leads to the Key Event (Pathway → KE) |
| Responsive | Key Event triggers pathway activation (KE → Pathway) |
| Bidirectional | Both directions apply |
| Unclear | Relationship exists but direction uncertain |
This question determines the connection_type field but does not affect the confidence score.
Question 2: Evidence Basis (0-3 points)
"What is the basis for this mapping?"
| Option | Points | When to Select |
|---|---|---|
| Known connection | 3 | Published evidence directly supports this mapping |
| Likely connection | 2 | Strong inference from your domain knowledge |
| Possible connection | 1 | Plausible hypothesis but uncertain |
| Uncertain connection | 0 | No clear basis for the connection |
Answer based on what you already know. No research is required.
Question 3: Pathway Specificity (0-2 points)
"How specific is the pathway to this Key Event?"
| Option | Points | Example |
|---|---|---|
| KE-specific | 2 | "CYP2E1 metabolism" pathway for KE about CYP2E1 |
| Includes KE | 1 | "Xenobiotic metabolism" pathway that includes CYP2E1 |
| Loosely related | 0 | Very broad pathway tangentially related |
Question 4: KE Coverage (0-1.5 points)
"How much of the KE mechanism is captured by the pathway?"
| Option | Points | Meaning |
|---|---|---|
| Complete mechanism | 1.5 | Pathway fully represents the KE |
| Key steps only | 1.0 | Major elements captured, some missing |
| Minor aspects | 0.5 | Only peripheral aspects represented |
Biological Level Bonus (+1 point)
KEs at molecular, cellular, or tissue levels receive a +1 bonus because they are closer to pathway mechanisms:
| Biological Level | Bonus |
|---|---|
| Molecular, Cellular, Tissue | +1.0 |
| Organ, Individual, Population | +0.0 |
Final Score Calculation
Final Score = Evidence (0-3) + Specificity (0-2) + Coverage (0-1.5) + Bio Bonus (0-1)
Maximum Score = 7.5 points
Confidence Level Thresholds
| Score Range | Confidence Level | Interpretation |
|---|---|---|
| ≥ 5.0 | High | Strong evidence, good specificity and coverage |
| 2.5 - 4.9 | Medium | Moderate evidence with some limitations |
| < 2.5 | Low | Weak evidence or poor pathway fit |
Example Calculations
High Confidence Example:
- Known connection: 3 points
- KE-specific pathway: 2 points
- Complete mechanism: 1.5 points
- Molecular level: +1 bonus
- Total: 7.5 → High confidence
Medium Confidence Example:
- Likely connection: 2 points
- Includes KE: 1 point
- Key steps only: 1 point
- Organ level: no bonus
- Total: 4.0 → Medium confidence
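The scoring workflow above can be expressed as a short function. This is a sketch, not the application's code: the option keys (such as `"ke_specific"`) and the function name `assess_confidence` are hypothetical labels chosen for illustration. The point values, the biological level bonus, and the thresholds are the ones listed in this guide.

```python
# Point values from Questions 2-4; the dictionary keys are illustrative labels.
EVIDENCE_POINTS = {"known": 3.0, "likely": 2.0, "possible": 1.0, "uncertain": 0.0}
SPECIFICITY_POINTS = {"ke_specific": 2.0, "includes_ke": 1.0, "loosely_related": 0.0}
COVERAGE_POINTS = {"complete": 1.5, "key_steps": 1.0, "minor_aspects": 0.5}
BONUS_LEVELS = {"molecular", "cellular", "tissue"}


def assess_confidence(
    evidence: str,
    specificity: str,
    coverage: str,
    biological_level: str,
    high_threshold: float = 5.0,
    medium_threshold: float = 2.5,
) -> tuple[float, str]:
    """Return (final score, confidence level) for one KE-WP mapping."""
    score = (
        EVIDENCE_POINTS[evidence]
        + SPECIFICITY_POINTS[specificity]
        + COVERAGE_POINTS[coverage]
    )
    # Molecular, cellular, and tissue KEs receive the +1 biological level bonus.
    if biological_level.lower() in BONUS_LEVELS:
        score += 1.0

    if score >= high_threshold:
        level = "High"
    elif score >= medium_threshold:
        level = "Medium"
    else:
        level = "Low"
    return score, level


if __name__ == "__main__":
    print(assess_confidence("known", "ke_specific", "complete", "Molecular"))
    print(assess_confidence("likely", "includes_ke", "key_steps", "Organ"))
```

The two calls in the `__main__` block correspond to the High and Medium examples above. Question 1 (relationship type) is deliberately absent because it does not affect the score.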
Tuning Parameters
All scoring parameters are configurable via scoring_config.yaml. Changes require a Flask restart to take effect.
Configuration File Location
scoring_config.yaml
Common Tuning Scenarios
| Goal | Parameter | Change |
|---|---|---|
| More suggestions | dynamic_thresholds.base_threshold | Lower (0.15 → 0.10) |
| Higher gene confidence | gene_scoring.base_boost | Raise (0.15 → 0.20) |
| Spread BioBERT scores | embedding_based_matching.score_transformation.power_exponent | Raise (4.0 → 5.0) |
| Lenient assessment | ke_pathway_assessment.confidence_thresholds.high | Lower (5.0 → 4.5) |
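The dotted parameter names in the table read as paths into nested sections of the configuration. The sketch below shows one way such a path could be resolved once the file has been loaded into a dictionary. The helper `get_param` and the `EXAMPLE_CONFIG` structure are hypothetical: only the four parameter paths and their starting values come from this guide, and the real file presumably contains many more settings.

```python
from typing import Any

# Hypothetical in-memory view of scoring_config.yaml, limited to the
# four parameters and starting values listed in the table above.
EXAMPLE_CONFIG = {
    "dynamic_thresholds": {"base_threshold": 0.15},
    "gene_scoring": {"base_boost": 0.15},
    "embedding_based_matching": {"score_transformation": {"power_exponent": 4.0}},
    "ke_pathway_assessment": {"confidence_thresholds": {"high": 5.0}},
}


def get_param(config: dict, dotted_path: str) -> Any:
    """Walk a nested dict using a dotted path such as 'gene_scoring.base_boost'."""
    node: Any = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"Missing config key: {dotted_path!r} (at {key!r})")
        node = node[key]
    return node


if __name__ == "__main__":
    print(get_param(EXAMPLE_CONFIG, "gene_scoring.base_boost"))
    print(get_param(EXAMPLE_CONFIG, "embedding_based_matching.score_transformation.power_exponent"))
```

Raising an error for a missing key, rather than silently returning a default, makes a mistyped parameter name visible immediately after editing the file.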
Applying Changes
# 1. Edit configuration
nano scoring_config.yaml
# 2. Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('scoring_config.yaml'))"
# 3. Restart Flask
# (';' instead of '&&' so the app still starts when no old process was running)
pkill -f "python.*app\.py"; python app.py &
# 4. Clear browser cache
# Press Ctrl+Shift+R in browser
For detailed parameter documentation, see the SCORING_CONFIG.md reference guide.