Scoring Systems Guide

This guide explains how the Molecular AOP Builder calculates pathway suggestions and confidence scores. The system uses a multi-signal hybrid approach combining gene overlap, semantic embeddings, and ontology tag matching to identify and rank relevant pathways.

Three-Signal Hybrid Approach

The pathway suggestion engine combines three independent matching methods to provide robust, biologically meaningful recommendations:

         ┌─────────────────────┐
         │   Key Event Input   │
         │  (Title + Genes)    │
         └──────────┬──────────┘
    ┌───────────────┼───────────────┐
    │               │               │
    ▼               ▼               ▼
┌────────┐  ┌───────────┐  ┌───────────┐
│Gene 35%│  │Semantic50%│  │Ontology15%│
│Overlap │  │BioBERT    │  │Tag Match  │
└───┬────┘  └─────┬─────┘  └─────┬─────┘
    │             │              │
    │             │              │
    └─────────────┼──────────────┘
                  ▼
         ┌─────────────────────┐
         │   Hybrid Combiner   │
         │ + multi-evidence    │
         │     bonus (+5%)     │
         └──────────┬──────────┘
                    ▼
         ┌─────────────────────┐
         │  Ranked Suggestions │
         │  with Confidence    │
         └─────────────────────┘
        

Signal Weights

Method           | Weight | Strength                | Best For
-----------------+--------+-------------------------+---------------------------------------
Gene-Based       | 35%    | Biological validation   | KEs with gene associations
Semantic/BioBERT | 50%    | Conceptual similarity   | Related but differently named items
Ontology Tags    | 15%    | Classification matching | Pathways with biological category tags

Why these weights? Extended testing across diverse Key Events showed that semantic matching provides the best balance of coverage and accuracy, while ontology tags add biological classification context:

  • Semantic (50%): Consistently finds relevant pathways even when terminology differs
  • Gene (35%): Provides mechanistic validation through shared gene associations
  • Ontology (15%): Matches pathways by biological category (e.g. apoptosis, signaling)

Multi-Evidence Bonus: When a pathway is suggested by multiple methods, it receives a +5% confidence bonus, reinforcing the finding through independent validation.
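
As a rough illustration of how the three signals could be combined, the sketch below applies the weights as a plain weighted sum and adds the bonus when at least two signals are non-zero. The function name, the additive reading of "+5%", and the cap at 1.0 are assumptions made for this example, not a description of the actual combiner.

```python
# Weights taken from the Signal Weights table.
WEIGHTS = {"gene": 0.35, "semantic": 0.50, "ontology": 0.15}
MULTI_EVIDENCE_BONUS = 0.05  # assumed to be additive


def hybrid_score(signal_scores):
    """Combine per-signal scores (each in [0, 1]) into one hybrid score.

    `signal_scores` maps a signal name to its score; a missing signal
    counts as 0. Hypothetical helper, for illustration only.
    """
    combined = sum(WEIGHTS[name] * signal_scores.get(name, 0.0) for name in WEIGHTS)
    contributing = sum(1 for name in WEIGHTS if signal_scores.get(name, 0.0) > 0.0)
    if contributing >= 2:
        combined += MULTI_EVIDENCE_BONUS
    return min(combined, 1.0)  # assumed cap


if __name__ == "__main__":
    print(hybrid_score({"gene": 0.6, "semantic": 0.8, "ontology": 0.3}))
    print(hybrid_score({"semantic": 0.8}))
```

A pathway found only by the semantic signal gets no bonus here, while one supported by two or more signals does.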

Gene-Based Matching (35% Weight)

Gene-based matching retrieves the genes associated with a Key Event from AOP-Wiki, then finds the WikiPathways pathways that contain those genes.

How It Works

  1. Gene Retrieval: SPARQL query to AOP-Wiki fetches genes linked to the selected KE
  2. Pathway Search: WikiPathways SPARQL finds pathways containing those genes
  3. Confidence Calculation: Combines overlap ratio and pathway specificity

Confidence Formula

overlap_ratio = matching_genes / KE_total_genes
specificity   = matching_genes / pathway_total_genes
confidence    = (overlap_ratio × 0.4) + (specificity × 0.4) + base_boost

Scoring Examples

KE Genes | Matching | Pathway Size | Confidence | Interpretation
---------+----------+--------------+------------+-----------------------------------
5        | 5/5      | 50 genes     | 0.95       | Perfect overlap in focused pathway
8        | 4/8      | 50 genes     | 0.67       | Partial overlap
1        | 1/1      | 100 genes    | 0.47       | Single gene in large pathway

KE Gene Count Penalty: KEs with only 1-2 gene associations receive a 20% penalty (×0.8) because evidence from so few genes is insufficient for high confidence.
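
The formula and penalty can be sketched in a few lines. This is a literal reading of the formula as written, with base_boost defaulting to 0.15 (the value shown under Tuning Parameters). The function name, the point at which the penalty is applied, and the cap at 1.0 are assumptions. The real implementation may scale specificity differently, so the returned values are not expected to reproduce the example table exactly.

```python
def gene_confidence(matching_genes, ke_total_genes, pathway_total_genes,
                    base_boost=0.15, low_gene_penalty=0.8):
    """Literal sketch of the gene-based confidence formula (hypothetical helper)."""
    if ke_total_genes <= 0 or pathway_total_genes <= 0:
        return 0.0
    overlap_ratio = matching_genes / ke_total_genes
    specificity = matching_genes / pathway_total_genes
    confidence = overlap_ratio * 0.4 + specificity * 0.4 + base_boost
    # Assumption: the 1-2 gene penalty is applied after the boost.
    if ke_total_genes <= 2:
        confidence *= low_gene_penalty
    return min(confidence, 1.0)  # assumed cap


if __name__ == "__main__":
    print(gene_confidence(5, 5, 50))
    print(gene_confidence(1, 1, 100))
```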

Ontology Tag Matching (15% Weight)

Ontology-based matching compares biological keywords extracted from the KE title against pathway classification tags from WikiPathways ontology annotations.

How It Works

  1. Keyword Extraction: Biological keywords are extracted from the KE title (e.g. "apoptosis", "CYP2E1", "signaling")
  2. Tag Comparison: Keywords are compared against WikiPathways ontology tags for each pathway
  3. Fuzzy Matching: Both exact and fuzzy string matching are used (fuzzy threshold: 0.85 similarity)

Scoring

Match Type  | Boost per Match | Example
------------+-----------------+--------------------------------------------------
Exact match | +0.30           | "apoptosis" in KE matches "apoptosis" tag exactly
Fuzzy match | +0.15           | "signaling" in KE fuzzy-matches "signalling" tag

Score Range: Ontology scores are clamped between a minimum threshold of 0.20 and a maximum of 0.90. Pathways with no matching tags receive a score of 0 and do not contribute to the hybrid score.
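
A minimal sketch of this scoring, assuming keywords have already been extracted from the KE title. It uses difflib.SequenceMatcher from the Python standard library as a stand-in similarity measure, since the guide does not name the fuzzy-matching library. It also assumes that any non-zero score below 0.20 is raised to 0.20 rather than discarded. The function name is hypothetical.

```python
from difflib import SequenceMatcher

EXACT_BOOST = 0.30
FUZZY_BOOST = 0.15
FUZZY_THRESHOLD = 0.85
MIN_SCORE, MAX_SCORE = 0.20, 0.90


def ontology_score(keywords, tags):
    """Score a pathway's ontology tags against KE keywords (illustrative only)."""
    tags_lower = [tag.lower() for tag in tags]
    score = 0.0
    for keyword in (k.lower() for k in keywords):
        if keyword in tags_lower:
            score += EXACT_BOOST
        elif any(SequenceMatcher(None, keyword, tag).ratio() >= FUZZY_THRESHOLD
                 for tag in tags_lower):
            score += FUZZY_BOOST
    if score == 0.0:
        return 0.0  # no matching tags: contributes nothing
    # Assumption: matched scores are clamped into [MIN_SCORE, MAX_SCORE].
    return max(MIN_SCORE, min(score, MAX_SCORE))


if __name__ == "__main__":
    print(ontology_score(["apoptosis"], ["Apoptosis"]))
    print(ontology_score(["signaling"], ["signalling"]))
```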

Semantic/BioBERT Matching (50% Weight)

BioBERT provides neural network-based semantic understanding, capturing biological meaning beyond simple word matching.

What is BioBERT?

BioBERT is a pre-trained language model specifically designed for biomedical text. It was trained on:

  • PubMed abstracts (4.5 billion words)
  • PMC full-text articles (13.5 billion words)

How Embeddings Work

   "Increase, CYP2E1 expression"
              │
              ▼
    ┌─────────────────┐
    │    BioBERT      │
    │  Neural Network │
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │  768-dimensional │
    │  embedding vector│
    │  [0.23, -0.15,  │
    │   0.87, ...]    │
    └─────────────────┘
        

Pre-Computed Embeddings

For performance, embeddings are pre-computed and cached:

Data                     | Count | File
-------------------------+-------+-----------------------------
Key Event embeddings     | 1,561 | ke_embeddings.npy
Pathway title embeddings | 1,012 | pathway_title_embeddings.npy

Similarity Calculation

cosine_similarity = dot(KE_embedding, pathway_embedding) / (norm(KE) × norm(pathway))
normalized_score  = (cosine_similarity + 1) / 2
transformed_score = normalized_score ^ power_exponent
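
The three steps can be written out directly. The sketch below uses plain Python lists in place of the cached 768-dimensional embeddings and treats a zero-length vector as having similarity 0. Both choices, and the function names, are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for degenerate vectors
    return dot / (norm_a * norm_b)


def transformed_score(cosine, power_exponent=4.0):
    """Map a cosine in [-1, 1] to [0, 1], then apply the power transform."""
    normalized = (cosine + 1.0) / 2.0
    return normalized ** power_exponent


if __name__ == "__main__":
    print(round(transformed_score(0.90), 2))  # 0.81
```

With a raw cosine of 0.90 the normalized score is 0.95, and 0.95 raised to the fourth power is about 0.81.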

Score Transformation

Raw BioBERT scores tend to cluster in the 0.8-0.95 range because biomedical texts rarely have negative similarity. A power transformation spreads scores for better differentiation:

Raw Cosine | Normalized | After Transform (^4.0)
-----------+------------+-----------------------
0.90       | 0.95       | 0.81
0.80       | 0.90       | 0.66
0.70       | 0.85       | 0.52
0.50       | 0.75       | 0.32

Directionality Removal: Before computing embeddings, directional terms (increase, decrease, activation, inhibition) are removed from KE titles. This means "Increase CYP2E1" and "Decrease CYP2E1" will suggest the same pathways.
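
One way to picture this preprocessing step is a whole-word, case-insensitive removal of the four listed terms. The regular expression, the punctuation cleanup, and the function name below are assumptions. The actual term list and matching rules may be broader (for example, this sketch leaves inflected forms such as "Increased" untouched).

```python
import re

DIRECTIONAL_TERMS = ("increase", "decrease", "activation", "inhibition")
_DIRECTIONAL_RE = re.compile(
    r"\b(?:" + "|".join(DIRECTIONAL_TERMS) + r")\b", re.IGNORECASE
)


def strip_directionality(title):
    """Remove directional terms from a KE title (illustrative only)."""
    stripped = _DIRECTIONAL_RE.sub(" ", title)
    stripped = re.sub(r"\s+", " ", stripped)  # collapse runs of whitespace
    return stripped.strip(" ,;:-")            # trim leftover separators


if __name__ == "__main__":
    print(strip_directionality("Increase, CYP2E1 expression"))  # CYP2E1 expression
```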

Confidence Assessment Workflow

When you create a new KE-WP mapping, a 4-question workflow calculates your confidence level:

Question 1: Relationship Type

"What is the relationship between the pathway and Key Event?"

Option        | Meaning
--------------+-------------------------------------------------------
Causative     | Pathway activity leads to the Key Event (Pathway → KE)
Responsive    | Key Event triggers pathway activation (KE → Pathway)
Bidirectional | Both directions apply
Unclear       | Relationship exists but direction uncertain

This question determines the connection_type field but does not affect the confidence score.

Question 2: Evidence Basis (0-3 points)

"What is the basis for this mapping?"

Option               | Points | When to Select
---------------------+--------+--------------------------------------------------
Known connection     | 3      | Published evidence directly supports this mapping
Likely connection    | 2      | Strong inference from your domain knowledge
Possible connection  | 1      | Plausible hypothesis but uncertain
Uncertain connection | 0      | No clear basis for the connection

Answer based on what you already know. No research is required.

Question 3: Pathway Specificity (0-2 points)

"How specific is the pathway to this Key Event?"

Option          | Points | Example
----------------+--------+----------------------------------------------------
KE-specific     | 2      | "CYP2E1 metabolism" pathway for KE about CYP2E1
Includes KE     | 1      | "Xenobiotic metabolism" pathway that includes CYP2E1
Loosely related | 0      | Very broad pathway tangentially related

Question 4: KE Coverage (0-1.5 points)

"How much of the KE mechanism is captured by the pathway?"

Option             | Points | Meaning
-------------------+--------+--------------------------------------
Complete mechanism | 1.5    | Pathway fully represents the KE
Key steps only     | 1.0    | Major elements captured, some missing
Minor aspects      | 0.5    | Only peripheral aspects represented

Biological Level Bonus (+1 point)

KEs at molecular, cellular, or tissue levels receive a +1 bonus because they are closer to pathway mechanisms:

Biological Level              | Bonus
------------------------------+------
Molecular, Cellular, Tissue   | +1.0
Organ, Individual, Population | +0.0

Final Score Calculation

Final Score   = Evidence (0-3) + Specificity (0-2) + Coverage (0-1.5) + Bio Bonus (0-1)
Maximum Score = 7.5 points

Confidence Level Thresholds

Score Range | Confidence Level | Interpretation
------------+------------------+-----------------------------------------------
≥ 5.0       | High             | Strong evidence, good specificity and coverage
2.5 - 4.9   | Medium           | Moderate evidence with some limitations
< 2.5       | Low              | Weak evidence or poor pathway fit

Example Calculations

High Confidence Example:

  • Known connection: 3 points
  • KE-specific pathway: 2 points
  • Complete mechanism: 1.5 points
  • Molecular level: +1 bonus
  • Total: 7.5 → High confidence

Medium Confidence Example:

  • Likely connection: 2 points
  • Includes KE: 1 point
  • Key steps only: 1 point
  • Organ level: no bonus
  • Total: 4.0 → Medium confidence
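
The two worked examples can be checked with a short sketch of the scoring rules. The option keys (such as "known" or "ke_specific") and the function name are invented for this example. The point values and thresholds are the ones given in the tables above.

```python
EVIDENCE_POINTS = {"known": 3.0, "likely": 2.0, "possible": 1.0, "uncertain": 0.0}
SPECIFICITY_POINTS = {"ke_specific": 2.0, "includes_ke": 1.0, "loosely_related": 0.0}
COVERAGE_POINTS = {"complete": 1.5, "key_steps": 1.0, "minor": 0.5}
BONUS_LEVELS = {"molecular", "cellular", "tissue"}


def assess(evidence, specificity, coverage, biological_level,
           high=5.0, medium=2.5):
    """Return (score, confidence level) for one KE-WP mapping (hypothetical helper)."""
    score = (EVIDENCE_POINTS[evidence]
             + SPECIFICITY_POINTS[specificity]
             + COVERAGE_POINTS[coverage])
    if biological_level.lower() in BONUS_LEVELS:
        score += 1.0  # biological level bonus
    if score >= high:
        level = "High"
    elif score >= medium:
        level = "Medium"
    else:
        level = "Low"
    return score, level


if __name__ == "__main__":
    print(assess("known", "ke_specific", "complete", "Molecular"))  # (7.5, 'High')
    print(assess("likely", "includes_ke", "key_steps", "Organ"))    # (4.0, 'Medium')
```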

Tuning Parameters

All scoring parameters are configurable via scoring_config.yaml. Changes require a Flask restart to take effect.

Configuration File Location

scoring_config.yaml

Common Tuning Scenarios

Goal                   | Parameter                                                    | Change
-----------------------+--------------------------------------------------------------+--------------------
More suggestions       | dynamic_thresholds.base_threshold                            | Lower (0.15 → 0.10)
Higher gene confidence | gene_scoring.base_boost                                      | Raise (0.15 → 0.20)
Spread BioBERT scores  | embedding_based_matching.score_transformation.power_exponent | Raise (4.0 → 5.0)
Lenient assessment     | ke_pathway_assessment.confidence_thresholds.high             | Lower (5.0 → 4.5)
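
The dotted parameter names above read most naturally as paths through nested YAML keys. Under that assumption, a small helper can look up a value after the file has been parsed. The dictionary below stands in for the parsed contents of scoring_config.yaml and holds only the four "before" values from the table. The helper name and the nesting itself are assumptions, not confirmed structure.

```python
def get_param(config, dotted_path):
    """Follow a dotted path such as 'gene_scoring.base_boost' through nested dicts."""
    node = config
    for key in dotted_path.split("."):
        node = node[key]  # raises KeyError if the path does not exist
    return node


# Stand-in for the parsed YAML file (assumed nesting, current values from the table).
config = {
    "dynamic_thresholds": {"base_threshold": 0.15},
    "gene_scoring": {"base_boost": 0.15},
    "embedding_based_matching": {"score_transformation": {"power_exponent": 4.0}},
    "ke_pathway_assessment": {"confidence_thresholds": {"high": 5.0}},
}

if __name__ == "__main__":
    print(get_param(config, "embedding_based_matching.score_transformation.power_exponent"))
```

A lookup like this is a quick way to confirm that an edited value sits at the path you intended before restarting the application.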

Applying Changes

# 1. Edit configuration
nano scoring_config.yaml

# 2. Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('scoring_config.yaml'))"

# 3. Restart Flask (";" rather than "&&" so the app starts even if no old process was running)
pkill -f "python.*app.py"; python app.py &

# 4. Clear browser cache
# Press Ctrl+Shift+R in browser

For detailed parameter documentation, see the SCORING_CONFIG.md reference guide.