Part 2: Building a Production-Ready Medical NER Pipeline
How I built a hybrid NER system that achieves 95%+ accuracy on clinical text by combining BioBERT models with 57,476 curated medical templates
This is Part 2 of a two-part series on Medical Named Entity Recognition
| Part | Focus | Audience |
|---|---|---|
| Part 1: Clinical Insights | Clinical use cases, patient care impact, business value | Clinical Researchers, Product Owners, Delivery Managers |
| Part 2 (You are here) | Architecture, algorithms, code implementation | Developers, ML Engineers, Data Scientists |

New to Medical NER? Start with Part 1 for the clinical context and use cases.
Introduction
Clinical notes contain a wealth of information locked in unstructured text. A patient’s electronic health record might read:
“Patient denies chest pain but reports shortness of breath. History of diabetes mellitus type 2, currently on metformin. Mother has breast cancer.”
To a clinician, this single paragraph conveys critical information: the patient is NOT experiencing chest pain (negated), IS experiencing shortness of breath (confirmed), HAS a history of diabetes (historical), and has a family history of cancer (family). But to a computer? It’s just a string of characters.
This blog post documents my journey building a production-ready Medical Named Entity Recognition (NER) pipeline that extracts and classifies these clinical insights with high accuracy.
The Problem Space
Why Clinical NLP is Hard
Standard NLP tools struggle with medical text for several reasons:
- Domain-specific vocabulary: Medical terms like “dyspnea,” “myocardial infarction,” and “KIF5A” aren’t in general-purpose NLP models
- Context matters enormously: “No fever” vs “has fever” - one word changes everything
- Complex negation patterns: “Patient denies fever but reports cough” requires understanding scope reversal
- Abbreviations and variations: "DM2," "T2DM," and "diabetes mellitus type 2" all mean the same thing (see the normalization sketch below)
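As a small illustration of the normalization this requires, here is a hand-rolled sketch. The synonym map and `normalize` helper are hypothetical; production systems usually map variants onto a coding system such as UMLS or SNOMED CT:

```python
# Minimal, hand-curated synonym map (hypothetical; real systems map variants
# to UMLS/SNOMED CT concept codes rather than to canonical strings)
SYNONYMS = {
    'dm2': 'diabetes mellitus type 2',
    't2dm': 'diabetes mellitus type 2',
    'type 2 diabetes': 'diabetes mellitus type 2',
}

def normalize(term: str) -> str:
    """Collapse surface variants of a medical term onto one canonical form."""
    return SYNONYMS.get(term.lower().strip(), term)

print(normalize('T2DM'))  # diabetes mellitus type 2
```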
The Stakes
Misclassifying a medical entity isn’t just an academic problem. Marking “chest pain” as “present” when it was actually “denied” could have serious implications for clinical decision support systems, research cohort identification, and automated coding.
Architecture Overview
I designed a 5-stage hybrid processing pipeline that combines the power of transformer-based models with the precision of curated medical templates.
graph TB
subgraph Input
A[Clinical Text]
end
subgraph "Stage 1: Base NLP"
B[Tokenization]
C[Sentence Segmentation]
D[POS Tagging]
end
subgraph "Stage 2: Entity Extraction"
E[BioBERT Disease]
F[BioBERT Chemical]
G[BioBERT Gene]
H[Template Matching<br/>57,476 terms]
end
subgraph "Stage 3: Context Classification"
I[Confirmed<br/>138 patterns]
J[Negated<br/>99 patterns]
K[Uncertain<br/>48 patterns]
L[Historical<br/>82 patterns]
M[Family<br/>79 patterns]
end
subgraph "Stage 4: Section Detection"
N[20+ Clinical Sections]
end
subgraph "Stage 5: Output"
O[15-Column Excel]
P[Streamlit UI]
Q[JSON Export]
end
A --> B --> C --> D
D --> E & F & G & H
E & F & G & H --> I & J & K & L & M
I & J & K & L & M --> N
N --> O & P & Q
Why Hybrid?
Pure machine learning approaches have limitations in specialized domains:
| Approach | Pros | Cons |
|---|---|---|
| BioBERT Only | High accuracy on common entities | Misses rare diseases, custom terms |
| Rules Only | Complete control, no training needed | Brittle, hard to maintain |
| Hybrid (This Project) | Best of both worlds | More complex to implement |
The hybrid approach lets BioBERT handle common patterns while curated templates catch domain-specific edge cases.
Key Innovation #1: Scope Reversal Detection
The “But” Problem
Consider this sentence:
“Patient denies fever but reports cough and fatigue”
A naive negation detector would see “denies” and mark everything after it as negated. But that’s wrong! The word “but” reverses the scope:
- fever → NEGATED (before “but”)
- cough → CONFIRMED (after “but”)
- fatigue → CONFIRMED (after “but”)
103 Reversal Patterns
I implemented a comprehensive scope reversal engine with 103 patterns covering:
Adversative Conjunctions (Confidence: 90-95%):
"denies X but reports Y"
"no X however shows Y"
"not X yet has Y"
"without X but demonstrates Y"
Temporal Transitions (Confidence: 85-92%):
"denies X but now reports Y"
"no X but currently shows Y"
Exception Patterns (Confidence: 80-90%):
"denies all X except Y"
"no X save for Y"
"not X apart from Y"
Implementation
scope_reversal_patterns = {
'negation_to_confirmation_adversative': {
'pattern': r'(denies?|no|not|without|absent)\s+([^.!?]*?)\s+(but|however|yet)\s+(reports?|shows?|has|demonstrates?)',
'scope_before': 'NEGATED',
'scope_after': 'CONFIRMED',
'confidence': 0.95,
'priority': 10
},
# ... 102 more patterns
}
def detect_scope_reversal(text, entities):
"""
Detect conjunction and assign appropriate context to entities
based on their position relative to the reversal point.
"""
for pattern_name, config in scope_reversal_patterns.items():
match = re.search(config['pattern'], text, re.IGNORECASE)
if match:
conjunction_pos = match.start(3) # Position of "but/however/yet"
for entity in entities:
if entity['end'] < conjunction_pos:
entity['context'] = config['scope_before']
else:
entity['context'] = config['scope_after']
return entities
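A quick sanity check on the sample sentence shows the reversal working (entity offsets computed by hand for this illustration):

```python
text = "Patient denies fever but reports cough and fatigue"
entities = [
    {'text': 'fever',   'start': 15, 'end': 20},
    {'text': 'cough',   'start': 33, 'end': 38},
    {'text': 'fatigue', 'start': 43, 'end': 50},
]

for entity in detect_scope_reversal(text, entities):
    print(entity['text'], '->', entity['context'])
# fever -> NEGATED
# cough -> CONFIRMED
# fatigue -> CONFIRMED
```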
Results
| Metric | Without Scope Reversal | With Scope Reversal |
|---|---|---|
| Context Accuracy | 85% | 93% |
| False Positives | 8% | 3% |
| False Negatives | 10% | 5% |
Key Innovation #2: Template-Priority Mode
The Template System
The heart of precision in this pipeline is a curated template system with 57,476 medical terms:
| Template File | Pattern Count | Purpose |
|---|---|---|
| `target_rules_template.xlsx` | 57,476 | Medical terms (diseases, genes, drugs) |
| `confirmed_rules_template.xlsx` | 138 | Confirmation patterns |
| `negated_rules_template.xlsx` | 99 | Negation patterns |
| `uncertainty_rules_template.xlsx` | 48 | Uncertainty patterns |
| `historical_rules_template.xlsx` | 82 | Historical context patterns |
| `family_rules_template.xlsx` | 79 | Family history patterns |
Template-Priority vs Confidence-Based
I implemented two override strategies:
Confidence-Based (Original):
if template_confidence > biobert_confidence:
use_template_detection()
else:
use_biobert_detection()
Template-Priority (Default in v2.3.0):
if template_match:
use_template_detection() # ALWAYS override
Template-priority mode is crucial for:
- Rare diseases not in BioBERT’s training data
- Institution-specific terminology
- Custom clinical vocabularies
- Quality control over entity boundaries
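As a minimal sketch, both strategies reduce to a single resolver for overlapping detections (the function name and dict fields are illustrative, not the project's exact API):

```python
def resolve_overlap(template_ent, biobert_ent, template_priority=True):
    """Decide which detection wins when a template match overlaps a BioBERT span."""
    if template_priority:
        # Template-priority mode (v2.3.0 default): curated terms always win
        return template_ent
    # Confidence-based mode: keep whichever detection scored higher
    if template_ent['confidence'] > biobert_ent['confidence']:
        return template_ent
    return biobert_ent
```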
Python API Example
from src.enhanced_medical_ner_predictor import MedicalNERPredictor
# Initialize with template-priority (default)
predictor = MedicalNERPredictor(template_priority=True)
# Process clinical text
text = """
Patient is a 65-year-old male presenting with chest pain.
Denies fever, chills, or night sweats but reports fatigue.
Past medical history significant for Type 2 diabetes mellitus,
hypertension, and coronary artery disease.
Mother has history of breast cancer.
Possible pneumonia on chest X-ray.
"""
results = predictor.process_text(text)
# Access structured results
print("Confirmed entities:", results['confirmed_entities'])
print("Negated entities:", results['negated_entities'])
print("Historical entities:", results['historical_entities'])
print("Family history:", results['family_entities'])
print("Uncertain entities:", results['uncertain_entities'])
Output:
Confirmed entities: ['chest pain', 'fatigue']
Negated entities: ['fever', 'chills', 'night sweats']
Historical entities: ['Type 2 diabetes mellitus', 'hypertension', 'coronary artery disease']
Family history: ['breast cancer']
Uncertain entities: ['pneumonia']
Key Innovation #3: Rich Context Classification
Five Context Types
Every detected entity is classified into exactly one context category:
mindmap
root((Entity<br/>Context))
Confirmed
diagnosed with
presents with
positive for
exhibits
Negated
denies
no evidence of
negative for
absence of
Uncertain
possible
rule out
suspected
cannot exclude
Historical
history of
previous
past medical
prior
Family
mother has
father with
family history
hereditary
Priority Hierarchy
When multiple contexts could apply, I use a strict priority hierarchy:
1. NEGATED (highest) - Absence is most important clinically
2. FAMILY - Family history is distinct from patient history
3. HISTORICAL - Past conditions differ from current
4. UNCERTAIN - Speculative findings need flagging
5. CONFIRMED (default) - Present/active conditions
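In code, the hierarchy reduces to an ordered list walked top-down; a simplified sketch:

```python
CONTEXT_PRIORITY = ['negated', 'family', 'historical', 'uncertain', 'confirmed']

def resolve_context(matched_contexts):
    """Return the highest-priority context among those whose patterns fired."""
    for context in CONTEXT_PRIORITY:
        if context in matched_contexts:
            return context
    return 'confirmed'  # default when no pattern matched

print(resolve_context({'historical', 'negated'}))  # negated
```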
Confidence Scoring Algorithm
def calculate_confidence(pattern_match, entity_position):
"""
Multi-factor confidence scoring (0-100%).
Total = Strength Points + Proximity Points + Structure Points
"""
# Strength Points (max 40)
strength_map = {'strong': 40, 'moderate': 30, 'weak': 20}
strength_points = strength_map.get(pattern_match.strength, 20)
# Proximity Points (max 40)
distance = abs(pattern_match.position - entity_position)
if distance <= 5:
proximity_points = 40
elif distance <= 10:
proximity_points = 35
elif distance <= 20:
proximity_points = 25
elif distance <= 35:
proximity_points = 15
else:
proximity_points = 5
# Structure Points (max 20)
structure_points = 20 if pattern_match.same_sentence else 10
return strength_points + proximity_points + structure_points
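For example, a strong cue like "denies" sitting three characters from the entity, within the same sentence, scores the maximum. The `PatternMatch` container below is a hypothetical stand-in for whatever object the pipeline actually passes in:

```python
from collections import namedtuple

# Hypothetical container exposing the attributes calculate_confidence reads
PatternMatch = namedtuple('PatternMatch', ['strength', 'position', 'same_sentence'])

# 40 (strong) + 40 (distance <= 5) + 20 (same sentence) = 100
cue = PatternMatch(strength='strong', position=12, same_sentence=True)
print(calculate_confidence(cue, entity_position=15))  # 100
```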
The Processing Pipeline
Stage 1: Base NLP Processing
Using spaCy for foundational NLP:
import spacy
nlp = spacy.load("en_core_web_sm")
def base_nlp_processing(text):
doc = nlp(text)
return {
'tokens': [token.text for token in doc],
'sentences': [sent.text for sent in doc.sents],
'pos_tags': [(token.text, token.pos_) for token in doc],
'dependencies': [(token.text, token.dep_, token.head.text) for token in doc]
}
Stage 2: Hybrid Entity Extraction
Three BioBERT models run in parallel:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
class BioBERTExtractor:
    def __init__(self):
        self.models = {
            'disease': self._load_model('models/pretrained/Disease'),
            'chemical': self._load_model('models/pretrained/Chemical'),
            'gene': self._load_model('models/pretrained/Gene')
        }
        # Curated-term matcher used by extract_entities below; construction
        # shown schematically (TemplateMatcher and its path are assumptions)
        self.template_matcher = TemplateMatcher('templates/target_rules_template.xlsx')
def _load_model(self, path):
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForTokenClassification.from_pretrained(path)
return {'tokenizer': tokenizer, 'model': model}
def extract_entities(self, text):
all_entities = []
for entity_type, model_dict in self.models.items():
entities = self._run_inference(text, model_dict, entity_type)
all_entities.extend(entities)
# Apply template boosting
template_entities = self.template_matcher.find_matches(text)
all_entities = self._merge_entities(all_entities, template_entities)
return all_entities
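The merge step is where template-priority mode takes effect. A plausible simplification of `_merge_entities`, written as a standalone function and assuming entities are dicts with character offsets (the real method may differ):

```python
def merge_entities(biobert_entities, template_entities):
    """Template-priority merge (sketch): curated template spans always win;
    BioBERT spans survive only where no template match overlaps them."""
    def overlaps(a, b):
        return a['start'] < b['end'] and b['start'] < a['end']

    merged = list(template_entities)
    for ent in biobert_entities:
        if not any(overlaps(ent, t) for t in template_entities):
            merged.append(ent)
    return merged

# A template span overrides the shorter, overlapping BioBERT span
biobert = [{'text': 'diabetes', 'start': 0, 'end': 8}]
template = [{'text': 'diabetes mellitus type 2', 'start': 0, 'end': 24}]
print(merge_entities(biobert, template))  # only the template span remains
```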
Stage 3: Context Classification
For each entity, analyze a ±200-character context window:
def classify_context(entity, text, context_window=200):
"""
Classify entity context using pattern matching and confidence scoring.
"""
# Extract context window
start = max(0, entity['start'] - context_window)
end = min(len(text), entity['end'] + context_window)
context = text[start:end]
# Check each context type in priority order
context_scores = {}
for context_type in ['negated', 'family', 'historical', 'uncertain', 'confirmed']:
patterns = load_patterns(f'{context_type}_rules_template.xlsx')
for pattern in patterns:
match = re.search(pattern['regex'], context, re.IGNORECASE)
if match:
confidence = calculate_confidence(match, entity['start'] - start)
if confidence > context_scores.get(context_type, 0):
context_scores[context_type] = confidence
# Apply scope reversal detection
context_scores = apply_scope_reversal(context, entity, context_scores)
# Return highest priority context above threshold
for context_type in ['negated', 'family', 'historical', 'uncertain', 'confirmed']:
threshold = THRESHOLDS[context_type]
if context_scores.get(context_type, 0) >= threshold:
return context_type, context_scores[context_type]
return 'confirmed', 60 # Default
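The `THRESHOLDS` lookup referenced above is the per-context minimum confidence an entity must reach. The values below are illustrative, not the tuned production numbers:

```python
# Minimum confidence (0-100) a context needs before it can claim an entity.
# Illustrative values; in practice these are tuned per context type.
THRESHOLDS = {
    'negated': 70,
    'family': 70,
    'historical': 65,
    'uncertain': 60,
    'confirmed': 60,
}
```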
Stage 4: Section Detection
Detect clinical note sections:
SECTION_PATTERNS = {
'Chief Complaint': [r'\bchief complaint\b', r'\bcc\b:'],
'HPI': [r'\bhistory of present illness\b', r'\bhpi\b:'],
'Assessment': [r'\bassessment\b:', r'\bimpression\b:'],
'Plan': [r'\bplan\b:', r'\brecommendations\b:'],
# ... 20+ section types
}
def detect_sections(text):
sections = []
for section_name, patterns in SECTION_PATTERNS.items():
for pattern in patterns:
if re.search(pattern, text, re.IGNORECASE):
sections.append(section_name)
break
return sections
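Note that this version reports which sections are present rather than their character spans. A quick usage check:

```python
note = "Chief Complaint: chest pain.\nAssessment: possible pneumonia."
print(detect_sections(note))  # ['Chief Complaint', 'Assessment']
```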
Stage 5: Output Generation
Generate comprehensive 15-column Excel output:
| Column | Description |
|---|---|
| Text Visualization | HTML with color-coded entities |
| detected_diseases | Disease entities found |
| total_diseases_count | Count of diseases |
| detected_genes | Gene/protein entities |
| total_gene_count | Count of genes |
| negated_entities | Conditions explicitly denied |
| negated_entities_count | Count |
| historical_entities | Past medical history |
| historical_entities_count | Count |
| uncertain_entities | Possible/suspected conditions |
| uncertain_entities_count | Count |
| family_entities | Family medical history |
| family_entities_count | Count |
| section_categories | Clinical sections detected |
| all_entities_json | Complete structured JSON |
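For reference, one entry in `all_entities_json` might look like the following. This is an illustrative schema that mirrors the entity dicts used earlier; the exact field names may differ:

```python
# One entry in all_entities_json (illustrative, not the guaranteed schema)
{
    'text': 'fever',
    'entity_type': 'disease',
    'start': 15,
    'end': 20,
    'context': 'negated',
    'confidence': 95,
    'section': 'HPI',
}
```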
Performance Results
Accuracy Metrics
| Metric | BioBERT Only | + Templates | + Scope Reversal |
|---|---|---|---|
| Entity Detection | 92% | 96% | 96% |
| Context Classification | 85% | 88% | 93% |
| False Positive Rate | 8% | 4% | 3% |
| False Negative Rate | 10% | 6% | 5% |
Processing Speed
| Input Size | Processing Time | Throughput |
|---|---|---|
| 10 rows | 15 seconds | 0.67 rows/sec |
| 100 rows | 2 minutes | 0.83 rows/sec |
| 1000 rows | 18 minutes | 0.93 rows/sec |
Memory usage stays around 2 GB for typical workloads.
Interactive Demo
The pipeline includes a Streamlit web application for interactive use:
# Launch the Streamlit app
./run_app.sh
# Opens at http://localhost:8501
Features:
- File upload (Excel, CSV, TXT)
- Manual text input
- Real-time processing
- Color-coded entity visualization
- Export to Excel/JSON
- Configurable detection options
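The color-coded view boils down to wrapping each entity span in styled HTML. A minimal sketch of that idea (not the app's actual rendering code):

```python
import streamlit as st

CONTEXT_ICONS = {'negated': '❌', 'historical': '📅', 'family': '👨‍👩‍👧'}

text = "Patient denies fever but reports cough"
entities = [
    {'start': 15, 'end': 20, 'context': 'negated'},
    {'start': 33, 'end': 38, 'context': 'confirmed'},
]

def highlight(text, entities):
    # Insert markup right-to-left so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e['start'], reverse=True):
        icon = CONTEXT_ICONS.get(ent['context'], '')
        span = f"<mark>{text[ent['start']:ent['end']]}{icon}</mark>"
        text = text[:ent['start']] + span + text[ent['end']:]
    return text

st.markdown(highlight(text, entities), unsafe_allow_html=True)
```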
Streamlit App Interface
The Enhanced Bio-NER Entity Visualizer with configuration panel and text input
Entity Visualization with Context
Color-coded entity highlighting showing Disease (red), Drug (blue) with context icons: ❌ Negated, 📅 Historical, 👨👩👧 Family
Context Classification Details
Detailed context classification showing predictor patterns (DENIES, NO, NO EVIDENCE OF, HISTORY OF, MOTHER) that triggered each classification
Technology Stack
mindmap
root((Medical NER<br/>Technology Stack))
NLP Foundation
spaCy 3.7+
scispaCy
negspacy
Deep Learning
PyTorch 2.0+
Hugging Face Transformers
BioBERT NER Models
User Interface
Streamlit 1.28+
Data Processing
Pandas
OpenPyXL
JSON
Lessons Learned
1. Medical NLP is Hard
Standard NLP tools fail spectacularly on clinical text. Domain-specific vocabulary, complex negation, and abbreviations require specialized handling.
2. Hybrid > Pure ML for Specialized Domains
Combining transformer models with curated templates yields better results than either approach alone. BioBERT provides broad coverage; templates ensure precision.
3. Template Curation is Underrated
The 57,476 curated medical terms took significant effort to compile, but they’re essential for catching rare diseases and domain-specific terminology that ML models miss.
4. Context is Everything
Entity extraction is only half the battle. Correctly classifying whether an entity is confirmed, negated, historical, or uncertain is equally important for clinical applications.
5. Scope Reversal is Non-Negotiable
The “but” problem affects real clinical notes frequently. Implementing proper scope reversal detection improved context accuracy by 8 percentage points.
Future Roadmap
- Additional entity types: Symptoms, procedures, lab values
- Multi-language support: Spanish, French clinical notes
- FHIR integration: Export to standard healthcare formats
- Active learning: Improve templates based on user corrections
- GPU optimization: Faster inference for large-scale processing
Conclusion
Building a production-ready medical NER pipeline requires more than just throwing a transformer model at the problem. The hybrid approach - combining BioBERT’s contextual understanding with curated templates and sophisticated context classification - achieves the accuracy needed for real clinical applications.
Key Takeaways:
- Use domain-specific models (BioBERT, not general BERT)
- Augment ML with curated templates for edge cases
- Implement proper scope reversal detection
- Classify context, not just entities
- Validate extensively with clinical experts
Series Navigation
This was Part 2 of the Medical NER Pipeline series.
Missed Part 1?
Part 1: Unlocking Hidden Insights in Clinical Notes covers:
| Topic | What You'll Learn |
|---|---|
| Patient Care Use Cases | How the system improves research cohort identification and surfaces care gaps |
| Clinical Context | Why "denies chest pain" vs "has chest pain" matters clinically |
| Real-World Results | Demo findings from 100 rare disease patient records |
| Implementation Planning | Guidance for product owners and delivery managers |
| Business Value | From weeks to minutes for cohort identification |

Part 1 is written for clinical researchers, product owners, and delivery managers.
References
- BioBERT: A pre-trained biomedical language representation model
- spaCy Industrial-Strength NLP
- Hugging Face Transformers
- Clinical NLP Challenges and Solutions
Have questions about this implementation? Check out the GitHub repository for the complete code.