Medical NER Pipeline - Architecture Overview

mindmap
  root((Medical NER Pipeline))
    Input Processing
      Clinical Reports
        PDF Documents
        Text Files
        Excel Files
      Text Extraction
        PyMuPDF
        python-docx
      Preprocessing
        Sentence Segmentation
        Tokenization
    5-Stage Pipeline
      Stage 1: Base NLP
        spaCy Processing
          Tokenization
          POS Tagging
          Dependency Parsing
        scispaCy Models
          en_core_sci_sm
          Medical Vocabulary
      Stage 2: Entity Extraction
        BioBERT Models
          Disease Model
            BC5CDR-disease
            42K+ Terms
          Chemical Model
            BC5CDR-chem
            5.2K Drugs
          Gene Model
            BC5CDR-gene
            10.2K Genes
        Template Boosting
          57,476 Curated Terms
          Pattern Matching
          Hybrid Approach
      Stage 3: Context Classification
        5 Context Types
          Confirmed: 138 patterns
          Negated: 99 patterns
          Uncertain: 48 patterns
          Historical: 82 patterns
          Family: 79 patterns
        negspacy Integration
        Scope Reversal
          103 patterns
      Stage 4: Section Detection
        20+ Clinical Sections
          Chief Complaint
          History of Present Illness
          Past Medical History
          Medications
          Assessment and Plan
      Stage 5: Output Generation
        43-Column Excel
        Streamlit Dashboard
        JSON Export
    Technology Stack
      NLP Foundation
        spaCy 3.7+
        scispaCy
        negspacy
      ML Models
        Hugging Face Transformers
        PyTorch
        BioBERT Variants
      Data Processing
        pandas
        openpyxl
    Performance Metrics
      Entity Detection
        96% Accuracy
      Context Classification
        93% Accuracy