Architecture

This document describes WiMarka’s system architecture, design decisions, and data flow.

System Overview

WiMarka is a modular machine translation evaluation system. A four-stage pipeline processes each source/target sentence pair and produces detected errors, quality scores, an explanation, and a suggested correction.

High-Level Architecture

┌──────────────────────────────────────────────────────────┐
│                     User Interface                        │
├───────────────────┬──────────────────────────────────────┤
│  Python Library   │     Command-Line Interface (CLI)     │
└────────┬──────────┴──────────────┬───────────────────────┘
         │                          │
         └──────────┬──────────────┘
                    │
               ┌────▼────┐
               │ wmk_eval│  Main Entry Point
               └────┬────┘
                    │
┌───────────────────┼───────────────────┐
│              Evaluation Pipeline       │
│  ┌──────────────────────────────────┐  │
│  │  1. Error Detection              │  │
│  └─────────────┬────────────────────┘  │
│                │                        │
│  ┌─────────────▼────────────────────┐  │
│  │  2. Scoring                      │  │
│  └─────────────┬────────────────────┘  │
│                │                        │
│  ┌─────────────▼────────────────────┐  │
│  │  3. Explanation Generation       │  │
│  └─────────────┬────────────────────┘  │
│                │                        │
│  ┌─────────────▼────────────────────┐  │
│  │  4. Correction Suggestion        │  │
│  └─────────────┬────────────────────┘  │
└────────────────┼────────────────────────┘
                 │
      ┌──────────▼──────────┐
      │  Utilities Layer     │
      ├──────────────────────┤
      │  • Model Management  │
      │  • Caching           │
      │  • Logging           │
      │  • Helper Functions  │
      └──────────┬───────────┘
                 │
      ┌──────────▼──────────┐
      │   Language Models    │
      │  • Transformer LMs   │
      │  • LLM (llama-cpp)   │
      └─────────────────────┘

Core Components

Main Module (main.py)

Responsibility: Orchestrates the evaluation pipeline

Key Functions:

  • wmk_eval(): Main entry point for evaluation

  • Loads source and target files

  • Manages results dictionary

  • Coordinates task modules

Design Decisions:

  • Sequential processing keeps the order of results deterministic

  • Global results dictionary for easy access

  • File-based input for batch processing

CLI Module (cli.py)

Responsibility: Command-line interface

Features:

  • Argument parsing with Click

  • Input validation

  • Error handling and user feedback

Integration: Wraps wmk_eval() with CLI argument handling

Task Modules

Four independent task modules implement the evaluation pipeline:

  1. error_detection.py: Identifies translation errors

  2. scoring.py: Calculates quality metrics

  3. explanation.py: Generates explanations

  4. correction.py: Suggests corrections

See Task Modules for detailed documentation.

Utility Modules

Support modules provide shared functionality:

  • helper.py: File I/O, language tag management

  • logger.py: Logging configuration

  • model.py: Model loading and management

  • cache.py: Response caching

  • torch.py: PyTorch device management

See Utility Modules for detailed documentation.

Evaluation Pipeline

Stage 1: Error Detection

Input: Tagged source and target sentences

src_line = "[EN] Good morning!"
tgt_line = "[CEB] Maayong gabii!"

Process:

  1. Feeds sentences to error detection model

  2. Analyzes syntactic and semantic differences

  3. Identifies specific error types

Output: List of detected errors

errors = ['Semantic mismatch: time of day']

Implementation Details:

  • Uses LLM-based analysis

  • Prompt engineering for error identification

  • Language-specific error patterns
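A minimal sketch of how the three process steps could fit together, assuming the model is any callable that maps a prompt string to a text response. The prompt wording, the names build_error_prompt, parse_errors, and detect_errors, and the one-error-per-line response format are all illustrative assumptions, not WiMarka's actual implementation.

```python
from typing import Callable, List

# Hypothetical prompt template; WiMarka's real prompt is not shown in this document.
PROMPT_TEMPLATE = (
    "List the translation errors in the target sentence, one per line.\n"
    "Write NONE if there are no errors.\n"
    "Source: {src}\n"
    "Target: {tgt}\n"
    "Errors:\n"
)


def build_error_prompt(src_line: str, tgt_line: str) -> str:
    """Fill the template with the tagged source and target sentences."""
    return PROMPT_TEMPLATE.format(src=src_line, tgt=tgt_line)


def parse_errors(response: str) -> List[str]:
    """Turn a one-error-per-line response into a list, dropping bullets and NONE."""
    errors = []
    for line in response.splitlines():
        item = line.strip().lstrip("-*• ").strip()
        if item and item.upper() != "NONE":
            errors.append(item)
    return errors


def detect_errors(src_line: str, tgt_line: str, llm: Callable[[str], str]) -> List[str]:
    """Build the prompt, query the model, and parse its answer."""
    return parse_errors(llm(build_error_prompt(src_line, tgt_line)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without a model."""
    return "- Semantic mismatch: time of day\n"


errors = detect_errors("[EN] Good morning!", "[CEB] Maayong gabii!", fake_llm)
print(errors)  # ['Semantic mismatch: time of day']
```

Passing the model in as a callable keeps the parsing logic testable without loading an LLM.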

Stage 2: Scoring

Input: Source sentence, target sentence, detected errors

Process:

  1. Evaluates fluency (grammatical correctness)

  2. Evaluates adequacy (meaning preservation)

  3. Calculates overall score

Output: Three numerical scores (0-100)

fluency_score = 95
adequacy_score = 40
overall_score = 67.5  # Mean of fluency and adequacy: (95 + 40) / 2

Scoring Algorithm:

  • LLM-based scoring with structured prompts

  • Error count influences scores

  • Language-specific quality criteria
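The example values above are consistent with the overall score being the arithmetic mean of fluency and adequacy. A small sketch under that assumption (the function name overall_score and the range check are illustrative, and the real scoring may weight the two differently):

```python
def overall_score(fluency: float, adequacy: float) -> float:
    """Average two 0-100 scores, rejecting out-of-range inputs."""
    for name, value in (("fluency", fluency), ("adequacy", adequacy)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return (fluency + adequacy) / 2


print(overall_score(95, 40))  # 67.5
```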

Stage 3: Explanation Generation

Input: All previous stage outputs

Process:

  1. Analyzes detected errors

  2. Considers score levels

  3. Generates human-readable explanation

Output: Natural language explanation

explanation = "The translation has incorrect time reference. 'Morning' was translated as 'gabii' (evening)."

Design:

  • Context-aware explanation generation

  • References specific errors

  • Educational tone for clarity

Stage 4: Correction Suggestion

Input: All previous stage outputs

Process:

  1. Analyzes errors and explanations

  2. Generates improved translation

  3. Validates correction quality

Output: Suggested corrected translation

corrected_translation = "Maayong buntag!"

Approach:

  • Error-informed correction

  • Preserves correct portions

  • Maintains semantic equivalence
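One way to check that a correction preserves the correct portions is to diff the original and corrected translations token by token. The sketch below uses Python's standard difflib for this; the helper name changed_tokens and whitespace tokenization are assumptions made for illustration, not part of WiMarka.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def changed_tokens(original: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (old, new) pairs for every token span that differs."""
    a, b = original.split(), corrected.split()
    changes = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


print(changed_tokens("Maayong gabii!", "Maayong buntag!"))
# [('gabii!', 'buntag!')]
```

Here only the time-of-day word changes, while "Maayong" is left untouched.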

Data Flow

Detailed data flow through the system:

┌─────────────┐
│ Input Files │
└──────┬──────┘
       │
       ▼
┌─────────────────────┐
│ wmk_eval()          │
│ - Load files        │
│ - Validate counts   │
│ - Add language tags │
└──────┬──────────────┘
       │
       ▼  (For each sentence pair)
┌────────────────────────┐
│ error_detection()      │
│ Input: src, tgt        │
│ Output: errors[]       │
└──────┬─────────────────┘
       │
       ▼
┌────────────────────────┐
│ scoring()              │
│ Input: src, tgt, errors│
│ Output: 3 scores       │
└──────┬─────────────────┘
       │
       ▼
┌────────────────────────┐
│ generate_explanation() │
│ Input: all above       │
│ Output: explanation    │
└──────┬─────────────────┘
       │
       ▼
┌────────────────────────┐
│ generate_correction()  │
│ Input: all above       │
│ Output: correction     │
└──────┬─────────────────┘
       │
       ▼
┌────────────────────────┐
│ results{}              │
│ - Append all outputs   │
└──────┬─────────────────┘
       │
       ▼  (After all sentences)
┌────────────────────────┐
│ printEvaluationResults │
└────────────────────────┘
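The flow in the diagram can be sketched as a plain loop. The four stage functions below are placeholders with assumed signatures, and the keys of the results dictionary are hypothetical; only the order of the stages and the line-count check are taken from the diagram.

```python
from typing import Dict, List, Tuple


def error_detection(src: str, tgt: str) -> List[str]:
    return []  # placeholder: a real stage would query the LLM


def scoring(src: str, tgt: str, errors: List[str]) -> Tuple[float, float, float]:
    return 100.0, 100.0, 100.0  # placeholder: fluency, adequacy, overall


def generate_explanation(src, tgt, errors, scores) -> str:
    return "No errors detected."  # placeholder


def generate_correction(src, tgt, errors, explanation) -> str:
    return tgt  # placeholder: nothing to fix, so return the target unchanged


def evaluate(src_lines: List[str], tgt_lines: List[str],
             src_lang: str, tgt_lang: str) -> Dict[str, list]:
    # Validate counts before any model work is done
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    results: Dict[str, list] = {
        "errors": [], "scores": [], "explanations": [], "corrections": [],
    }
    for src, tgt in zip(src_lines, tgt_lines):
        # Add language tags, e.g. "[EN] Good morning!"
        src = f"[{src_lang.upper()}] {src.strip()}"
        tgt = f"[{tgt_lang.upper()}] {tgt.strip()}"
        # Each stage receives the outputs of the stages before it
        errors = error_detection(src, tgt)
        scores = scoring(src, tgt, errors)
        explanation = generate_explanation(src, tgt, errors, scores)
        correction = generate_correction(src, tgt, errors, explanation)
        results["errors"].append(errors)
        results["scores"].append(scores)
        results["explanations"].append(explanation)
        results["corrections"].append(correction)
    return results


results = evaluate(["Good morning!"], ["Maayong buntag!"], "en", "ceb")
print(results["corrections"])  # ['[CEB] Maayong buntag!']
```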

Model Management

Model Loading Strategy

WiMarka uses lazy loading and caching:

  1. First Request: Model downloaded from the Hugging Face Hub

  2. Subsequent Requests: Loaded from local cache

  3. Memory Management: Models loaded once per session
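The "loaded once per session" behavior can be illustrated with functools.lru_cache from the standard library. This is a sketch of the general technique, not the code in model.py; get_model and the dictionary standing in for a model object are hypothetical.

```python
from functools import lru_cache

load_calls = []  # records real loads so the caching is observable


@lru_cache(maxsize=None)
def get_model(model_id: str) -> dict:
    """Load a model on first request and reuse it for the rest of the session."""
    load_calls.append(model_id)
    return {"id": model_id}  # stand-in for an expensive model object


first = get_model("example/model")
second = get_model("example/model")  # served from the in-memory cache
print(first is second, len(load_calls))  # True 1
```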

Cache Location:

  • Windows: C:\Users\<username>\.cache\huggingface\

  • macOS/Linux: ~/.cache/huggingface/
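To locate the cache directory programmatically, a sketch such as the following can help. It assumes the default location listed above and honors the HF_HOME environment variable, which Hugging Face libraries use to relocate the cache; the helper name huggingface_cache_dir is hypothetical, and the real libraries may consult further settings.

```python
import os
from pathlib import Path


def huggingface_cache_dir() -> Path:
    """Return HF_HOME if set, else the default ~/.cache/huggingface."""
    hf_home = os.environ.get("HF_HOME")
    if hf_home:
        return Path(hf_home).expanduser()
    return Path.home() / ".cache" / "huggingface"


print(huggingface_cache_dir())
```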

Model Types

WiMarka utilizes two types of models:

  1. Transformer Models (via transformers library)

    • Used for: Text embeddings, classification

    • Format: PyTorch checkpoints

  2. LLM Models (via llama-cpp-python)

    • Used for: Error detection, scoring, explanation, correction

    • Format: GGUF quantized models

    • Benefits: Efficient CPU inference

Device Management

Automatic device selection:

# From torch.py
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

  • CUDA GPU if available

  • Falls back to CPU

  • No manual configuration needed

Performance Considerations

Optimization Strategies

  1. Model Caching

    • Models loaded once per session

    • Inference results cached when possible

    • Reduces redundant computation

  2. Sequential Processing

    • Sentences processed one at a time

    • Prevents memory overflow

    • Keeps the order of results deterministic

  3. Efficient File I/O

    • Streaming file reading

    • UTF-8 encoding handled properly

    • Minimal memory footprint
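Caching inference results usually means keying each response by the model and the exact prompt. The sketch below shows that idea with an in-memory dictionary; the ResponseCache class and its methods are illustrative assumptions and do not describe the actual cache.py. Reusing a cached response is only safe when the model is run deterministically (for example, with sampling disabled).

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """In-memory cache of model responses keyed by model id and prompt."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model_id: str, prompt: str) -> str:
        # Hash the pair so keys stay short even for long prompts
        return hashlib.sha256(f"{model_id}\x00{prompt}".encode("utf-8")).hexdigest()

    def get_or_compute(self, model_id: str, prompt: str,
                       compute: Callable[[str], str]) -> str:
        key = self._key(model_id, prompt)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(prompt)  # only runs on a cache miss
        return self._store[key]


cache = ResponseCache()
for _ in range(2):
    answer = cache.get_or_compute("example/model", "Score this pair", str.upper)
print(answer, cache.hits, cache.misses)  # SCORE THIS PAIR 1 1
```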

Bottlenecks

Identified performance bottlenecks:

  • Model Download: First-time model download can be slow

  • LLM Inference: CPU inference is slower than GPU inference

  • Large Files: Processing time scales linearly with file size

Scalability

Current limitations and future improvements:

Current:

  • Single-threaded processing

  • File-based input/output

  • In-memory results storage

Future Improvements:

  • Parallel sentence processing

  • Streaming API

  • Database integration for large-scale evaluation
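As an illustration of what parallel sentence processing might look like, the sketch below maps a per-pair evaluation function over a thread pool. This is not part of WiMarka today; evaluate_parallel and evaluate_pair are hypothetical names, and the approach assumes the underlying model call is safe to invoke from several threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def evaluate_parallel(
    pairs: Sequence[Tuple[str, str]],
    evaluate_pair: Callable[[str, str], T],
    max_workers: int = 4,
) -> List[T]:
    """Evaluate sentence pairs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the inputs were submitted
        return list(pool.map(lambda pair: evaluate_pair(*pair), pairs))


pairs = [("Good morning!", "Maayong buntag!"), ("Thank you.", "Salamat.")]
print(evaluate_parallel(pairs, lambda src, tgt: len(tgt)))  # [15, 8]
```

Because Executor.map preserves input order, the results line up with the input file just as they do with sequential processing.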

Error Handling

Error Handling Strategy

WiMarka implements defensive programming:

  1. Input Validation

    • File existence checks

    • Line count validation

    • Language code verification

  2. Graceful Degradation

    • Model loading failures logged

    • Fallback mechanisms where possible

  3. Informative Errors

    • Clear error messages

    • Actionable suggestions

Exception Hierarchy

# Common exceptions
FileNotFoundError  # Input files missing
ValueError         # Invalid arguments, mismatched line counts
RuntimeError       # Model loading failures
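A minimal sketch of input validation that raises the first two exceptions listed above. The function name load_sentence_pairs and the error messages are assumptions for illustration; only the checks themselves (file existence, matching line counts, UTF-8 input) come from this document.

```python
from pathlib import Path
from typing import List, Tuple


def load_sentence_pairs(src_path: str, tgt_path: str) -> List[Tuple[str, str]]:
    """Read two UTF-8 files and pair them line by line."""
    for path in (src_path, tgt_path):
        if not Path(path).is_file():
            raise FileNotFoundError(f"Input file not found: {path}")
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"Line count mismatch: {len(src_lines)} source vs "
            f"{len(tgt_lines)} target lines"
        )
    return list(zip(src_lines, tgt_lines))
```

Failing before any model is loaded keeps the feedback loop short when an input file is wrong.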

Logging

Logging Architecture

Logging uses the standard severity levels:

# From logger.py
logger.info("Starting evaluation...")      # Progress
logger.warning("Model cache miss")         # Warnings
logger.error("Failed to load model")       # Errors
logger.debug("Intermediate result: ...")   # Debugging

Log Levels:

  • INFO: Progress and status updates

  • WARNING: Non-critical issues

  • ERROR: Failures and exceptions

  • DEBUG: Detailed debugging information (disabled by default)
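The level behavior described above can be reproduced with Python's standard logging module. This is a sketch only; the get_logger helper, the logger name "wimarka", and the message format are assumptions, not the contents of logger.py.

```python
import logging


def get_logger(name: str = "wimarka", debug: bool = False) -> logging.Logger:
    """Return a logger at INFO level by default, DEBUG when requested."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = get_logger()
logger.info("Starting evaluation...")     # emitted
logger.debug("Intermediate result: ...")  # suppressed at the default INFO level
```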

Configuration Management

Configuration Strategy

WiMarka uses config.py for centralized configuration:

  • Model paths and identifiers

  • Hyperparameters

  • Default settings

  • API endpoints

Design Principle: Configuration separate from code allows easy customization without modifying source.
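A centralized configuration of this kind could look like the sketch below. Every field name and value here is a placeholder chosen for illustration (including the model identifier), and tag_sentence is a hypothetical helper; the actual contents of config.py are not shown in this document.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Config:
    # Placeholder values, not WiMarka's real settings
    llm_model_id: str = "example-org/example-model-gguf"
    max_tokens: int = 256
    temperature: float = 0.0
    language_tags: Dict[str, str] = field(
        default_factory=lambda: {"en": "[EN]", "ceb": "[CEB]"}
    )


DEFAULT_CONFIG = Config()


def tag_sentence(sentence: str, lang: str, config: Config = DEFAULT_CONFIG) -> str:
    """Prefix a sentence with its language tag, rejecting unknown codes."""
    try:
        tag = config.language_tags[lang.lower()]
    except KeyError:
        raise ValueError(f"Unsupported language code: {lang}") from None
    return f"{tag} {sentence}"


print(tag_sentence("Good morning!", "en"))  # [EN] Good morning!
```

Keeping the language tags in one place means adding a language is a one-line change to the configuration rather than an edit to the pipeline code.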

Extensibility Points

WiMarka is designed for extension:

  1. New Languages

    • Add language codes to config.py

    • Update helper functions for new tags

    • Train/add language-specific models

  2. New Tasks

    • Create new module in tasks/

    • Integrate into main.py pipeline

    • Update results dictionary structure

  3. New Models

    • Add model identifiers to config.py

    • Update model.py loading logic

    • Ensure compatibility with existing interfaces

  4. Alternative Interfaces

    • Web API wrapper

    • GUI application

    • Integration with other tools

See Extending WiMarka for detailed guides.

Design Patterns

Patterns Used in WiMarka

  1. Pipeline Pattern

    • Sequential task execution

    • Each stage processes and passes data

    • Clear separation of concerns

  2. Lazy Initialization

    • Models loaded on first use

    • Reduces startup time

    • Efficient resource usage

  3. Facade Pattern

    • wmk_eval() provides simple interface

    • Complex pipeline hidden from users

    • Easy to use, hard to misuse

  4. Singleton Pattern

    • Global results dictionary

    • Logger instance

    • Model cache

Trade-offs

Simplicity vs. Flexibility:

  • Current: Simple API, less configuration

  • Trade-off: Limited customization options

Speed vs. Accessibility:

  • Current: CPU inference for accessibility

  • Trade-off: Slower than GPU-optimized solutions

Memory vs. Speed:

  • Current: Sequential processing

  • Trade-off: Slower but memory-efficient

Future Architecture Improvements

Planned Enhancements

  1. Asynchronous Processing

    • Non-blocking I/O

    • Parallel sentence evaluation

    • Progress callbacks

  2. Streaming API

    • Process large files efficiently

    • Real-time results

    • Lower memory usage

  3. Plugin System

    • Third-party task modules

    • Custom scoring algorithms

    • Community extensions

  4. Distributed Evaluation

    • Multi-machine processing

    • Cloud deployment

    • Horizontal scaling

References

Next Steps