Supported Languages
===================

WiMarka provides specialized support for evaluating machine translations involving English and three major Philippine languages.

Overview
--------

WiMarka is designed to evaluate translations **from English** to Philippine languages, with specialized models trained on linguistic patterns specific to each language pair.

Language Codes
--------------

.. list-table::
   :header-rows: 1
   :widths: 15 25 20 40

   * - Code
     - Language
     - Native Name
     - Primary Use
   * - ``EN``
     - English
     - English
     - Source language
   * - ``CEB``
     - Cebuano
     - Binisaya
     - Target language
   * - ``ILO``
     - Ilocano
     - Ilokano
     - Target language
   * - ``TGT``
     - Tagalog
     - Tagalog/Filipino
     - Target language

Supported Language Pairs
-------------------------

Currently Supported
~~~~~~~~~~~~~~~~~~~

WiMarka officially supports the following translation directions:

1. **English → Cebuano** (EN → CEB)
2. **English → Ilocano** (EN → ILO)
3. **English → Tagalog** (EN → TGT)

Example Usage
^^^^^^^^^^^^^

.. code-block:: python

   # English to Cebuano
   wmk_eval('en.txt', 'EN', 'ceb.txt', 'CEB')

   # English to Ilocano
   wmk_eval('en.txt', 'EN', 'ilo.txt', 'ILO')

   # English to Tagalog
   wmk_eval('en.txt', 'EN', 'tgt.txt', 'TGT')

Not Currently Supported
~~~~~~~~~~~~~~~~~~~~~~~

The following translation directions are **not yet supported**:

* Philippine language → English (reverse direction)
* Philippine language → Philippine language (e.g., CEB → TGT)
* Multilingual evaluations

.. note::

   Support for additional language pairs may be added in future versions.

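
Because only the three English-source directions listed above are supported, it can help to reject other pairs before starting an evaluation. The following is a minimal sketch of such a pre-flight check. ``check_language_pair`` and the two code sets are hypothetical helpers written for illustration, not part of WiMarka, and normalizing codes to upper case is an assumption about how you want to treat user input.

```python
# Hypothetical pre-flight check; not part of WiMarka's documented API.
# The code sets mirror the "Currently Supported" list above.
SOURCE_CODES = {"EN"}
TARGET_CODES = {"CEB", "ILO", "TGT"}


def check_language_pair(source: str, target: str) -> tuple[str, str]:
    """Return the normalized (source, target) pair or raise ValueError."""
    # Assumption: treat codes case-insensitively and emit upper case.
    src = source.strip().upper()
    tgt = target.strip().upper()
    if src not in SOURCE_CODES or tgt not in TARGET_CODES:
        raise ValueError(
            f"Unsupported direction {src} -> {tgt}; "
            "supported: EN -> CEB, EN -> ILO, EN -> TGT"
        )
    return src, tgt
```

A caller would run ``check_language_pair('EN', 'CEB')`` first and pass the returned codes on to ``wmk_eval``. A reverse pair such as ``('CEB', 'EN')`` raises ``ValueError`` before any evaluation starts.
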

Language Details
----------------

English (EN)
~~~~~~~~~~~~

**Role**: Source language

**Usage**:

* Typically used as the source text in evaluations
* Can also be used as target for reverse translation evaluation (future feature)

**Characteristics**:

* Subject-Verb-Object (SVO) word order
* Rich vocabulary with Latin and Germanic roots
* Minimal inflection

Cebuano (CEB)
~~~~~~~~~~~~~

**Also Known As**: Bisaya, Binisaya, Sugbuanon

**Speakers**: ~27 million (primarily in Visayas and Mindanao, Philippines)

**Usage in WiMarka**:

* Target language for EN → CEB evaluations
* Specialized error detection for Cebuano grammar

**Linguistic Characteristics**:

* Verb-Subject-Object (VSO) word order typical
* Extensive use of affixes (prefixes, infixes, suffixes)
* Aspect-focused verb system
* Linker particles (e.g., "nga")

**Example Translation**:

.. code-block:: text

   EN: Good morning! How are you?
   CEB: Maayong buntag! Kumusta ka?

**Common Patterns WiMarka Checks**:

* Proper use of aspect markers
* Correct linker usage
* Appropriate formality levels

Ilocano (ILO)
~~~~~~~~~~~~~

**Also Known As**: Ilokano, Iloko

**Speakers**: ~9 million (primarily in Northern Luzon, Philippines)

**Usage in WiMarka**:

* Target language for EN → ILO evaluations
* Specialized error detection for Ilocano syntax

**Linguistic Characteristics**:

* Verb-initial word order
* Extensive verbal morphology
* Case marking system (nominative, genitive, oblique)
* Reduplication for intensification and plurality

**Example Translation**:

.. code-block:: text

   EN: Good morning! How are you?
   ILO: Naimbag a bigat! Kumusta kan?


**Common Patterns WiMarka Checks**:

* Case marker usage (ti/iti/dagiti)
* Verbal affix correctness
* Proper ligature ("a") usage

Tagalog (TGT)
~~~~~~~~~~~~~

**Also Known As**: Filipino (official language of the Philippines)

**Speakers**: ~82 million (L1: ~29 million, L2: ~53 million)

**Usage in WiMarka**:

* Target language for EN → TGT evaluations
* Specialized error detection for Tagalog/Filipino

**Linguistic Characteristics**:

* Verb-Subject-Object (VSO) word order typical
* Complex focus system (actor, object, locative, benefactive)
* Rich aspectual system
* Linker particles ("na", "-ng")

**Example Translation**:

.. code-block:: text

   EN: Good morning! How are you?
   TGT: Magandang umaga! Kumusta ka?

**Common Patterns WiMarka Checks**:

* Focus marking correctness
* Aspect and mood markers
* Proper use of enclitics and proclitics

Language-Specific Considerations
---------------------------------

Script and Encoding
~~~~~~~~~~~~~~~~~~~

All supported languages use the **Latin alphabet** with the following considerations:

* **Encoding**: All files must be UTF-8 encoded
* **Diacritics**: Rare in modern usage, but supported
* **Special Characters**: Standard ASCII characters recommended

Example of checking a file's encoding:

.. code-block:: bash

   # Check file encoding (-I is the macOS flag; on Linux use: file -i)
   file -I filename.txt
   # Should show: text/plain; charset=utf-8

Formality and Register
~~~~~~~~~~~~~~~~~~~~~~

WiMarka evaluates translations considering appropriate formality levels:

**Cebuano**:

* Formal: "Kumusta kamo?" (you, plural/formal)
* Informal: "Kumusta ka?" (you, singular/informal)

**Ilocano**:

* Formal: "Kumusta kayo?"
* Informal: "Kumusta ka?"

**Tagalog**:

* Formal: "Kumusta po kayo?"
* Informal: "Kumusta ka?"


Regional Variations
~~~~~~~~~~~~~~~~~~~

Philippine languages have regional variations:

**Cebuano**:

* Urban Cebu dialect (standard reference)
* Boholano variant
* Mindanao variants

**Ilocano**:

* Northern Ilocos dialect (standard reference)
* Southern variations

**Tagalog**:

* Manila dialect (basis for Filipino)
* Provincial variations

.. note::

   WiMarka's models are trained primarily on standard/prestige dialects but may recognize common regional variations.

Code-Switching and Borrowings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Philippine languages frequently incorporate English loanwords and code-switching:

**Acceptable**:

.. code-block:: text

   EN: I will send you an email.
   TGT: Magpapadala ako sa iyo ng email.
   # "email" is an accepted loanword

**WiMarka's Approach**:

* Common English loanwords are recognized
* Excessive code-switching may lower fluency scores
* Technical terms in English are usually acceptable

Spelling Conventions
~~~~~~~~~~~~~~~~~~~~

WiMarka recognizes multiple valid spelling conventions:

**Example (Cebuano)**:

* "maayo" / "maayong" (good)
* "karon" / "karun" (now)

**Example (Tagalog)**:

* "rin" / "din" (also)
* "ko" / "ng" variations

Future Language Support
-----------------------

Potential Future Additions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Languages under consideration for future support:

* **Hiligaynon** (Ilonggo) - ~9 million speakers
* **Waray** - ~3.6 million speakers
* **Kapampangan** - ~2.9 million speakers
* **Pangasinan** - ~1.5 million speakers
* **Bikol** - ~2.5 million speakers

Reverse Translation Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Future versions may support:

* Philippine languages → English evaluation
* Bidirectional quality assessment

Inter-Philippine Translation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Potential support for:

* CEB ↔ TGT
* ILO ↔ TGT
* Other Philippine language pairs

Language Resources
------------------

For more information about Philippine languages:

* `Komisyon sa Wikang Filipino `_ - Official language commission
* `Ethnologue - Languages
  of the Philippines `_
* SIL Philippines - Language documentation

Best Practices
--------------

Choosing the Right Language Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. **Verify the actual language** of your text
2. **Use consistent codes** across your evaluation pipeline
3. **Consider dialectal variation** in your source material

Handling Multilingual Content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your text contains multiple languages:

.. code-block:: text

   # Not ideal - mixed languages in one file
   EN: Hello, how are you?
   EN: Kumusta ka? [This is actually Tagalog]

   # Better - separate by actual language
   EN: Hello, how are you?
   TGT: Kumusta ka?

Quality of Input Texts
~~~~~~~~~~~~~~~~~~~~~~

For best results:

* Use native speaker translations when possible
* Ensure proper grammar in source texts
* Maintain consistent terminology
* Avoid excessive code-switching

Troubleshooting
---------------

Language Detection Issues
~~~~~~~~~~~~~~~~~~~~~~~~~

If WiMarka produces unexpected results:

1. **Verify language codes** match actual content
2. **Check for mixed languages** in files
3. **Ensure proper encoding** (UTF-8)

Low Scores Despite Good Translation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Possible causes:

* Regional dialect differences from training data
* Non-standard spelling variations
* Excessive code-switching or loanwords

See :doc:`output_format` for interpreting scores and :doc:`examples` for language-specific examples.
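
The encoding step in the troubleshooting list can also be checked from Python before running an evaluation. The following is a minimal sketch using only the standard library. ``is_utf8`` is a hypothetical helper written for illustration, not a WiMarka function.

```python
from pathlib import Path


def is_utf8(path: str) -> bool:
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    try:
        # Read raw bytes and attempt a strict UTF-8 decode.
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Running this on both the source and target files before an evaluation rules out encoding as a cause of unexpected results. Pure ASCII files pass, since ASCII is a subset of UTF-8.
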