Supported Languages
===================

WiMarka provides specialized support for evaluating machine translations involving English and three major Philippine languages.

Overview
--------

WiMarka is designed to evaluate translations **from English** to Philippine languages, with specialized models trained on linguistic patterns specific to each language pair.

Language Codes
--------------

.. list-table::
   :header-rows: 1
   :widths: 15 25 20 40

   * - Code
     - Language
     - Native Name
     - Primary Use
   * - ``EN``
     - English
     - English
     - Source language
   * - ``CEB``
     - Cebuano
     - Binisaya
     - Target language
   * - ``ILO``
     - Ilocano
     - Ilokano
     - Target language
   * - ``TGT``
     - Tagalog
     - Tagalog/Filipino
     - Target language

Supported Language Pairs
-------------------------

Currently Supported
~~~~~~~~~~~~~~~~~~~

WiMarka officially supports the following translation directions:

1. **English → Cebuano** (EN → CEB)
2. **English → Ilocano** (EN → ILO)
3. **English → Tagalog** (EN → TGT)

Example Usage
^^^^^^^^^^^^^

.. code-block:: python

   # English to Cebuano
   wmk_eval('en.txt', 'EN', 'ceb.txt', 'CEB')
   
   # English to Ilocano
   wmk_eval('en.txt', 'EN', 'ilo.txt', 'ILO')
   
   # English to Tagalog
   wmk_eval('en.txt', 'EN', 'tgt.txt', 'TGT')
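
When the same English source is evaluated against several targets, the three calls above can be driven from one mapping. The following is a minimal sketch, not part of WiMarka: ``evaluate_all`` and ``TARGET_FILES`` are hypothetical names, and the evaluator is passed in as a parameter on the assumption that it takes the four positional arguments shown above. The return value of ``wmk_eval`` is not described in this section, so it is stored unchanged.

.. code-block:: python

   def evaluate_all(evaluate, source_path, targets):
       """Call evaluate(source_path, 'EN', target_path, code) once per target.

       `evaluate` is expected to behave like wmk_eval as used above;
       `targets` maps a target language code to its translation file.
       """
       return {
           code: evaluate(source_path, "EN", target_path, code)
           for code, target_path in targets.items()
       }


   # Hypothetical file names, matching the example above.
   TARGET_FILES = {"CEB": "ceb.txt", "ILO": "ilo.txt", "TGT": "tgt.txt"}

   # With WiMarka imported in your environment:
   # results = evaluate_all(wmk_eval, "en.txt", TARGET_FILES)

Keeping the code-to-file mapping in one place also makes it harder to pair a file with the wrong language code.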

Not Currently Supported
~~~~~~~~~~~~~~~~~~~~~~~

The following translation directions are **not yet supported**:

* Philippine language → English (reverse direction)
* Philippine language → Philippine language (e.g., CEB → TGT)
* Multilingual evaluations

.. note::
   Support for additional language pairs may be added in future versions.
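
The supported and unsupported directions listed above can be checked before an evaluation is started. This is a sketch under stated assumptions: ``is_supported_pair`` is a hypothetical helper, not a WiMarka function, and it compares codes exactly as written in the Language Codes table (upper case).

.. code-block:: python

   # Codes come from the Language Codes table; the helper is illustrative only.
   SOURCE_CODES = {"EN"}
   TARGET_CODES = {"CEB", "ILO", "TGT"}


   def is_supported_pair(source_code: str, target_code: str) -> bool:
       """Return True only for EN -> CEB, EN -> ILO, and EN -> TGT."""
       return source_code in SOURCE_CODES and target_code in TARGET_CODES


   print(is_supported_pair("EN", "CEB"))   # supported direction
   print(is_supported_pair("CEB", "EN"))   # reverse direction
   print(is_supported_pair("CEB", "TGT"))  # Philippine -> Philippine

Only the first call reports a supported pair; the other two correspond to the directions listed as not yet supported.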

Language Details
----------------

English (EN)
~~~~~~~~~~~~

**Role**: Source language

**Usage**: 
   * Typically used as the source text in evaluations
   * Can also be used as target for reverse translation evaluation (future feature)

**Characteristics**:
   * Subject-Verb-Object (SVO) word order
   * Rich vocabulary with Latin and Germanic roots
   * Minimal inflection

Cebuano (CEB)
~~~~~~~~~~~~~

**Also Known As**: Bisaya, Binisaya, Sugbuanon

**Speakers**: ~27 million (primarily in Visayas and Mindanao, Philippines)

**Usage in WiMarka**:
   * Target language for EN → CEB evaluations
   * Specialized error detection for Cebuano grammar

**Linguistic Characteristics**:
   * Typically Verb-Subject-Object (VSO) word order
   * Extensive use of affixes (prefixes, infixes, suffixes)
   * Aspect-focused verb system
   * Linker particles (e.g., "nga")

**Example Translation**:

.. code-block:: text

   EN:  Good morning! How are you?
   CEB: Maayong buntag! Kumusta ka?

**Common Patterns WiMarka Checks**:
   * Proper use of aspect markers
   * Correct linker usage
   * Appropriate formality levels

Ilocano (ILO)
~~~~~~~~~~~~~

**Also Known As**: Ilokano, Iloko

**Speakers**: ~9 million (primarily in Northern Luzon, Philippines)

**Usage in WiMarka**:
   * Target language for EN → ILO evaluations
   * Specialized error detection for Ilocano syntax

**Linguistic Characteristics**:
   * Verb-initial word order
   * Extensive verbal morphology
   * Case marking system (nominative, genitive, oblique)
   * Reduplication for intensification and plurality

**Example Translation**:

.. code-block:: text

   EN:  Good morning! How are you?
   ILO: Naimbag a bigat! Kumusta kan?

**Common Patterns WiMarka Checks**:
   * Case marker usage (ti/iti/dagiti)
   * Verbal affix correctness
   * Proper ligature ("a") usage

Tagalog (TGT)
~~~~~~~~~~~~~

**Also Known As**: Filipino (the standardized form of Tagalog and the national language of the Philippines)

**Speakers**: ~82 million (L1: ~29 million, L2: ~53 million)

**Usage in WiMarka**:
   * Target language for EN → TGT evaluations
   * Specialized error detection for Tagalog/Filipino

**Linguistic Characteristics**:
   * Typically Verb-Subject-Object (VSO) word order
   * Complex focus system (actor, object, locative, benefactive)
   * Rich aspectual system
   * Linker particles ("na", "-ng")

**Example Translation**:

.. code-block:: text

   EN:  Good morning! How are you?
   TGT: Magandang umaga! Kumusta ka?

**Common Patterns WiMarka Checks**:
   * Focus marking correctness
   * Aspect and mood markers
   * Proper use of enclitics and proclitics

Language-Specific Considerations
---------------------------------

Script and Encoding
~~~~~~~~~~~~~~~~~~~

All supported languages use the **Latin alphabet** with the following considerations:

* **Encoding**: All files must be UTF-8 encoded
* **Diacritics**: Rare in modern usage, but supported
* **Special Characters**: Standard ASCII characters recommended

Example of proper encoding:

.. code-block:: bash

   # Check file encoding (-I on macOS/BSD; on Linux use lowercase: file -i filename.txt)
   file -I filename.txt
   # Should show: text/plain; charset=utf-8
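
The same check can be done from Python before files are handed to WiMarka. The sketch below uses only the standard library; ``is_valid_utf8`` and ``non_ascii_lines`` are hypothetical helper names, not part of WiMarka. The first confirms that a file decodes as UTF-8, and the second lists lines with non-ASCII characters (for example diacritics) so they can be reviewed.

.. code-block:: python

   from pathlib import Path


   def is_valid_utf8(path: str) -> bool:
       """Return True if the entire file decodes as UTF-8."""
       try:
           Path(path).read_bytes().decode("utf-8")
       except UnicodeDecodeError:
           return False
       return True


   def non_ascii_lines(path: str) -> list[int]:
       """Return 1-based numbers of lines that contain non-ASCII characters.

       Assumes the file has already passed is_valid_utf8().
       """
       with open(path, encoding="utf-8") as handle:
           return [
               number
               for number, line in enumerate(handle, start=1)
               if not line.isascii()
           ]

A file that fails ``is_valid_utf8`` should be re-encoded before evaluation. Lines reported by ``non_ascii_lines`` are not errors in themselves, since diacritics are supported.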

Formality and Register
~~~~~~~~~~~~~~~~~~~~~~

WiMarka evaluates translations considering appropriate formality levels:

**Cebuano**:
   * Formal: "Kumusta kamo?" (you, plural/formal)
   * Informal: "Kumusta ka?" (you, singular/informal)

**Ilocano**:
   * Formal: "Kumusta kayo?"
   * Informal: "Kumusta ka?"

**Tagalog**:
   * Formal: "Kumusta po kayo?"
   * Informal: "Kumusta ka?"

Regional Variations
~~~~~~~~~~~~~~~~~~~

Philippine languages have regional variations:

**Cebuano**:
   * Urban Cebu dialect (standard reference)
   * Boholano variant
   * Mindanao variants

**Ilocano**:
   * Northern Ilocos dialect (standard reference)
   * Southern variations

**Tagalog**:
   * Manila dialect (basis for Filipino)
   * Provincial variations

.. note::
   WiMarka's models are trained primarily on standard/prestige dialects but may recognize common regional variations.

Code-Switching and Borrowings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Philippine languages frequently incorporate English loanwords and code-switching:

**Acceptable**:

.. code-block:: text

   EN:  I will send you an email.
   TGT: Magpapadala ako sa iyo ng email.
   # "email" is an accepted loanword

**WiMarka's Approach**:
   * Common English loanwords are recognized
   * Excessive code-switching may lower fluency scores
   * Technical terms in English are usually acceptable

Spelling Conventions
~~~~~~~~~~~~~~~~~~~~

WiMarka recognizes multiple valid spelling conventions:

**Example (Cebuano)**:
   * "maayo" / "maayong" (good)
   * "karon" / "karun" (now)

**Example (Tagalog)**:
   * "rin" / "din" (also)
   * "ko" / "ng" variations

Future Language Support
-----------------------

Potential Future Additions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Languages under consideration for future support:

* **Hiligaynon** (Ilonggo) - ~9 million speakers
* **Waray** - ~3.6 million speakers
* **Kapampangan** - ~2.9 million speakers
* **Pangasinan** - ~1.5 million speakers
* **Bikol** - ~2.5 million speakers

Reverse Translation Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Future versions may support:

* Philippine languages → English evaluation
* Bidirectional quality assessment

Inter-Philippine Translation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Potential support for:

* CEB ↔ TGT
* ILO ↔ TGT
* Other Philippine language pairs

Language Resources
------------------

For more information about Philippine languages:

* `Komisyon sa Wikang Filipino <http://www.kwf.gov.ph/>`_ - Official language commission
* `Ethnologue - Languages of the Philippines <https://www.ethnologue.com/country/PH>`_
* SIL Philippines - Language documentation

Best Practices
--------------

Choosing the Right Language Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. **Verify the actual language** of your text
2. **Use consistent codes** across your evaluation pipeline
3. **Consider dialectal variation** in your source material

Handling Multilingual Content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your text contains multiple languages:

.. code-block:: text

   # Not ideal - mixed languages in one file
   EN: Hello, how are you?
   EN: Kumusta ka? [This is actually Tagalog]

   # Better - separate by actual language
   EN: Hello, how are you?
   TGT: Kumusta ka?
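
If segments are stored with a language label, as in the illustration above, they can be grouped by label before being written to separate files. This is only a sketch: the ``CODE: text`` line format is taken from the illustration rather than from a WiMarka file format, and ``split_by_label`` is a hypothetical helper. It trusts the labels, so a mislabeled line (such as the Tagalog sentence marked ``EN`` above) still has to be corrected by hand.

.. code-block:: python

   from collections import defaultdict


   def split_by_label(lines):
       """Group 'CODE: text' lines by the code before the first colon."""
       groups = defaultdict(list)
       for line in lines:
           code, separator, text = line.partition(":")
           if not separator:
               continue  # skip lines that carry no label
           groups[code.strip()].append(text.strip())
       return dict(groups)


   mixed = ["EN: Hello, how are you?", "TGT: Kumusta ka?"]
   for code, segments in split_by_label(mixed).items():
       print(code, segments)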

Quality of Input Texts
~~~~~~~~~~~~~~~~~~~~~~

For best results:

* Use native speaker translations when possible
* Ensure proper grammar in source texts
* Maintain consistent terminology
* Avoid excessive code-switching

Troubleshooting
---------------

Language Detection Issues
~~~~~~~~~~~~~~~~~~~~~~~~~

If WiMarka produces unexpected results:

1. **Verify language codes** match actual content
2. **Check for mixed languages** in files
3. **Ensure proper encoding** (UTF-8)

Low Scores Despite Good Translation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Possible causes:

* Regional dialect differences from training data
* Non-standard spelling variations
* Excessive code-switching or loanwords

See :doc:`output_format` for interpreting scores and :doc:`examples` for language-specific examples.