Supported Languages
===================
WiMarka provides specialized support for evaluating machine translations involving English and three major Philippine languages.
Overview
--------
WiMarka is designed to evaluate translations **from English** to Philippine languages, with specialized models trained on linguistic patterns specific to each language pair.
Language Codes
--------------
.. list-table::
   :header-rows: 1
   :widths: 15 25 20 40

   * - Code
     - Language
     - Native Name
     - Primary Use
   * - ``EN``
     - English
     - English
     - Source language
   * - ``CEB``
     - Cebuano
     - Binisaya
     - Target language
   * - ``ILO``
     - Ilocano
     - Ilokano
     - Target language
   * - ``TGT``
     - Tagalog
     - Tagalog/Filipino
     - Target language
Supported Language Pairs
-------------------------
Currently Supported
~~~~~~~~~~~~~~~~~~~
WiMarka officially supports the following translation directions:
1. **English → Cebuano** (EN → CEB)
2. **English → Ilocano** (EN → ILO)
3. **English → Tagalog** (EN → TGT)
Example Usage
^^^^^^^^^^^^^
.. code-block:: python
# English to Cebuano
wmk_eval('en.txt', 'EN', 'ceb.txt', 'CEB')
# English to Ilocano
wmk_eval('en.txt', 'EN', 'ilo.txt', 'ILO')
# English to Tagalog
wmk_eval('en.txt', 'EN', 'tgt.txt', 'TGT')
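The three calls above can also be driven from one loop. The following is a minimal sketch, assuming only the four-argument call shape shown in the example; ``evaluate_all`` and ``TARGETS`` are hypothetical names, and the evaluator is passed in as a parameter because this page does not show how ``wmk_eval`` is imported or what it returns.

```python
# Target files and codes taken from the example above.
TARGETS = [("ceb.txt", "CEB"), ("ilo.txt", "ILO"), ("tgt.txt", "TGT")]


def evaluate_all(evaluate, source_file="en.txt"):
    """Run one evaluation per supported target language.

    `evaluate` is any callable with the same argument order as the
    wmk_eval calls above: (source_file, source_code, target_file, target_code).
    Whatever it returns is stored unchanged, keyed by target code.
    """
    results = {}
    for target_file, target_code in TARGETS:
        results[target_code] = evaluate(source_file, "EN", target_file, target_code)
    return results
```

Calling ``evaluate_all(wmk_eval)`` would then issue the same three evaluations as the example, provided ``wmk_eval`` is already available in scope.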
Not Currently Supported
~~~~~~~~~~~~~~~~~~~~~~~
The following translation directions are **not yet supported**:
* Philippine language → English (reverse direction)
* Philippine language → Philippine language (e.g., CEB → TGT)
* Multilingual evaluations
.. note::
Support for additional language pairs may be added in future versions.
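Because only English-source directions are supported, it can help to reject other directions before starting an evaluation. The sketch below is a hypothetical pre-check, not part of WiMarka; ``SUPPORTED_PAIRS`` and ``check_pair`` are illustrative names, and the uppercase normalization simply follows the codes as written on this page.

```python
from typing import Tuple

# Supported directions as listed on this page: English source only.
SUPPORTED_PAIRS = {("EN", "CEB"), ("EN", "ILO"), ("EN", "TGT")}


def check_pair(source_code: str, target_code: str) -> Tuple[str, str]:
    """Normalize two language codes and confirm the direction is supported.

    Hypothetical helper. Raises ValueError for any direction that is not
    in SUPPORTED_PAIRS, such as CEB -> EN or CEB -> TGT.
    """
    pair = (source_code.strip().upper(), target_code.strip().upper())
    if pair not in SUPPORTED_PAIRS:
        raise ValueError(f"Unsupported direction: {pair[0]} -> {pair[1]}")
    return pair
```

For example, ``check_pair("en", "ceb")`` returns ``("EN", "CEB")``, while ``check_pair("CEB", "EN")`` raises ``ValueError``.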
Language Details
----------------
English (EN)
~~~~~~~~~~~~
**Role**: Source language
**Usage**:
* Typically used as the source text in evaluations
* Not yet supported as a target language; reverse-direction evaluation is a possible future feature
**Characteristics**:
* Subject-Verb-Object (SVO) word order
* Rich vocabulary with Latin and Germanic roots
* Minimal inflection
Cebuano (CEB)
~~~~~~~~~~~~~
**Also Known As**: Bisaya, Binisaya, Sugbuanon
**Speakers**: ~27 million (primarily in Visayas and Mindanao, Philippines)
**Usage in WiMarka**:
* Target language for EN → CEB evaluations
* Specialized error detection for Cebuano grammar
**Linguistic Characteristics**:
* Verb-Subject-Object (VSO) word order typical
* Extensive use of affixes (prefixes, infixes, suffixes)
* Aspect-focused verb system
* Linker particles (e.g., "nga")
**Example Translation**:
.. code-block:: text
EN: Good morning! How are you?
CEB: Maayong buntag! Kumusta ka?
**Common Patterns WiMarka Checks**:
* Proper use of aspect markers
* Correct linker usage
* Appropriate formality levels
Ilocano (ILO)
~~~~~~~~~~~~~
**Also Known As**: Ilokano, Iloko
**Speakers**: ~9 million (primarily in Northern Luzon, Philippines)
**Usage in WiMarka**:
* Target language for EN → ILO evaluations
* Specialized error detection for Ilocano syntax
**Linguistic Characteristics**:
* Verb-initial word order
* Extensive verbal morphology
* Case marking system (nominative, genitive, oblique)
* Reduplication for intensification and plurality
**Example Translation**:
.. code-block:: text
EN: Good morning! How are you?
ILO: Naimbag a bigat! Kumusta kan?
**Common Patterns WiMarka Checks**:
* Case marker usage (ti/iti/dagiti)
* Verbal affix correctness
* Proper ligature ("a") usage
Tagalog (TGT)
~~~~~~~~~~~~~
**Also Known As**: Filipino (official language of the Philippines)
**Speakers**: ~82 million (L1: ~29 million, L2: ~53 million)
**Usage in WiMarka**:
* Target language for EN → TGT evaluations
* Specialized error detection for Tagalog/Filipino
**Linguistic Characteristics**:
* Verb-Subject-Object (VSO) word order typical
* Complex focus system (actor, object, locative, benefactive)
* Rich aspectual system
* Linker particles ("na", "-ng")
**Example Translation**:
.. code-block:: text
EN: Good morning! How are you?
TGT: Magandang umaga! Kumusta ka?
**Common Patterns WiMarka Checks**:
* Focus marking correctness
* Aspect and mood markers
* Proper use of enclitics and proclitics
Language-Specific Considerations
---------------------------------
Script and Encoding
~~~~~~~~~~~~~~~~~~~
All supported languages use the **Latin alphabet** with the following considerations:
* **Encoding**: All files must be UTF-8 encoded
* **Diacritics**: Rare in modern usage, but supported
* **Special Characters**: Standard ASCII characters recommended
Example of proper encoding:
.. code-block:: bash
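# Check file encoding (macOS/BSD syntax; on Linux use lowercase: file -i filename.txt)
file -I filename.txt
# Should show: text/plain; charset=utf-8
# (charset=us-ascii is also fine, since ASCII is a subset of UTF-8)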
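The same check can be done from Python, which avoids platform differences in the ``file`` command. This is a minimal sketch using only the standard library; ``is_utf8`` and ``non_ascii_chars`` are hypothetical helper names and are not part of WiMarka.

```python
from pathlib import Path


def is_utf8(path: str) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_ascii_chars(text: str) -> set:
    """Collect characters outside ASCII, e.g. diacritics or smart quotes."""
    return {ch for ch in text if ord(ch) > 127}
```

``non_ascii_chars`` is only a convenience for spotting characters worth a second look; diacritics are supported, so a non-empty result is not an error by itself.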
Formality and Register
~~~~~~~~~~~~~~~~~~~~~~
WiMarka evaluates translations considering appropriate formality levels:
**Cebuano**:
* Formal: "Kumusta kamo?" (you, plural/formal)
* Informal: "Kumusta ka?" (you, singular/informal)
**Ilocano**:
* Formal: "Kumusta kayo?"
* Informal: "Kumusta ka?"
**Tagalog**:
* Formal: "Kumusta po kayo?"
* Informal: "Kumusta ka?"
Regional Variations
~~~~~~~~~~~~~~~~~~~
Philippine languages have regional variations:
**Cebuano**:
* Urban Cebu dialect (standard reference)
* Boholano variant
* Mindanao variants
**Ilocano**:
* Northern Ilocos dialect (standard reference)
* Southern variations
**Tagalog**:
* Manila dialect (basis for Filipino)
* Provincial variations
.. note::
WiMarka's models are trained primarily on standard/prestige dialects but may recognize common regional variations.
Code-Switching and Borrowings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Philippine languages frequently incorporate English loanwords and code-switching:
**Acceptable**:
.. code-block:: text
EN: I will send you an email.
TGT: Magpapadala ako sa iyo ng email.
# "email" is an accepted loanword
**WiMarka's Approach**:
* Common English loanwords are recognized
* Excessive code-switching may lower fluency scores
* Technical terms in English are usually acceptable
Spelling Conventions
~~~~~~~~~~~~~~~~~~~~
WiMarka recognizes multiple valid spelling conventions:
**Example (Cebuano)**:
* "maayo" / "maayong" (good)
* "karon" / "karun" (now)
**Example (Tagalog)**:
* "rin" / "din" (also)
* "ko" / "ng" variations
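When comparing your own reference texts or terminology lists, it can be useful to map variant spellings to a single form first. The sketch below is an illustration only and does not describe how WiMarka handles variants internally; ``VARIANTS`` and ``normalize_spelling`` are hypothetical names, and the choice of which form to treat as canonical is arbitrary.

```python
import re

# Variant -> canonical form, using pairs mentioned in this section.
# Which form counts as canonical is an arbitrary choice for illustration.
VARIANTS = {"karun": "karon", "rin": "din"}


def normalize_spelling(text: str) -> str:
    """Lowercase the text and map known variant spellings to one form."""

    def swap(match):
        word = match.group(0)
        return VARIANTS.get(word, word)

    # [^\W\d_]+ matches whole runs of letters, so only complete words
    # are replaced and substrings inside longer words are left alone.
    return re.sub(r"[^\W\d_]+", swap, text.lower())
```

With this mapping, ``normalize_spelling("Karun na")`` and ``normalize_spelling("karon na")`` produce the same string, so the two spellings compare as equal.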
Future Language Support
-----------------------
Potential Future Additions
~~~~~~~~~~~~~~~~~~~~~~~~~~
Languages under consideration for future support:
* **Hiligaynon** (Ilonggo) - ~9 million speakers
* **Waray** - ~3.6 million speakers
* **Kapampangan** - ~2.9 million speakers
* **Pangasinan** - ~1.5 million speakers
* **Bikol** - ~2.5 million speakers
Reverse Translation Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Future versions may support:
* Philippine languages → English evaluation
* Bidirectional quality assessment
Inter-Philippine Translation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Potential support for:
* CEB ↔ TGT
* ILO ↔ TGT
* Other Philippine language pairs
Language Resources
------------------
For more information about Philippine languages:
* Komisyon sa Wikang Filipino - Official language commission
* Ethnologue - Languages of the Philippines
* SIL Philippines - Language documentation
Best Practices
--------------
Choosing the Right Language Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. **Verify the actual language** of your text
2. **Use consistent codes** across your evaluation pipeline
3. **Consider dialectal variation** in your source material
Handling Multilingual Content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If your text contains multiple languages:
.. code-block:: text
# Not ideal - mixed languages in one file
EN: Hello, how are you?
EN: Kumusta ka? [This is actually Tagalog]
# Better - separate by actual language
EN: Hello, how are you?
TGT: Kumusta ka?
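If your working data uses labelled lines like the illustration above, the separation step can be sketched as follows. The ``CODE: text`` line format is only the notation used in this example, not a WiMarka input format, and ``group_by_code`` is a hypothetical helper. It groups lines by their label but does not detect mislabelled text, so the first point under "Choosing the Right Language Code" still applies.

```python
from collections import defaultdict

# Codes from the Language Codes table on this page.
KNOWN_CODES = {"EN", "CEB", "ILO", "TGT"}


def group_by_code(lines):
    """Group 'CODE: text' lines by their language code.

    Lines without a known code prefix are collected under None for review.
    """
    groups = defaultdict(list)
    for line in lines:
        code, sep, text = line.partition(":")
        code = code.strip().upper()
        if sep and code in KNOWN_CODES:
            groups[code].append(text.strip())
        else:
            groups[None].append(line.strip())
    return dict(groups)
```

Each resulting list can then be written to its own file, one language per file, before evaluation.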
Quality of Input Texts
~~~~~~~~~~~~~~~~~~~~~~
For best results:
* Use native speaker translations when possible
* Ensure proper grammar in source texts
* Maintain consistent terminology
* Avoid excessive code-switching
Troubleshooting
---------------
Language Detection Issues
~~~~~~~~~~~~~~~~~~~~~~~~~
If WiMarka produces unexpected results:
1. **Verify language codes** match actual content
2. **Check for mixed languages** in files
3. **Ensure proper encoding** (UTF-8)
Low Scores Despite Good Translation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Possible causes:
* Regional dialect differences from training data
* Non-standard spelling variations
* Excessive code-switching or loanwords
See :doc:`output_format` for interpreting scores and :doc:`examples` for language-specific examples.