Supported Languages
WiMarka provides specialized support for evaluating machine translations involving English and three major Philippine languages.
Overview
WiMarka evaluates translations from English into Cebuano, Ilocano, and Tagalog, using specialized models trained on the linguistic patterns of each language pair.
Language Codes
| Code | Language | Native Name | Primary Use |
|---|---|---|---|
| EN | English | English | Source language |
| CEB | Cebuano | Binisaya | Target language |
| ILO | Ilocano | Ilokano | Target language |
| TGT | Tagalog | Tagalog/Filipino | Target language |
Supported Language Pairs
Currently Supported
WiMarka officially supports the following translation directions:
English → Cebuano (EN → CEB)
English → Ilocano (EN → ILO)
English → Tagalog (EN → TGT)
Example Usage
# English to Cebuano
wmk_eval('en.txt', 'EN', 'ceb.txt', 'CEB')
# English to Ilocano
wmk_eval('en.txt', 'EN', 'ilo.txt', 'ILO')
# English to Tagalog
wmk_eval('en.txt', 'EN', 'tgt.txt', 'TGT')
Not Currently Supported
The following translation directions are not yet supported:
Philippine language → English (reverse direction)
Philippine language → Philippine language (e.g., CEB → TGT)
Multilingual evaluations
Note
Support for additional language pairs may be added in future versions.
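The supported and unsupported directions listed in this section can be checked before an evaluation is started. The following is a minimal sketch, not part of WiMarka's API: `SUPPORTED_PAIRS` and `is_supported_pair` are hypothetical names introduced here for illustration, and the case-insensitive comparison is an assumption that this documentation does not confirm.

```python
# Hypothetical pre-check; not part of WiMarka itself.
# The pairs mirror the "Currently Supported" list in this document.
SUPPORTED_PAIRS = {("EN", "CEB"), ("EN", "ILO"), ("EN", "TGT")}


def is_supported_pair(source_code: str, target_code: str) -> bool:
    """Return True if source -> target is a documented direction."""
    # Assumption: codes are normalized to upper case before comparison.
    pair = (source_code.strip().upper(), target_code.strip().upper())
    return pair in SUPPORTED_PAIRS


if __name__ == "__main__":
    for src, tgt in [("EN", "CEB"), ("CEB", "EN"), ("CEB", "TGT")]:
        status = "supported" if is_supported_pair(src, tgt) else "not supported"
        print(f"{src} -> {tgt}: {status}")
```

Keeping the pairs in one set means a future direction only needs to be added in a single place.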
Language Details
English (EN)
Role: Source language
- Usage:
Typically used as the source text in evaluations
Planned as a target language for reverse-direction evaluation (future feature, not currently supported)
- Characteristics:
Subject-Verb-Object (SVO) word order
Rich vocabulary with Latin and Germanic roots
Minimal inflection
Cebuano (CEB)
Also Known As: Bisaya, Binisaya, Sugbuanon
Speakers: ~27 million (primarily in Visayas and Mindanao, Philippines)
- Usage in WiMarka:
Target language for EN → CEB evaluations
Specialized error detection for Cebuano grammar
- Linguistic Characteristics:
Verb-Subject-Object (VSO) word order typical
Extensive use of affixes (prefixes, infixes, suffixes)
Aspect-focused verb system
Linker particles (e.g., “nga”)
Example Translation:
EN: Good morning! How are you?
CEB: Maayong buntag! Kumusta ka?
- Common Patterns WiMarka Checks:
Proper use of aspect markers
Correct linker usage
Appropriate formality levels
Ilocano (ILO)
Also Known As: Ilokano, Iloko
Speakers: ~9 million (primarily in Northern Luzon, Philippines)
- Usage in WiMarka:
Target language for EN → ILO evaluations
Specialized error detection for Ilocano syntax
- Linguistic Characteristics:
Verb-initial word order
Extensive verbal morphology
Case marking system (nominative, genitive, oblique)
Reduplication for intensification and plurality
Example Translation:
EN: Good morning! How are you?
ILO: Naimbag a bigat! Kumusta kan?
- Common Patterns WiMarka Checks:
Case marker usage (ti/iti/dagiti)
Verbal affix correctness
Proper ligature (“a”) usage
Tagalog (TGT)
Also Known As: Filipino (official language of the Philippines)
Speakers: ~82 million (L1: ~29 million, L2: ~53 million)
- Usage in WiMarka:
Target language for EN → TGT evaluations
Specialized error detection for Tagalog/Filipino
- Linguistic Characteristics:
Verb-Subject-Object (VSO) word order typical
Complex focus system (actor, object, locative, benefactive)
Rich aspectual system
Linker particles (“na”, “-ng”)
Example Translation:
EN: Good morning! How are you?
TGT: Magandang umaga! Kumusta ka?
- Common Patterns WiMarka Checks:
Focus marking correctness
Aspect and mood markers
Proper use of enclitics and proclitics
Language-Specific Considerations
Script and Encoding
All supported languages use the Latin alphabet with the following considerations:
Encoding: All files must be UTF-8 encoded
Diacritics: Rare in modern usage, but supported
Special Characters: Standard ASCII characters recommended
Example of proper encoding:
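# Check file encoding (-I is the macOS flag; on Linux use -i)
file -I filename.txt
# Should show: text/plain; charset=utf-8
# ASCII-only files may report charset=us-ascii, which is also valid UTF-8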
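The same requirement can be checked from Python when the `file` utility is not available. This is a minimal sketch using only the standard library; `is_utf8` is a hypothetical helper name, not a WiMarka function. It confirms only that the bytes decode as UTF-8, not that the text is in the expected language.

```python
from pathlib import Path


def is_utf8(path: str) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        # Decoding raises UnicodeDecodeError on any invalid byte sequence.
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII files pass this check, since ASCII is a subset of UTF-8.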
Formality and Register
WiMarka evaluates translations considering appropriate formality levels:
- Cebuano:
Formal: “Kumusta kamo?” (you, plural/formal)
Informal: “Kumusta ka?” (you, singular/informal)
- Ilocano:
Formal: “Kumusta kayo?”
Informal: “Kumusta ka?”
- Tagalog:
Formal: “Kumusta po kayo?”
Informal: “Kumusta ka?”
Regional Variations
Philippine languages have regional variations:
- Cebuano:
Urban Cebu dialect (standard reference)
Boholano variant
Mindanao variants
- Ilocano:
Northern Ilocos dialect (standard reference)
Southern variations
- Tagalog:
Manila dialect (basis for Filipino)
Provincial variations
Note
WiMarka’s models are trained primarily on standard/prestige dialects but may recognize common regional variations.
Code-Switching and Borrowings
Speakers of Philippine languages frequently use English loanwords and switch between languages within a sentence (code-switching):
Acceptable:
EN: I will send you an email.
TGT: Magpapadala ako sa iyo ng email.
# "email" is an accepted loanword
- WiMarka’s Approach:
Common English loanwords are recognized
Excessive code-switching may lower fluency scores
Technical terms in English are usually acceptable
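To spot target files that may be affected by the code-switching behavior described above, one option is to estimate how much of a translation consists of English words. The sketch below is an illustration only and does not reproduce how WiMarka scores fluency: `english_token_ratio` is a hypothetical helper, and the English vocabulary is an assumption that the caller must supply.

```python
import re


def english_token_ratio(text: str, english_vocab: set[str]) -> float:
    """Fraction of word tokens in `text` found in `english_vocab`."""
    # Tokens are runs of letters; digits, punctuation, and underscores split them.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in english_vocab)
    return hits / len(tokens)


if __name__ == "__main__":
    # Assumed toy vocabulary; a real check would need a proper word list.
    vocab = {"email", "send", "meeting"}
    ratio = english_token_ratio("Magpapadala ako sa iyo ng email.", vocab)
    print(f"English token ratio: {ratio:.2f}")
```

For the Tagalog sentence above, one of six tokens ("email") is in the toy vocabulary. Any threshold for "excessive" would be a judgment call, since this documentation does not specify one.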
Spelling Conventions
WiMarka recognizes multiple valid spelling conventions:
- Example (Cebuano):
“maayo” / “maayong” (good)
“karon” / “karun” (now)
- Example (Tagalog):
“rin” / “din” (also)
“ko” / “ng” variations
Future Language Support
Potential Future Additions
Languages under consideration for future support:
Hiligaynon (Ilonggo) - ~9 million speakers
Waray - ~3.6 million speakers
Kapampangan - ~2.9 million speakers
Pangasinan - ~1.5 million speakers
Bikol - ~2.5 million speakers
Reverse Translation Support
Future versions may support:
Philippine languages → English evaluation
Bidirectional quality assessment
Inter-Philippine Translation
Potential support for:
CEB ↔ TGT
ILO ↔ TGT
Other Philippine language pairs
Language Resources
For more information about Philippine languages:
Komisyon sa Wikang Filipino - Official language commission
SIL Philippines - Language documentation
Best Practices
Choosing the Right Language Code
Verify the actual language of your text
Use consistent codes across your evaluation pipeline
Consider dialectal variation in your source material
Handling Multilingual Content
If your text contains multiple languages:
# Not ideal - mixed languages in one file
EN: Hello, how are you?
EN: Kumusta ka? [This is actually Tagalog]
# Better - separate by actual language
EN: Hello, how are you?
TGT: Kumusta ka?
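If lines already carry a language label as in the illustration above, they can be separated into one list per language before evaluation. This is a minimal sketch under that assumption: the `CODE: text` line format is taken from the illustration and is not a documented WiMarka input format, and `group_by_language` is a hypothetical helper. It cannot detect a line whose label is wrong, so the labels still need to be verified against the actual content.

```python
from collections import defaultdict


def group_by_language(lines: list[str]) -> dict[str, list[str]]:
    """Group 'CODE: text' lines by their language code."""
    groups: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        # Split on the first colon only, so colons inside the text are kept.
        code, sep, text = line.partition(":")
        if not sep:
            continue  # skip lines without a label
        groups[code.strip().upper()].append(text.strip())
    return dict(groups)


if __name__ == "__main__":
    labeled = ["EN: Hello, how are you?", "TGT: Kumusta ka?"]
    for code, texts in group_by_language(labeled).items():
        print(code, texts)
```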
Quality of Input Texts
For best results:
Use native speaker translations when possible
Ensure proper grammar in source texts
Maintain consistent terminology
Avoid excessive code-switching
Troubleshooting
Language Detection Issues
If WiMarka produces unexpected results:
Verify language codes match actual content
Check for mixed languages in files
Ensure proper encoding (UTF-8)
Low Scores Despite Good Translation
Possible causes:
Regional dialect differences from training data
Non-standard spelling variations
Excessive code-switching or loanwords
See Understanding Output Format for interpreting scores and Examples for language-specific examples.