Supported Languages

WiMarka provides specialized support for evaluating machine translations involving English and three major Philippine languages.

Overview

WiMarka evaluates translations from English into Philippine languages, using specialized models trained on the linguistic patterns of each language pair.

Language Codes

Code	Language	Native Name	Primary Use
`EN`	English	English	Source language
`CEB`	Cebuano	Binisaya	Target language
`ILO`	Ilocano	Ilokano	Target language
`TGT`	Tagalog	Tagalog/Filipino	Target language

Supported Language Pairs

Currently Supported

WiMarka officially supports the following translation directions:

English → Cebuano (EN → CEB)
English → Ilocano (EN → ILO)
English → Tagalog (EN → TGT)

Example Usage

# English to Cebuano
wmk_eval('en.txt', 'EN', 'ceb.txt', 'CEB')

# English to Ilocano
wmk_eval('en.txt', 'EN', 'ilo.txt', 'ILO')

# English to Tagalog
wmk_eval('en.txt', 'EN', 'tgt.txt', 'TGT')

Not Currently Supported

The following translation directions are not yet supported:

Philippine language → English (reverse direction)
Philippine language → Philippine language (e.g., CEB → TGT)
Multilingual evaluations

Note

Support for additional language pairs may be added in future versions.
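Because only three directions are supported, a pipeline can reject other pairs before any evaluation runs. The following is a minimal sketch, not part of WiMarka's API: `SUPPORTED_PAIRS` and `check_pair` are hypothetical names introduced here, and normalizing codes to upper case is an assumption of the sketch rather than documented WiMarka behavior.

```python
# Hypothetical pre-flight check; not part of WiMarka itself.
# The pairs mirror the "Currently Supported" list above.
SUPPORTED_PAIRS = {("EN", "CEB"), ("EN", "ILO"), ("EN", "TGT")}


def check_pair(source_lang, target_lang):
    """Return the normalized (source, target) codes or raise ValueError."""
    # Assumption: codes are compared case-insensitively in this sketch.
    pair = (source_lang.strip().upper(), target_lang.strip().upper())
    if pair not in SUPPORTED_PAIRS:
        supported = ", ".join(f"{s} -> {t}" for s, t in sorted(SUPPORTED_PAIRS))
        raise ValueError(
            f"Unsupported direction {pair[0]} -> {pair[1]}; supported: {supported}"
        )
    return pair
```

Calling `check_pair('EN', 'CEB')` returns the pair unchanged, while `check_pair('CEB', 'TGT')` raises `ValueError`, which lets a batch job fail early with a readable message.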

Language Details

English (EN)

Role: Source language

Usage:

Typically used as the source text in evaluations
Not currently accepted as a target language; reverse-direction evaluation is a possible future feature

Characteristics:

Subject-Verb-Object (SVO) word order
Rich vocabulary with Latin and Germanic roots
Minimal inflection

Cebuano (CEB)

Also Known As: Bisaya, Binisaya, Sugbuanon

Speakers: ~27 million (primarily in Visayas and Mindanao, Philippines)

Usage in WiMarka:

Target language for EN → CEB evaluations
Specialized error detection for Cebuano grammar

Linguistic Characteristics:

Verb-Subject-Object (VSO) word order typical
Extensive use of affixes (prefixes, infixes, suffixes)
Aspect-focused verb system
Linker particles (e.g., “nga”)

Example Translation:

EN:  Good morning! How are you?
CEB: Maayong buntag! Kumusta ka?

Common Patterns WiMarka Checks:

Proper use of aspect markers
Correct linker usage
Appropriate formality levels

Ilocano (ILO)

Also Known As: Ilokano, Iloko

Speakers: ~9 million (primarily in Northern Luzon, Philippines)

Usage in WiMarka:

Target language for EN → ILO evaluations
Specialized error detection for Ilocano syntax

Linguistic Characteristics:

Verb-initial word order
Extensive verbal morphology
Case marking system (nominative, genitive, oblique)
Reduplication for intensification and plurality

Example Translation:

EN:  Good morning! How are you?
ILO: Naimbag a bigat! Kumusta kan?

Common Patterns WiMarka Checks:

Case marker usage (ti/iti/dagiti)
Verbal affix correctness
Proper ligature (“a”) usage

Tagalog (TGT)

Also Known As: Filipino (the standardized form of Tagalog and the national language of the Philippines)

Speakers: ~82 million (L1: ~29 million, L2: ~53 million)

Usage in WiMarka:

Target language for EN → TGT evaluations
Specialized error detection for Tagalog/Filipino

Linguistic Characteristics:

Verb-Subject-Object (VSO) word order typical
Complex focus system (actor, object, locative, benefactive)
Rich aspectual system
Linker particles (“na” after consonants, “-ng” after vowels)

Example Translation:

EN:  Good morning! How are you?
TGT: Magandang umaga! Kumusta ka?

Common Patterns WiMarka Checks:

Focus marking correctness
Aspect and mood markers
Proper use of enclitics and proclitics

Language-Specific Considerations

Script and Encoding

All supported languages use the Latin alphabet with the following considerations:

Encoding: All files must be UTF-8 encoded
Diacritics: Rare in modern usage, but supported
Special Characters: Standard ASCII characters recommended

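To check a file's encoding:

# Check file encoding (macOS: file -I; Linux: file -i)
file -I filename.txt
# Should show: text/plain; charset=utf-8
# (charset=us-ascii is also fine, since pure ASCII is valid UTF-8)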
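The same check can be done from Python when the `file` utility is unavailable. This is a minimal sketch using only the standard library; `check_utf8` is a hypothetical helper name, not a WiMarka function. It reports whether the bytes decode as UTF-8 and lists any non-ASCII characters (such as diacritics) so they can be reviewed.

```python
from pathlib import Path


def check_utf8(path):
    """Return (is_utf8, non_ascii_chars) for the file at `path`."""
    data = Path(path).read_bytes()
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8; re-save the file as UTF-8.
        return False, []
    # Collect the distinct non-ASCII characters, e.g. diacritics.
    non_ascii = sorted({ch for ch in text if ord(ch) > 127})
    return True, non_ascii
```

A file that passes with an empty character list is plain ASCII, which matches the recommendation above.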

Formality and Register

WiMarka evaluates translations considering appropriate formality levels:

Cebuano:

Formal: “Kumusta kamo?” (you, plural/formal)
Informal: “Kumusta ka?” (you, singular/informal)

Ilocano:

Formal: “Kumusta kayo?”
Informal: “Kumusta ka?”

Tagalog:

Formal: “Kumusta po kayo?”
Informal: “Kumusta ka?”

Regional Variations

Philippine languages have regional variations:

Cebuano:

Urban Cebu dialect (standard reference)
Boholano variant
Mindanao variants

Ilocano:

Northern Ilocos dialect (standard reference)
Southern variations

Tagalog:

Manila dialect (basis for Filipino)
Provincial variations

Note

WiMarka’s models are trained primarily on standard/prestige dialects but may recognize common regional variations.

Code-Switching and Borrowings

Philippine languages frequently incorporate English loanwords and code-switching:

Acceptable:

EN:  I will send you an email.
TGT: Magpapadala ako sa iyo ng email.
# "email" is an accepted loanword

WiMarka’s Approach:

Common English loanwords are recognized
Excessive code-switching may lower fluency scores
Technical terms in English are usually acceptable
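To get a rough sense of how much English a translation contains before evaluating it, a simple token ratio can help. The sketch below is a hypothetical heuristic and does not reflect how WiMarka scores fluency: `english_token_ratio`, the caller-supplied `english_words` set, and the `loanwords` allow-list are all assumptions introduced for illustration, and WiMarka does not document a threshold for "excessive".

```python
import re


def english_token_ratio(sentence, english_words, loanwords=frozenset()):
    """Fraction of tokens found in `english_words` but not in `loanwords`."""
    # Tokens are runs of letters, optionally joined by hyphens or apostrophes,
    # so an affixed form like "mag-send" stays a single token.
    tokens = re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", sentence.lower())
    if not tokens:
        return 0.0
    flagged = [t for t in tokens if t in english_words and t not in loanwords]
    return len(flagged) / len(tokens)
```

With `email` on the allow-list, the Tagalog sentence above contributes no flagged tokens. The word lists are left to the caller because no fixed loanword inventory is specified here.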

Spelling Conventions

WiMarka recognizes multiple valid spelling conventions:

Example (Cebuano):

“maayo” / “maayong” (good)
“karon” / “karun” (now)

Example (Tagalog):

“rin” / “din” (also)
“ko” / “ng” variations

Future Language Support

Potential Future Additions

Languages under consideration for future support:

Hiligaynon (Ilonggo) - ~9 million speakers
Waray - ~3.6 million speakers
Kapampangan - ~2.9 million speakers
Pangasinan - ~1.5 million speakers
Bikol - ~2.5 million speakers

Reverse Translation Support

Future versions may support:

Philippine languages → English evaluation
Bidirectional quality assessment

Inter-Philippine Translation

Potential support for:

CEB ↔ TGT
ILO ↔ TGT
Other Philippine language pairs

Language Resources

For more information about Philippine languages:

Komisyon sa Wikang Filipino - Official language commission
Ethnologue - Languages of the Philippines
SIL Philippines - Language documentation

Best Practices

Choosing the Right Language Code

Verify the actual language of your text
Use consistent codes across your evaluation pipeline
Consider dialectal variation in your source material

Handling Multilingual Content

If your text contains multiple languages:

# Not ideal - mixed languages in one file
EN: Hello, how are you?
EN: Kumusta ka? [This is actually Tagalog]

# Better - separate by actual language
EN: Hello, how are you?
TGT: Kumusta ka?
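If lines already carry a language label as in the illustration above, they can be separated into per-language groups before being written to the files passed to the evaluator. This is a minimal sketch that assumes a `CODE: text` prefix on each line; that layout, `KNOWN_CODES`, and `split_by_language` are assumptions made for the example, not a WiMarka input format. The function only sorts by label and does not detect the actual language of the text.

```python
from collections import defaultdict

# Codes taken from the Language Codes table.
KNOWN_CODES = {"EN", "CEB", "ILO", "TGT"}


def split_by_language(lines):
    """Group 'CODE: text' lines by code; unlabeled lines go under None."""
    groups = defaultdict(list)
    for line in lines:
        code, sep, text = line.partition(":")
        code = code.strip().upper()
        if sep and code in KNOWN_CODES:
            groups[code].append(text.strip())
        else:
            # No recognized label: keep the whole line for manual review.
            groups[None].append(line.strip())
    return dict(groups)
```

Lines that end up under `None` are the ones worth inspecting by hand, since a mislabeled or unlabeled line is the usual source of mixed-language files.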

Quality of Input Texts

For best results:

Use native speaker translations when possible
Ensure proper grammar in source texts
Maintain consistent terminology
Avoid excessive code-switching

Troubleshooting

Language Detection Issues

If WiMarka produces unexpected results:

Verify language codes match actual content
Check for mixed languages in files
Ensure proper encoding (UTF-8)

Low Scores Despite Good Translation

Possible causes:

Regional dialect differences from training data
Non-standard spelling variations
Excessive code-switching or loanwords

See Understanding Output Format for interpreting scores and Examples for language-specific examples.