Supported Languages

WiMarka provides specialized support for evaluating machine translations involving English and three major Philippine languages.

Overview

WiMarka is designed to evaluate translations from English to Philippine languages, with specialized models trained on linguistic patterns specific to each language pair.

Language Codes

Code   Language   Native Name        Primary Use
EN     English    English            Source language
CEB    Cebuano    Binisaya           Target language
ILO    Ilocano    Ilokano            Target language
TGT    Tagalog    Tagalog/Filipino   Target language

Supported Language Pairs

Currently Supported

WiMarka officially supports the following translation directions:

  1. English → Cebuano (EN → CEB)

  2. English → Ilocano (EN → ILO)

  3. English → Tagalog (EN → TGT)

Example Usage

# English to Cebuano
wmk_eval('en.txt', 'EN', 'ceb.txt', 'CEB')

# English to Ilocano
wmk_eval('en.txt', 'EN', 'ilo.txt', 'ILO')

# English to Tagalog
wmk_eval('en.txt', 'EN', 'tgt.txt', 'TGT')

Not Currently Supported

The following translation directions are not yet supported:

  • Philippine language → English (reverse direction)

  • Philippine language → Philippine language (e.g., CEB → TGT)

  • Multilingual evaluations

Note

Support for additional language pairs may be added in future versions.
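
Because only three directions are supported, it can help to reject an unsupported pair before starting an evaluation. The following is a minimal sketch, not part of WiMarka's documented API: the names SUPPORTED_PAIRS and check_pair are hypothetical, and the set contents are taken from the lists above.

```python
# Hypothetical helper for illustration; not part of WiMarka itself.
SUPPORTED_PAIRS = {("EN", "CEB"), ("EN", "ILO"), ("EN", "TGT")}


def check_pair(source_code: str, target_code: str) -> tuple[str, str]:
    """Return the normalized (source, target) pair, or raise ValueError."""
    pair = (source_code.strip().upper(), target_code.strip().upper())
    if pair not in SUPPORTED_PAIRS:
        supported = ", ".join(f"{s} -> {t}" for s, t in sorted(SUPPORTED_PAIRS))
        raise ValueError(
            f"Unsupported direction {pair[0]} -> {pair[1]}; supported: {supported}"
        )
    return pair
```

Calling check_pair('en', 'ceb') returns the normalized pair, while check_pair('CEB', 'TGT') raises a ValueError that lists the supported directions.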

Language Details

English (EN)

Role: Source language

Usage:
  • Typically used as the source text in evaluations

  • May serve as the target language once reverse-direction evaluation is supported (future feature; not yet available)

Characteristics:
  • Subject-Verb-Object (SVO) word order

  • Rich vocabulary with Latin and Germanic roots

  • Minimal inflection

Cebuano (CEB)

Also Known As: Bisaya, Binisaya, Sugbuanon

Speakers: ~27 million (primarily in Visayas and Mindanao, Philippines)

Usage in WiMarka:
  • Target language for EN → CEB evaluations

  • Specialized error detection for Cebuano grammar

Linguistic Characteristics:
  • Verb-Subject-Object (VSO) word order typical

  • Extensive use of affixes (prefixes, infixes, suffixes)

  • Aspect-focused verb system

  • Linker particles (e.g., “nga”)

Example Translation:

EN:  Good morning! How are you?
CEB: Maayong buntag! Kumusta ka?

Common Patterns WiMarka Checks:
  • Proper use of aspect markers

  • Correct linker usage

  • Appropriate formality levels

Ilocano (ILO)

Also Known As: Ilokano, Iloko

Speakers: ~9 million (primarily in Northern Luzon, Philippines)

Usage in WiMarka:
  • Target language for EN → ILO evaluations

  • Specialized error detection for Ilocano syntax

Linguistic Characteristics:
  • Verb-initial word order

  • Extensive verbal morphology

  • Case marking system (nominative, genitive, oblique)

  • Reduplication for intensification and plurality

Example Translation:

EN:  Good morning! How are you?
ILO: Naimbag a bigat! Kumusta kan?

Common Patterns WiMarka Checks:
  • Case marker usage (ti/iti/dagiti)

  • Verbal affix correctness

  • Proper ligature (“a”) usage

Tagalog (TGT)

Also Known As: Filipino (the standardized variety; national language and an official language of the Philippines)

Speakers: ~82 million (L1: ~29 million, L2: ~53 million)

Usage in WiMarka:
  • Target language for EN → TGT evaluations

  • Specialized error detection for Tagalog/Filipino

Linguistic Characteristics:
  • Verb-Subject-Object (VSO) word order typical

  • Complex focus system (actor, object, locative, benefactive)

  • Rich aspectual system

  • Linker particles (“na”, “ng”)

Example Translation:

EN:  Good morning! How are you?
TGT: Magandang umaga! Kumusta ka?

Common Patterns WiMarka Checks:
  • Focus marking correctness

  • Aspect and mood markers

  • Proper use of enclitics and proclitics

Language-Specific Considerations

Script and Encoding

All supported languages use the Latin alphabet with the following considerations:

  • Encoding: All files must be UTF-8 encoded

  • Diacritics: Rare in modern usage, but supported

  • Special Characters: Standard ASCII characters recommended

Example of proper encoding:

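# Check file encoding (-I is the macOS flag; on Linux, use: file -i filename.txt)
file -I filename.txt
# Should show: text/plain; charset=utf-8
# charset=us-ascii is also fine, since ASCII is a subset of UTF-8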
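
The same check can be done from Python by decoding the file strictly. This is a minimal sketch under the assumption that input files are small enough to read into memory; the function name find_non_ascii_lines is hypothetical and not part of WiMarka.

```python
from pathlib import Path


def find_non_ascii_lines(path: str) -> list[int]:
    """Return 1-based numbers of lines that contain non-ASCII characters.

    Raises UnicodeDecodeError if the file is not valid UTF-8.
    """
    # Strict decoding: any invalid byte sequence raises UnicodeDecodeError.
    text = Path(path).read_bytes().decode("utf-8")
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.isascii()
    ]
```

An empty result means the file is plain ASCII. A non-empty result is not an error; it only points to lines with diacritics or other special characters that may be worth a second look.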

Formality and Register

WiMarka evaluates translations considering appropriate formality levels:

Cebuano:
  • Formal: “Kumusta kamo?” (you, plural/formal)

  • Informal: “Kumusta ka?” (you, singular/informal)

Ilocano:
  • Formal: “Kumusta kayo?”

  • Informal: “Kumusta ka?”

Tagalog:
  • Formal: “Kumusta po kayo?”

  • Informal: “Kumusta ka?”

Regional Variations

Philippine languages have regional variations:

Cebuano:
  • Urban Cebu dialect (standard reference)

  • Boholano variant

  • Mindanao variants

Ilocano:
  • Northern Ilocos dialect (standard reference)

  • Southern variations

Tagalog:
  • Manila dialect (basis for Filipino)

  • Provincial variations

Note

WiMarka’s models are trained primarily on standard/prestige dialects but may recognize common regional variations.

Code-Switching and Borrowings

Philippine languages frequently incorporate English loanwords and code-switching:

Acceptable:

EN:  I will send you an email.
TGT: Magpapadala ako sa iyo ng email.
# "email" is an accepted loanword

WiMarka’s Approach:
  • Common English loanwords are recognized

  • Excessive code-switching may lower fluency scores

  • Technical terms in English are usually acceptable
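
To get a rough sense of how much English a translation contains before evaluating it, one option is to measure the share of English tokens that are not accepted loanwords. The sketch below is only an illustration and does not reflect how WiMarka scores fluency: both word lists are tiny placeholders, and english_ratio is a hypothetical name.

```python
import re

# Placeholder lists for illustration only; a real check needs proper lexicons.
ENGLISH_WORDS = {"the", "send", "you", "will", "meeting", "email", "please"}
ACCEPTED_LOANWORDS = {"email"}


def english_ratio(sentence: str) -> float:
    """Share of tokens that are English words outside the loanword allowlist."""
    # Tokens are runs of letters, so punctuation and digits are ignored.
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    if not tokens:
        return 0.0
    flagged = [
        t for t in tokens if t in ENGLISH_WORDS and t not in ACCEPTED_LOANWORDS
    ]
    return len(flagged) / len(tokens)
```

With these placeholder lists, the Tagalog sentence above yields a ratio of 0.0 because "email" is on the allowlist. What counts as "excessive" is a judgment call that this sketch does not settle.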

Spelling Conventions

WiMarka recognizes multiple valid spelling conventions:

Example (Cebuano):
  • “maayo” / “maayong” (good)

  • “karon” / “karun” (now)

Example (Tagalog):
  • “rin” / “din” (also)

  • “ko” / “ng” variations
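
When comparing translations outside WiMarka (for example, to deduplicate references), variant forms can be mapped to one canonical form first. This is a minimal sketch that uses only pairs listed above; the table VARIANTS and the function normalize_variants are hypothetical, and the choice of canonical form is arbitrary.

```python
import re

# Variant -> canonical form, per language code; limited to pairs listed above.
VARIANTS = {
    "CEB": {"karun": "karon"},
    "TGT": {"din": "rin"},
}


def normalize_variants(text: str, code: str) -> str:
    """Replace known variant words with their canonical form.

    Replaced words come out in lowercase; all other text is unchanged.
    """
    mapping = VARIANTS.get(code, {})
    return re.sub(
        r"[^\W\d_]+",
        lambda m: mapping.get(m.group().lower(), m.group()),
        text,
    )
```

Keeping one table per language code avoids rewriting a word that is a variant in one language but an ordinary word in another.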

Future Language Support

Potential Future Additions

Languages under consideration for future support:

  • Hiligaynon (Ilonggo) - ~9 million speakers

  • Waray - ~3.6 million speakers

  • Kapampangan - ~2.9 million speakers

  • Pangasinan - ~1.5 million speakers

  • Bikol - ~2.5 million speakers

Reverse Translation Support

Future versions may support:

  • Philippine languages → English evaluation

  • Bidirectional quality assessment

Inter-Philippine Translation

Potential support for:

  • CEB ↔ TGT

  • ILO ↔ TGT

  • Other Philippine language pairs

Language Resources

For more information about Philippine languages:

Best Practices

Choosing the Right Language Code

  1. Verify the actual language of your text

  2. Use consistent codes across your evaluation pipeline

  3. Consider dialectal variation in your source material

Handling Multilingual Content

If your text contains multiple languages:

# Not ideal - mixed languages in one file
EN: Hello, how are you?
EN: Kumusta ka? [This is actually Tagalog]

# Better - separate by actual language
EN: Hello, how are you?
TGT: Kumusta ka?
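
If lines carry a code prefix as in the illustration above, they can be grouped by code before being written to per-language files. This is a sketch under that assumption only: it trusts the labels and does not detect the language, and group_by_language is a hypothetical helper, not a WiMarka function.

```python
from collections import defaultdict

KNOWN_CODES = {"EN", "CEB", "ILO", "TGT"}


def group_by_language(lines):
    """Group 'CODE: text' lines into a dict mapping code -> list of texts."""
    groups = defaultdict(list)
    for line in lines:
        # Split on the first colon only, so colons inside the text survive.
        code, sep, text = line.partition(":")
        code = code.strip().upper()
        if not sep or code not in KNOWN_CODES:
            raise ValueError(f"Missing or unknown language tag: {line!r}")
        groups[code].append(text.strip())
    return dict(groups)
```

For the "better" example above, this returns one list for EN and one for TGT, each of which can then be saved to its own file and passed to an evaluation with the matching code.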

Quality of Input Texts

For best results:

  • Use native speaker translations when possible

  • Ensure proper grammar in source texts

  • Maintain consistent terminology

  • Avoid excessive code-switching

Troubleshooting

Language Detection Issues

If WiMarka produces unexpected results:

  1. Verify language codes match actual content

  2. Check for mixed languages in files

  3. Ensure proper encoding (UTF-8)

Low Scores Despite Good Translation

Possible causes:

  • Regional dialect differences from training data

  • Non-standard spelling variations

  • Excessive code-switching or loanwords

See Understanding Output Format for interpreting scores and Examples for language-specific examples.