Readability score in local languages

When writing content, correct grammar alone does not guarantee that your audience will digest it easily. Factors other than grammar contribute to readability too, among them the length and complexity of sentences. Quantifying these factors is a bit tricky, and this is where the readability score comes into the spotlight.

(Image: author's own drawing.)
1. What is readability score?

So what is a readability score? It is a number telling you how easy or hard a particular piece of writing is to read. If the score is high, readers will likely find your sentences easy to understand. Conversely, if the score is low, readers can still understand the text, but it takes them more time to digest the complex words and sentences.

2. Readability in English language

Research on readability scores is well developed for the English language. Some of the formulas include Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, etc. (check here for more details of the formulas). The formulas differ from each other but share some common inputs, such as the number of words, number of sentences, complex words, and polysyllabic words. However, these inputs might not be compatible with other languages. Some research focuses on developing formulas for local languages, but as far as I know, little of it is well recognized. In this article I will walk you through some papers developing this score for local languages. I have also implemented a Python package named "ReadabilityLola". Check out my repo for more detail.
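To make the English baseline concrete, here is a minimal sketch of the Flesch Reading Ease formula. The `count_syllables` heuristic below is my own naive approximation (counting vowel groups), not a linguistically exact syllabifier:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease: higher scores mean easier text.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

score = flesch_reading_ease("The cat sat on the mat. It was happy.")
```

Short, simple sentences like these score very high (above 100); dense academic prose can score below 30.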

3. Malay language

A study on Malay readability proposed a formula measuring the reading proficiency required for a Malay text:

Yi = a + b*ni + c*(di + ki)

Where:

i: a sample in the case study

ni: 300/S, the average number of words per sentence, where S is the number of sentences in a 300-word sample. In the implementation, I generalize ni to the total number of words divided by the number of sentences.

di: the number of syllables. In the implementation, I modify di to 300 * number of syllables / number of words.

ki: the count of potentially difficult words, including Kata Ganda (reduplicated words), Diftong (diphthongs), Kata Pinjaman (loanwords) and Kekeliruan Huruf (easily confused letters). Specifically, ki = (Kata Ganda + Diftong + Kata Pinjaman + Kekeliruan Huruf) * 5. In the implementation, I modify ki to 300 * (Kata Ganda + Diftong + Kata Pinjaman + Kekeliruan Huruf) * 5 / number of words.

a = -13.988, b = 0.3793, c = 0.0207

The output approximates the education grade level required to absorb the text.
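The formula above can be sketched directly in Python. This is my own sketch, not the package's exact API: the syllable count and the combined difficult-word count (Kata Ganda, Diftong, Kata Pinjaman, Kekeliruan Huruf) are assumed to be supplied by the caller, since detecting them requires Malay-specific linguistic resources.

```python
def malay_grade_level(total_words: int, total_sentences: int,
                      total_syllables: int, difficult_words: int) -> float:
    """Estimate the grade level needed to read a Malay text.

    difficult_words is the combined count of Kata Ganda, Diftong,
    Kata Pinjaman and Kekeliruan Huruf occurrences.
    """
    a, b, c = -13.988, 0.3793, 0.0207
    n_i = total_words / total_sentences            # average sentence length
    d_i = 300 * total_syllables / total_words      # syllables per 300 words
    k_i = 300 * difficult_words * 5 / total_words  # weighted difficult words per 300 words
    return a + b * n_i + c * (d_i + k_i)
```

For example, a 300-word text with 20 sentences, 600 syllables and 10 difficult words comes out at roughly grade 5.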

4. Hindi language

After analyzing the results of six models for each language, the authors conclude that the two most effective models for Hindi are:

-2.34 + 2.14*AWL + 0.01*PSW (1)

0.211 + 1.37*AWL + 0.005*JUK (2)

Where:

AWL: average word length

PSW: polysyllabic words, i.e., words whose syllable count exceeds 2.

JUK: jukta-akshar, or consonant conjuncts, i.e., consonants occurring together in clusters.

The second model performs slightly better than the first. However, due to the difficulty of calculating JUK, my package implements formula (1) instead.

The paper selects texts ranging from 400 to 1000 words, so I modify the formula for PSW to adapt to other lengths. If the content has more than 1000 words, PSW becomes 1000 * number of polysyllabic words / number of words. If the content has fewer than 400 words, PSW becomes 400 * number of polysyllabic words / number of words. Otherwise, PSW remains the raw number of polysyllabic words.
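Formula (1) with this length-normalized PSW can be sketched as follows. Again this is my own sketch rather than the package's exact API: the word, character and polysyllable counts are assumed to be precomputed, since accurate Devanagari syllabification needs its own tooling, and I take AWL to mean average word length in characters.

```python
def hindi_readability(total_words: int, total_chars: int,
                      polysyllabic_words: int) -> float:
    """Sketch of Hindi formula (1): -2.34 + 2.14*AWL + 0.01*PSW."""
    awl = total_chars / total_words  # average word length in characters
    # Normalize PSW to the 400-1000 word range the paper was fitted on.
    if total_words > 1000:
        psw = 1000 * polysyllabic_words / total_words
    elif total_words < 400:
        psw = 400 * polysyllabic_words / total_words
    else:
        psw = polysyllabic_words
    return -2.34 + 2.14 * awl + 0.01 * psw
```

Note that a 2000-word text with 100 polysyllabic words and a 500-word text with 50 get the same normalized PSW of 50, which is the point of the scaling.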

5. Code & package

Regarding the implementation of these formulas, I have built a package and indexed it on PyPI as ReadabilityLola. You can `pip install ReadabilityLola` like any Python package. Check out my code in the repo. In the future, I might implement scores for more local languages if I find any.

Feral cat trying to make sense of people and the world around.