Readability score in local languages

Dieu_HOA Nguyen
Jul 22, 2020


Writing with correct grammar does not guarantee that your audience will digest the content easily. Factors other than grammar contribute to readability too, among them the length and complexity of sentences. Quantifying these factors is a bit tricky, and this is where the readability score comes into the spotlight.

Source: self drawing.
1. What is readability score?

So what is a readability score? It is a number that tells you how easy or hard a particular piece of writing is to read. If the score is high, people will likely find your sentences easy to understand. If it is low, people can still understand the text, but it takes them more time to digest the complex words and sentences.

2. Readability in English language

Research on readability scores is well developed for the English language. Some of the formulas include Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, etc. (check here for more detail on the formulas). The formulas differ from each other but share some common inputs, such as the number of words, the number of sentences, complex words and polysyllabic words. However, these inputs might not be compatible with other languages. Some research focuses on developing formulas for local languages but, as far as I know, few are well recognized. In this article I will walk you through some papers that develop this score for local languages. I have also implemented a Python package named "ReadabilityLola". Check out my repo for more detail.
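To make the shared inputs concrete, here is a minimal sketch of the three English scores named above, using their commonly published coefficients. The function names are illustrative only and are not taken from ReadabilityLola. Counting words, sentences, syllables and complex words is left to the caller.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: a higher score means easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index: complex_words are words with 3 or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


if __name__ == "__main__":
    # Hypothetical counts for a short passage.
    print(flesch_reading_ease(words=100, sentences=5, syllables=150))
    print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
    print(gunning_fog(words=100, sentences=5, complex_words=10))
```

All three rely on syllable counts or a notion of "complex word" tuned to English, which is why they do not transfer directly to other languages.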

3. Malay language

Nur Amalina Mohamad Hazawawi, Mohd Hafiz Zakaria and Syariffanor Hisham developed the SPIKE (Sistem Penilaian Kebolehbacaan Bahasa Melayu) score to measure the readability of Bahasa Melayu. They claimed that “Several articles from Malay newspaper and education magazines were sampled, then analysed to match the suggested level of reader competencies from the highest to the lowest readability level. Results show that SPIKE can benefit general public particularly people with reading difficulties including dyslexics in measuring their reading competencies and to check whether a reading material is suited for their age.”

They proposed a formula measuring reading proficiency in Malay text:

Yi = a + b*ni + c*(di+ ki)

Where:

i: index of the sample in the case study

ni: average number of words per sentence, computed as 300/S, where S is the number of sentences in a 300-word sample. In the implementation, I modify ni to number of sentences/total number of words.

di: number of syllables in the sample. In the implementation, I modify di = 300*number of syllables/number of words.

ki: number of potentially difficult words, covering Kata Ganda (reduplicated words), Diftong (diphthongs), Kata Pinjaman (loanwords) and Kekeliruan Huruf (letter confusion). Specifically, ki = (Kata Ganda+Diftong+Kata Pinjaman+Kekeliruan Huruf)*5. In the implementation, I modify ki = 300*(Kata Ganda+Diftong+Kata Pinjaman+Kekeliruan Huruf)*5/number of words.

a = -13.988, b = 0.3793, c = 0.0207

The output approximates the education grade level needed to absorb the text.
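The formula above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code of ReadabilityLola: the function name and arguments are hypothetical, the counts (including the number of difficult words) are supplied by the caller, and ni is taken as average words per sentence, which is what 300/S gives on a 300-word sample.

```python
# Coefficients as quoted in the article.
A, B, C = -13.988, 0.3793, 0.0207


def spike_grade(n_words, n_sentences, n_syllables, n_difficult):
    """Estimate the SPIKE grade level from raw counts (illustrative sketch).

    n_difficult is the combined count of Kata Ganda, Diftong,
    Kata Pinjaman and Kekeliruan Huruf occurrences.
    """
    # Assumption: words per sentence, equal to 300/S on a 300-word sample.
    n = n_words / n_sentences
    # Syllables rescaled to a 300-word basis, as described in the article.
    d = 300 * n_syllables / n_words
    # Difficult words weighted by 5 and rescaled to a 300-word basis.
    k = 300 * n_difficult * 5 / n_words
    return A + B * n + C * (d + k)


if __name__ == "__main__":
    # Hypothetical counts for a 300-word sample.
    print(spike_grade(n_words=300, n_sentences=20, n_syllables=700, n_difficult=10))
```

Rescaling di and ki to a 300-word basis lets the same coefficients be applied to texts that are not exactly 300 words long.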

4. Hindi language

Manjira Sinha, Sakshi Sharma, Tirthankar Dasgupta and Anupam Basu (2012) developed several models to represent the readability score in Bangla and Hindi. Some of the features considered are average sentence length, average word length, average number of syllables per word, number of poly-syllabic words per 30 sentences, etc. They claimed that "Sixteen Hindi and sixteen Bangla texts are selected for the experiment (11 texts) and validation (5 texts) purpose. They cover a broad range of documents types starting from new paper article, short stories, interviews, and blogs to philosophical articles. So we can generalize the model for a variety of text types. Excerpts of length varying from 400 to 1000 words are chosen randomly from the texts to examine the parameters responsible for text readability in case of short as well as long documents. The texts are numbered from 1 to 16 arbitrarily and henceforth will be referred by the text number only."

After analyzing the results of 6 models for each language, they conclude that the 2 most effective models for Hindi are:

-2.34+2.14*AWL+0.01*PSW (1)

0.211+1.37*AWL+0.005*JUK (2)

Where:

AWL: average word length

PSW: number of poly-syllabic words, i.e. words with more than 2 syllables

JUK: number of jukta-akshars (consonant conjuncts), i.e. consonants occurring together in clusters

The second model performs slightly better than the first. However, because JUK is difficult to calculate, my package implements formula (1) instead.

The paper selects texts ranging from 400 to 1000 words, so I modify the PSW term to adapt to other lengths. If the content has more than 1000 words, PSW will be 1000*number of poly-syllabic words/number of words. If the content has fewer than 400 words, PSW will be 400*number of poly-syllabic words/number of words. Otherwise, PSW remains as number of poly-syllabic words/number of words.
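Formula (1) with this length adjustment can be sketched as follows. This is a minimal sketch and not the ReadabilityLola source: the function names are hypothetical, AWL and the poly-syllabic word count are supplied by the caller, and the lower bound is taken as 400 words to match the paper's 400 to 1000 word range. For texts inside that range the sketch uses the raw count of poly-syllabic words, which is an assumption about the intended behaviour.

```python
def scaled_psw(n_poly, n_words):
    """Rescale the poly-syllabic word count for texts outside 400-1000 words."""
    if n_words > 1000:
        # Long text: express the count per 1000 words.
        return 1000 * n_poly / n_words
    if n_words < 400:
        # Short text: express the count per 400 words.
        return 400 * n_poly / n_words
    # Assumption: inside the paper's range the raw count is used unchanged.
    return n_poly


def hindi_score_model1(awl, n_poly, n_words):
    """Formula (1): -2.34 + 2.14*AWL + 0.01*PSW (illustrative sketch)."""
    return -2.34 + 2.14 * awl + 0.01 * scaled_psw(n_poly, n_words)


if __name__ == "__main__":
    # Hypothetical inputs: average word length 4.0, 50 poly-syllabic
    # words in a 500-word text.
    print(hindi_score_model1(awl=4.0, n_poly=50, n_words=500))
```

The rescaling keeps the PSW term on roughly the same scale as in the texts the model was fitted on, so a very long document is not penalized simply for containing more words.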

5. Code & package

For the implementation of these formulas, I have published a package on PyPI named ReadabilityLola. You can pip install ReadabilityLola like any other Python package. Check out the code in my repo. In the future, I might add scores for more local languages if I find any.
