
Statistical text analysis

How to detect a language? A submitted text is cut into words, and each word is cut further into small pieces called n-grams. More exactly, an n-gram is a contiguous subsequence of n items (here: letters) from a given sequence.

For example, 13 different n-grams of up to 4 letters (n = 1 to 4) can be extracted from the word "hello":

 h, e, l, o -> unigrams ("l" appears twice)
 he, el, ll, lo -> bigrams
 hel, ell, llo -> trigrams
 hell, ello -> 4-grams

With the exception of the unigram "l", all n-grams appear only once (that is: "l" is the most frequent n-gram of the word "hello").
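
The "hello" example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name word_ngrams and the parameter max_n are chosen for illustration and are not taken from the code on this website.

```python
from collections import Counter

def word_ngrams(word, max_n=4):
    """Return every n-gram of length 1..max_n in word, repeats included."""
    return [word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]

counts = Counter(word_ngrams("hello"))
print(len(counts))            # 13 different n-grams
print(counts.most_common(1))  # [('l', 2)], "l" is the only n-gram seen twice
```

Counting with a Counter keeps the repeats, which is what the frequency ranking further down relies on.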

A working detection needs more than one word, of course. In most cases, ten to twenty words will be enough. The more words that are submitted for detection, the clearer the n-gram pattern becomes (some n-grams will appear many times while others don't).

And that's exactly what statistical text analysis is all about:

For each language listed on this website, there's a so-called finger-print: a chart of the most frequent n-grams in descending order. In it, an underscore marks a white space, i.e. it shows that the n-gram sits at the start or the end of a word.
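
How such a chart could be built is sketched below. Several details are assumptions made for the sketch and are not confirmed by the text: each word is padded with exactly one underscore on either side, words are runs of letters in lowercased text, and ties in frequency are broken alphabetically. The names padded_ngrams and fingerprint are illustrative.

```python
import re
from collections import Counter

def padded_ngrams(word, max_n=4):
    """N-grams of "_word_"; the underscore marks a word boundary (assumed padding)."""
    padded = f"_{word}_"
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            if gram.strip("_"):  # skip pieces that consist of underscores only
                grams.append(gram)
    return grams

def fingerprint(text, size=400, max_n=4):
    """Return the `size` most frequent n-grams of text, most frequent first."""
    counts = Counter()
    # [^\W\d_]+ matches runs of letters (Unicode-aware), i.e. the words.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        counts.update(padded_ngrams(word, max_n))
    # Sort by descending count; alphabetical order breaks ties (assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]
```

With this padding, "hello" yields "_h" for its start and "o_" for its end, in addition to the n-grams from the middle of the word.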

The top 5 n-grams (all of them are unigrams) for some western languages are:

 Finger-print -> Top 5 most frequent n-grams
 German -> e, n, i, r, t
 English -> e, t, o, n, i
 French -> e, i, s, a, n
 Spanish -> e, a, o, s, n
 Portuguese -> e, a, o, s, r

The most commonly used letter of the alphabet in all five languages is the letter "e". These few n-grams also let you assume that Spanish and Portuguese are similar, or that Romance languages favor the use of "s" over "t".

A statistical text analysis, however, goes much deeper: a finger-print contains several hundred of the most frequent n-grams (the analysis here uses 400), and a distinction is made as to whether the n-gram appears at the beginning, in the middle or at the end of a word (white space -> underscore).

The n-grams of your text are put into a chart and sorted by frequency in the same way. Finally, this chart is compared to each finger-print.

The comparison looks at the difference in ranking position for each specific n-gram. Let's say the most frequent n-gram of your text is the unigram "e". This would be a perfect match in ranking with all the languages in the table above (since their finger-prints have the "e" in the top spot as well).

If the 2nd most frequent n-gram of your text is an "n", it is again a perfect match to German. In English, however, "n" is in 4th place, so there is a difference of 2 ranks, or, as you may call it, two "Δ-points". French and Spanish, where "n" is in 5th place, are each assigned 3 Δ-points. Hebrew, Arabic or Chinese, where the unigram "n" does not exist, are assigned the maximum of 400 Δ-points (the total number of ranks in a finger-print).
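
The Δ-points for the "n" example can be recomputed as follows. This is a small sketch, not the code of this website: the function name delta_points is illustrative, ranks are counted from 0 (so 2nd place is rank 1), and the top-5 lists from the table stand in for full 400-entry finger-prints. Portuguese is left out because its top 5 does not show where "n" ranks.

```python
MAX_DELTA = 400  # number of ranks in a full finger-print

# Top-5 lists copied from the table; real finger-prints are much longer.
TOP5 = {
    "German":  ["e", "n", "i", "r", "t"],
    "English": ["e", "t", "o", "n", "i"],
    "French":  ["e", "i", "s", "a", "n"],
    "Spanish": ["e", "a", "o", "s", "n"],
}

def delta_points(ngram, text_rank, fingerprint, max_delta=MAX_DELTA):
    """Rank difference for one n-gram; max_delta if the finger-print lacks it."""
    if ngram not in fingerprint:
        return max_delta
    return abs(fingerprint.index(ngram) - text_rank)

for language, fp in TOP5.items():
    # "n" is the 2nd most frequent n-gram of the text, i.e. rank 1.
    print(language, delta_points("n", 1, fp))
# German 0
# English 2
# French 3
# Spanish 3
```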

If you compare all the n-grams of your text (or at least the most frequent ones) to each finger-print, a total number of Δ-points can be assigned to every language. The smaller the total, the more closely your text matches that language. Hence, the language with the fewest Δ-points should be the language of your submitted text.
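
Summing the Δ-points and picking the smallest total might look like the sketch below. The names total_delta and detect are illustrative, and the five-entry lists from the table above are used as toy finger-prints. Because they are so short, many n-grams count as missing and receive the full 400 Δ-points, which would rarely happen with complete 400-entry finger-prints.

```python
MAX_DELTA = 400

# Toy finger-prints: only the top 5 from the table, not the full charts.
FINGERPRINTS = {
    "German":     ["e", "n", "i", "r", "t"],
    "English":    ["e", "t", "o", "n", "i"],
    "French":     ["e", "i", "s", "a", "n"],
    "Spanish":    ["e", "a", "o", "s", "n"],
    "Portuguese": ["e", "a", "o", "s", "r"],
}

def total_delta(text_profile, fingerprint, max_delta=MAX_DELTA):
    """Sum of rank differences between a text's chart and one finger-print."""
    rank_in_fp = {gram: rank for rank, gram in enumerate(fingerprint)}
    return sum(
        abs(rank_in_fp[gram] - rank) if gram in rank_in_fp else max_delta
        for rank, gram in enumerate(text_profile)
    )

def detect(text_profile, fingerprints):
    """Return the language with the fewest Δ-points, plus all totals."""
    totals = {lang: total_delta(text_profile, fp)
              for lang, fp in fingerprints.items()}
    return min(totals, key=totals.get), totals

# A made-up chart of a text, most frequent n-gram first.
language, totals = detect(["e", "t", "n", "o", "i"], FINGERPRINTS)
print(language)  # English
```

Looking the ranks up in a dictionary avoids searching the finger-print list again for every n-gram of the text.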

More detailed information, with code and a demo script, is available on this website.



  Afrikaans
  Albanian
  Alemannic
  Amharic
  Arabic
  Armenian
  Basque
  Belarusian
  Bosnian
  Breton
  Bulgarian
  Catalan
  Chinese
  Croatian
  Czech
  Danish
  Dutch
  English
  Esperanto
  Estonian
  Finnish
  French
  Frisian
  Georgian
  German
  Greek
  Hawaian
  Hebrew
  Hindi
  Hungarian
  Icelandic
  Indonesian
  Irish_gaelic
  Italian
  Japanese
  Korean
  Latin
  Latvian
  Lithuanian
  Malay
  Manx
  Marathi
  Middle_frisian
  Mingo_iroquois
  Nepali
  Norwegian
  Persian_farsi
  Polish
  Portuguese_brazil
  Portuguese_europe
  Quechua
  Romanian
  Rumantsch
  Russian
  Sanskrit
  Scots
  Scots_gaelic
  Serbian
  Serbian_cyrillic
  Slovak
  Slovenian
  Spanish
  Swahili
  Swedish
  Tagalog
  Tamil
  Thai
  Turkish
  Ukrainian
  Vietnamese
  Welsh
  Yiddish