Statistical text analysis
How is a language detected? A submitted text is split into words, and each word is split further into small pieces called n-grams. More precisely, an n-gram is a contiguous subsequence of n items from a given sequence; here the items are letters.
For example, 13 different n-grams of up to 4 letters (n = 1 to 4) can be extracted from the word "hello":
| n-grams | | Type |
|---|---|---|
| h, e, l, o | -> | unigrams ("l" appears twice) |
| he, el, ll, lo | -> | bigrams |
| hel, ell, llo | -> | trigrams |
| hell, ello | -> | 4-grams |
With the exception of the unigram "l", all n-grams appear only once (that is: "l" is the most frequent n-gram of the word "hello").
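The count above can be checked with a few lines of Python. This is only a minimal sketch: the function name `word_ngrams` and the `max_n` parameter are made up for illustration and are not taken from the code used on this website.

```python
from collections import Counter

def word_ngrams(word, max_n=4):
    """Count every contiguous n-gram of length 1..max_n in a single word."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the word.
        for start in range(len(word) - n + 1):
            counts[word[start:start + n]] += 1
    return counts

counts = word_ngrams("hello")
print(len(counts))            # 13 distinct n-grams
print(counts.most_common(1))  # [('l', 2)]
```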
A working detection needs more than one word, of course. In most cases, ten to twenty words will be enough. The more words are submitted for detection, the clearer the n-gram pattern becomes (some n-grams will appear many times while others appear rarely or not at all).
And that's exactly what statistical text analysis is all about:
For each language listed on this website, there is a so-called finger-print: a chart of the most frequent n-grams in descending order of frequency. In it, the underscore marks a white space, i.e. the start or end position of an n-gram inside a word.
The top 5 n-grams (all of them are unigrams) for some western languages are:
| Finger-print | | Top 5 most frequent n-grams |
|---|---|---|
| German | -> | e, n, i, r, t |
| English | -> | e, t, o, n, i |
| French | -> | e, i, s, a, n |
| Spanish | -> | e, a, o, s, n |
| Portuguese | -> | e, a, o, s, r |
The most commonly used letter in all five languages is "e". Even these few n-grams suggest that Spanish and Portuguese are similar, or that the Romance languages use "s" more often than "t".
A statistical text analysis, however, goes much deeper: a finger-print contains several hundred of the most frequent n-grams (the analysis here uses 400), and it distinguishes whether an n-gram appears at the beginning, in the middle or at the end of a word (white space -> underscore).
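How such a finger-print might be built can be sketched as follows. Everything here is an assumption made for illustration: the function name `fingerprint`, the n-gram length limit of 4 (borrowed from the "hello" example), the alphabetical tie-breaking, and the simple splitting on white space without any punctuation handling. The code on this website may differ in all of these details.

```python
from collections import Counter

def fingerprint(text, max_n=4, size=400):
    """Return the `size` most frequent n-grams of `text`, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        # Underscores mark the white space before and after each word.
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for start in range(len(padded) - n + 1):
                gram = padded[start:start + n]
                # Skip n-grams that consist of underscores only.
                if gram.strip("_"):
                    counts[gram] += 1
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:size]]

print(fingerprint("the cat sat on the mat")[:10])
```

With this padding, "_t" stands for a "t" at the beginning of a word and "t_" for a "t" at the end, so both are counted as separate n-grams.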
The n-grams of your text are put into a chart and sorted by frequency in the same way, and this chart is then compared to each finger-print.
The comparison looks at the difference in ranking position for each n-gram. Let's say the most frequent n-gram of your text is the unigram "e". This would be a perfect match in ranking for all the languages in the table above (since their finger-prints have "e" in the top spot as well).
If the 2nd most frequent n-gram of your text is "n", it is again a perfect match to German. In English, however, "n" is in 4th place, which gives a difference of 2 ranks, or, as you may call it, two "Δ-points". French and Spanish are each assigned 3 Δ-points, and Hebrew, Arabic or Chinese, where the unigram "n" does not exist, are assigned the maximum of 400 Δ-points (the total number of ranks in a finger-print).
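The Δ-points for a single n-gram can be reproduced with a small sketch. The function name `delta_points` is hypothetical, and the lists hold only the five ranks shown in the table above, not complete finger-prints. Portuguese is left out because "n" does not appear in its top 5, although it presumably appears further down the full chart.

```python
def delta_points(ngram, text_rank, fingerprint, max_points=400):
    """Rank difference between the text and one finger-print for a single n-gram."""
    if ngram not in fingerprint:
        # The n-gram does not occur in this finger-print at all.
        return max_points
    fingerprint_rank = fingerprint.index(ngram) + 1  # ranks start at 1
    return abs(text_rank - fingerprint_rank)

# Top-5 ranks copied from the table above (truncated finger-prints).
top5 = {
    "German":  ["e", "n", "i", "r", "t"],
    "English": ["e", "t", "o", "n", "i"],
    "French":  ["e", "i", "s", "a", "n"],
    "Spanish": ["e", "a", "o", "s", "n"],
}

# "n" is the 2nd most frequent n-gram of the submitted text.
for language, ranks in top5.items():
    print(language, delta_points("n", 2, ranks))
# German 0
# English 2
# French 3
# Spanish 3

# An empty list stands in for a finger-print that does not contain "n".
print(delta_points("n", 2, []))  # 400
```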
If you compare all the n-grams of your text (or at least the most frequent ones) to each finger-print and add up the Δ-points, every language receives a total. The smaller the total, the more closely your text matches that language. The language with the fewest Δ-points should therefore be the language of your submitted text.
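Summing the Δ-points and picking the smallest total could look roughly like this. It is a toy sketch under stated assumptions: `total_delta` and `detect` are hypothetical names, the finger-prints are only the top-5 rows of the table above, the text ranking is invented for the example, and the maximum penalty is therefore 5 (the number of ranks in these shortened finger-prints) rather than 400.

```python
def total_delta(text_ranking, fingerprint, max_points):
    """Sum the rank differences of all n-grams in the text ranking."""
    fingerprint_rank = {gram: rank for rank, gram in enumerate(fingerprint, start=1)}
    total = 0
    for rank, gram in enumerate(text_ranking, start=1):
        if gram in fingerprint_rank:
            total += abs(rank - fingerprint_rank[gram])
        else:
            total += max_points  # n-gram missing from the finger-print
    return total

def detect(text_ranking, fingerprints):
    """Return the language with the fewest Δ-points, plus all totals."""
    scores = {
        language: total_delta(text_ranking, ranks, max_points=len(ranks))
        for language, ranks in fingerprints.items()
    }
    return min(scores, key=scores.get), scores

# Toy finger-prints: only the five ranks listed in the table above.
fingerprints = {
    "German":     ["e", "n", "i", "r", "t"],
    "English":    ["e", "t", "o", "n", "i"],
    "French":     ["e", "i", "s", "a", "n"],
    "Spanish":    ["e", "a", "o", "s", "n"],
    "Portuguese": ["e", "a", "o", "s", "r"],
}

# Invented ranking of the most frequent n-grams of a submitted text.
text_ranking = ["e", "t", "n", "o", "i"]

best, scores = detect(text_ranking, fingerprints)
print(best)    # English
print(scores)  # {'German': 11, 'English': 2, 'French': 15, 'Spanish': 13, 'Portuguese': 16}
```

With only five ranks the totals are very coarse; a real comparison would use the full charts of several hundred n-grams on both sides.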
More detailed information, with code and a demo script, is available on this website.