Re: Language detection with PHP

William Lovaton <williama_lovaton@xxxxxxxxxxxxxx> · Thu, 29 Mar 2007 07:36:10 -0500

Hi,

Thanks to all of you who made suggestions.

Satyam, I was aware of many of the things you said in your post, but not
of some of the details. Thanks for being so specific.

In my original post I oversimplified my spell-checker approach; it is
in fact a little more complex than that. I took into account that in
some languages people do not always spell every word correctly. For
example, it is normal for people to skip diacritical marks, so my
library tries to be a little more clever: if a spell check fails, it
asks the dictionary for a suggestion, removes all diacritical marks
from both words, and compares them. If they match, the word is counted
as correct.
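
A minimal sketch of that fallback, not the actual library code. The
function names and the Spanish-only accent map are assumptions made for
illustration, and the dictionary is passed in as two callables so that
any spell-checking backend can be plugged in. Unlike the description
above, it compares against every suggestion rather than a single one.

```php
<?php
// Illustrative sketch: strip_marks() and word_matches() are hypothetical
// names, and the map below only covers Spanish accents, tilde and umlaut.

/** Replace accented letters with their plain equivalents. */
function strip_marks(string $word): string
{
    static $map = [
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'ü' => 'u', 'ñ' => 'n',
        'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U',
        'Ü' => 'U', 'Ñ' => 'N',
    ];
    return strtr($word, $map);
}

/**
 * Accept $word if the dictionary knows it, or if one of the dictionary's
 * suggestions equals it once the marks are removed from both words.
 */
function word_matches(string $word, callable $check, callable $suggest): bool
{
    if ($check($word)) {
        return true; // fast path: the word is spelled correctly
    }
    $plain = strtolower(strip_marks($word));
    // Slow path: asking for suggestions is the expensive step.
    foreach (($suggest($word) ?: []) as $candidate) {
        if (strtolower(strip_marks($candidate)) === $plain) {
            return true;
        }
    }
    return false;
}

// Possible wiring with the pspell extension (assumes it is installed
// together with a Spanish dictionary):
//
//   $dict = pspell_new('es');
//   $ok = word_matches(
//       'cancion',
//       function ($w) use ($dict) { return pspell_check($dict, $w); },
//       function ($w) use ($dict) { return pspell_suggest($dict, $w); }
//   );
```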

The problem with this approach is that asking for a suggestion is
extremely slow, and having to do it for every word that fails the check
makes the whole detection a lot slower.

Now, I tried the second option of using the PEAR class:
http://pear.php.net/package/Text_LanguageDetect

It worked reasonably well. As I suspected, it is very fast, and it can
detect 52 different languages. The only problem with it, as with all of
your suggestions, is that it needs a long enough sample text to be
accurate. In my tests it needed more than 10 or 20 words to give
reasonably confident results, but with longer samples it is very
accurate. My spell-checking approach, on the other hand, can be
accurate enough with very short samples, sometimes even with just one
word.

A big win for the PEAR class is that, given a long enough sample, it
can be very accurate even when the spelling is very bad. In that
scenario my spell-checking approach would fail miserably. By bad
spelling I mean not only skipped diacritical marks but also missing
characters.

Maybe I will use a combination of both: the PEAR class when I need to
detect the language of a long sample, and the spell checker for a short
one.
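
A rough sketch of how such a combination could be wired, assuming a
threshold of 20 words taken from the tests mentioned above. The names
count_words() and detect_language() are hypothetical, and both detectors
are passed in as callables, so nothing here depends on a particular
library.

```php
<?php
// Illustrative sketch: the function names and the default threshold are
// assumptions, not part of Text_LanguageDetect or of any existing library.

/** Count whitespace-separated words in a sample. */
function count_words(string $text): int
{
    return count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

/**
 * Route a sample to the detector suited to its length: the spell-checker
 * based one for short samples, the statistical one for long samples.
 */
function detect_language(
    string $text,
    callable $shortDetector,
    callable $longDetector,
    int $minWords = 20
) {
    return count_words($text) >= $minWords
        ? $longDetector($text)
        : $shortDetector($text);
}

// Possible wiring (assumes the PEAR package is installed; my_spell_detect
// stands for a spell-checker based detector and is hypothetical):
//
//   require_once 'Text/LanguageDetect.php';
//   $pear = new Text_LanguageDetect();
//   $lang = detect_language(
//       $sample,
//       'my_spell_detect',
//       function ($t) use ($pear) { return $pear->detectSimple($t); }
//   );
```

The threshold is worth tuning against real samples, since the point at
which the PEAR class becomes reliable may differ between languages.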

Thanks again for sharing your comments,

-William

On Wed, 28-03-2007 at 09:44 +0200, Satyam wrote:
> ----- Original Message ----- 
> From: "Zoltán Németh" <znemeth@xxxxxxxxxxxxxx>
> 
> >>
> >> In formal English, it's not allowed to use 've, 'm, etc.; "I'm" should
> >> be written as "I am". So that's not gonna work, I think.
> >> But words like "and" are really English, I think :)
> >> Keep in mind that this is quite a hard way, I think, but I don't have
> >> a better solution.
> >> Just for example, Dutch and Afrikaans are not very different, so it's
> >> really hard to see which of the 2 the text is written in.
> >>
> >> Tijnema
> >>
> >> ps. If you can't get the difference between Dutch and Afrikaans, guess
> >> for Dutch :) It's a lot more used than Afrikaans.
> >
> > yeah, looking for very frequently used words seems a better idea.
> >
> > greets
> > Zoltán Németh
> 
> In Spanish, as it happens with many languages that use diacritical marks, in 
> informal chatting you often skip them.  This has a long tradition in the 
> internet since years ago the support for those extra characters was 
> non-existent and today it is still somewhat patchy.  I used to have two 
> modes of writing in Spanish, formal writing with all proper accents, tilde 
> and umlauts and email mode, without any of those.  Nowadays, with support 
> for languages using the Roman alphabet widely available, there is no need to 
> omit diacritical marks, but you will often find them missing, particularly 
> in comments to blogs and other informal writing, just because of laziness or 
> carelessness or simply lack of formal education and in that I include 
> foreigners who more or less handle the language but not the minor details. 
> If English had accents, I would probably skip them.
> 
> So, using a spelling dictionary is not a good idea unless you can count on 
> your input being properly written.  A text in Spanish with its accents 
> missing will give you lots of errors, and we use just one sort of accent 
> (acute) plus tilde and umlaut.  The French use three sorts of accents, so 
> there is a far higher chance of getting misspellings.  I don't know how abundant 
> accents are in Magyar, for me Zoltan Nemeth is the same as Zoltán Németh, 
> but the first is a misspelling.
> 
> This problem also affects the frequency of individual letters.  Should you 
> first convert accented vowels to their plain version?  Because if you find 
> accented letters, it is a sure sign that it is not English, but if there is 
> none, it doesn't mean it is English, it might be some non-English text 
> without the correct accents.   Should you count 'a' and 'á' separate or add 
> them together because people often omit the accent?
> 
> So, I also vote for the frequently used words approach and against the 
> lowest number of misspellings.  And I would first convert everything to 
> plain, with no accents, both for the needle and the haystack.
> 
> Satyam
> 
> PS: also, it is accepted practice to omit accents on uppercase letters such 
> as in headings.  It is not grammatically correct but a typographical 
> convention which the printing industry has been using for ages: the accents 
> simply don't fit nicely. 
> 

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php