On Tue, 2007-03-27 at 22:57, Tijnema ! wrote:
> On 3/27/07, Zoltán Németh <znemeth@xxxxxxxxxxxxxx> wrote:
> > On Tue, 2007-03-27 at 15:06, William Lovaton wrote:
> > > Hi there,
> > >
> > > I am trying to implement language detection with PHP for a web site I am
> > > trying to build. The idea is to take a piece of text and try to guess
> > > the language it is written in.
> > >
> > > I have two options but I'd like to know if you guys have a better idea.
> > >
> > > 1) I implemented a detector using spell checking, so if I run the text
> > > through many spell checkers, the one with the fewest errors is probably
> > > the right language for that text. It works quite well and I am pleased
> > > with it. The only thing I don't like is that loading many spell checkers
> > > is a bit of a waste; it may require a lot of CPU and a lot of memory
> > > depending on the dictionary and the number of dictionaries you load.
> > > Besides, it adds one extra module dependency (pspell).
> > >
> > > 2) The other option is implemented in PEAR and it's called
> > > Text_LanguageDetect:
> > > http://pear.php.net/package/Text_LanguageDetect
> > >
> > > It seems to use a very different technique called N-Gram-Based Text
> > > Categorization. I haven't tested it yet but I will very soon and see how
> > > well it works. It says it's in alpha state, but I guess it doesn't
> > > require pspell, doesn't consume a lot of memory and should be fast.
> > > The only thing I am worried about is how accurate it is... I'll check
> > > soon and post my comments later.
> > >
> > > 3) <Insert a very good idea here, please>
> > >
> > > I'd really like to hear what different alternatives all of you have for
> > > this problem.
> > >
> >
> > I've definitely no experience with this problem, just guessing ;)
> >
> > what if you build some arrays of language specific stuff and check for
> > that.
> > I mean you could store stuff like "if it contains 's, 've, 'm many
> > times it's probably English"... I don't really know how to store those
> > rules, and I'm not sure they are good enough (or whether there are good
> > enough rules) to tell several languages apart...
> >
> > greets
> > Zoltán Németh
>
> In formal English, contractions like 've and 'm are not allowed; "I'm"
> should be written as "I am". So that's not going to work, I think.
> But words like "and" are really English, I think :)
> Keep in mind that this is quite a hard way, I think, but I don't have a
> better solution.
> Just for example, Dutch and Afrikaans are not very different, so it's
> really hard to see which of the two the text is written in.
>
> Tijnema
>
> ps. If you can't tell the difference between Dutch and Afrikaans, guess
> Dutch :) It's used a lot more than Afrikaans.

yeah, looking for very frequently used words seems like a better idea.

greets
Zoltán Németh

> > > Thanks a lot,
> > >
> > > -William
> >
> > --
> > PHP General Mailing List (http://www.php.net/)
> > To unsubscribe, visit: http://www.php.net/unsub.php
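
The frequent-word idea from the thread could be sketched roughly as follows. This is only a minimal illustration, not anything posted by the participants: the function name guessLanguage, the $commonWords array and the tiny word lists are all made up for the example, and real lists would need to be much longer and tuned against sample texts. Ties go to whichever language is listed first, which is one way to express the "if in doubt, guess Dutch" suggestion.

```php
<?php
// Hypothetical helper (not from the thread): guess a language by counting
// how many tokens of the text are very common words of each language.
function guessLanguage(string $text, array $commonWords): ?string
{
    // Lowercase (ASCII only; enough for these sample lists) and split on
    // anything that is not a letter. The /u flag assumes UTF-8 input;
    // preg_split() returns false on invalid UTF-8, handled below.
    $tokens = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (!$tokens) {
        return null; // no words at all, or unreadable input
    }

    $best = null;
    $bestScore = 0;
    foreach ($commonWords as $lang => $words) {
        $lookup = array_flip($words); // word => index, for fast isset() checks
        $score = 0;
        foreach ($tokens as $token) {
            if (isset($lookup[$token])) {
                $score++;
            }
        }
        // Strict ">" means a tie keeps the language listed earlier.
        if ($score > $bestScore) {
            $best = $lang;
            $bestScore = $score;
        }
    }
    return $best; // null when none of the listed words occurred
}

// Tiny illustrative samples only; Dutch is listed before Afrikaans on purpose.
$commonWords = [
    'nl' => ['de', 'het', 'een', 'en', 'van', 'is', 'dat', 'niet'],
    'af' => ['die', 'het', 'en', 'van', 'is', 'nie', 'wat', 'vir'],
    'en' => ['the', 'and', 'of', 'to', 'is', 'in', 'that', 'it'],
];

echo guessLanguage('De kat is niet in het huis', $commonWords), "\n";
```

With these sample lists the call above should print nl. Short texts and closely related languages will still be unreliable with this approach, so it is probably best seen as a cheap first pass before pspell or Text_LanguageDetect.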