Re: Language detection with PHP

Zoltán Németh <znemeth@xxxxxxxxxxxxxx> · Tue, 27 Mar 2007 23:02:19 +0200

On Tuesday, 27 Mar 2007, at 22:57, Tijnema ! wrote:
> On 3/27/07, Zoltán Németh <znemeth@xxxxxxxxxxxxxx> wrote:
> > On Tuesday, 27 Mar 2007, at 15:06, William Lovaton wrote:
> > > Hi there,
> > >
> > > I am trying to implement language detection with PHP for a web site I am
> > > trying to build.  The idea is to take a piece of text and try to guess
> > > the language it is written in.
> > >
> > > I have two options but I'd like to know if you guys have a better idea.
> > >
> > > 1) I implemented a detector using spell checking: if I run the text
> > > through several spell checkers, the one that reports the fewest errors
> > > is probably the right language for that text.  It works quite well and
> > > I am pleased with it.  The only thing I don't like is that loading many
> > > spell checkers is a bit of a waste; it may require a lot of CPU and
> > > memory, depending on the size and number of dictionaries you load.
> > > Besides, it adds one extra module dependency (pspell).
> > >
> > > 2) The other option is implemented in PEAR and it's called
> > > Text_LanguageDetect:
> > >     http://pear.php.net/package/Text_LanguageDetect
> > >
> > > It seems to use a very different technique called N-Gram-Based Text
> > > Categorization.  I haven't tested it yet, but I will very soon to see
> > > how well it works.  It says it's in alpha state, but I guess it doesn't
> > > require pspell, doesn't consume a lot of memory, and should be fast.
> > > The only thing I am worried about is how accurate it is... I'll check
> > > soon and post my comments later.
> > >
> > > 3) <Insert a very good idea here, please>
> > >
> > > I'd really like to hear what different alternatives all of you have for
> > > this problem.
> > >
> >
> > I've definitely no experience with this problem, just guessing ;)
> >
> > What if you build some arrays of language-specific features and check
> > for those? I mean you could store rules like "if it contains 's, 've, 'm
> > many times, it's probably English"... I don't really know how to store
> > those rules, and I'm not sure they are good enough (or whether good
> > enough rules exist) to tell several languages apart...
> >
> > greets
> > Zoltán Németh
> 
> In formal English, contractions like 've and 'm are not allowed; "I'm"
> should be written as "I am". So that's not going to work, I think.
> But words like "and" are really English, I think :)
> Keep in mind that this is quite a hard approach, I think, but I don't
> have a better solution.
> Just for example, Dutch and Afrikaans are not very different, so it's
> really hard to tell which of the two the text is written in.
> 
> Tijnema
> 
> ps. If you can't tell the difference between Dutch and Afrikaans, guess
> Dutch :) It's used a lot more than Afrikaans.

Yeah, looking for very frequently used words seems like a better idea.
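
A minimal sketch of the frequent-word idea in PHP, just to make it concrete.
Everything here is an assumption for illustration: the function name
guess_language() is made up, the word lists are tiny hand-picked samples
rather than real frequency data, and it needs PHP 7.1+ with mbstring. On a
tie the language listed first wins, so putting Dutch before Afrikaans
follows the "guess Dutch" tip above.

```php
<?php
// Hypothetical helper: guess the language of $text by counting how often
// each language's very common words occur in it.
function guess_language(string $text, array $commonWords): ?string
{
    // Lowercase, then split on anything that is not a letter (Unicode-aware).
    $tokens = preg_split(
        '/[^\p{L}]+/u',
        mb_strtolower($text, 'UTF-8'),
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    if (!$tokens) {
        return null; // empty text, no letters, or invalid UTF-8
    }
    $counts = array_count_values($tokens);

    $best = null;
    $bestScore = 0;
    foreach ($commonWords as $lang => $words) {
        $score = 0;
        foreach ($words as $word) {
            $score += $counts[$word] ?? 0;
        }
        // Strict comparison: on a tie, the language listed first is kept.
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $lang;
        }
    }
    return $best; // null when none of the listed words occurs
}

// Illustrative samples only; a real list would come from a large corpus.
$commonWords = [
    'en' => ['the', 'and', 'of', 'to', 'is', 'in', 'that', 'it'],
    'nl' => ['de', 'het', 'een', 'en', 'van', 'is', 'dat', 'niet'],
    'af' => ['die', 'het', 'en', 'van', 'is', 'nie', 'dat', 'vir'],
    'hu' => ['a', 'az', 'és', 'hogy', 'nem', 'is', 'egy', 'ez'],
];

echo guess_language('The cat is in the garden and it is happy', $commonWords), "\n";
```

With lists this short, many words are shared between languages ("is", "en",
"het", "van"), so short inputs will often tie or be misclassified; longer
lists and longer texts should help, but I have not measured how much.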

greets
Zoltán Németh

> 
> >
> > > Thanks a lot,
> > >
> > >
> > > -William
> > >

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php