Re: Language detection with PHP

On Tuesday 27 March 2007 17:33, Zoltán Németh wrote:
> On Tuesday, 2007-03-27 at 15:06, William Lovaton wrote:
> > Hi there,
> >
> > I am trying to implement language detection with PHP for a web site I am
> > trying to build.  The idea is to take a piece of text and try to guess
> > the language it is written in.
> >
> > I have two options but I'd like to know if you guys have a better idea.
> >
> > 1) I implemented a detector using spell checking: if I run the text
> > through several spell checkers, the one with the fewest errors is
> > probably the right language for that text.  It works quite well and I
> > am pleased with it.  The only thing I don't like is that loading many
> > spell checkers is a bit of a waste; it may require a lot of CPU and
> > memory, depending on the size and number of dictionaries you load.
> > Besides, it adds one extra module dependency (pspell).
> >
> > 2) The other option is implemented in PEAR and it's called
> > Text_LanguageDetect:
> > http://pear.php.net/package/Text_LanguageDetect
> >
> > It seems to use a very different technique called N-Gram-Based Text
> > Categorization.  I haven't tested it yet, but I will very soon to see
> > how well it works.  It says it's in alpha state, but I guess it doesn't
> > require pspell, doesn't consume a lot of memory, and should be fast.
> > The only thing I am worried about is how accurate it is... I'll check
> > soon and post my comments later.
> >
> > 3) <Insert a very good idea here, please>
> >
> > I'd really like to hear what different alternatives all of you have for
> > this problem.
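
A rough idea of how option 2 can work, as a minimal sketch and not the code
of Text_LanguageDetect: "N-Gram-Based Text Categorization" is also the title
of a paper by Cavnar and Trenkle, which ranks the most frequent character
n-grams of a text and compares that ranking against one reference ranking
per language. The function names below (trigramProfile, outOfPlaceDistance,
guessLanguage), the use of trigrams only, the 300-entry limit and the fixed
penalty for missing trigrams are all assumptions made for illustration. The
input is assumed to be valid UTF-8.

```php
<?php
// Build a ranked trigram profile: trigram => rank (0 = most frequent).
// Hypothetical helper; $limit = 300 is an arbitrary illustrative cut-off.
function trigramProfile(string $text, int $limit = 300): array
{
    // Collapse whitespace and pad with spaces so word boundaries count too.
    $normalized = ' ' . trim(preg_replace('/\s+/u', ' ', $text)) . ' ';
    // Split into UTF-8 characters (assumes the input is valid UTF-8).
    $chars = preg_split('//u', $normalized, -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    for ($i = 0, $n = count($chars) - 2; $i < $n; $i++) {
        $gram = $chars[$i] . $chars[$i + 1] . $chars[$i + 2];
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }

    arsort($counts); // most frequent trigrams first
    return array_flip(array_slice(array_keys($counts), 0, $limit));
}

// "Out-of-place" distance: sum of rank differences between two profiles.
// A trigram missing from the reference costs a fixed penalty (assumed 300).
function outOfPlaceDistance(array $document, array $reference, int $missingPenalty = 300): int
{
    $distance = 0;
    foreach ($document as $gram => $rank) {
        $distance += isset($reference[$gram])
            ? abs($rank - $reference[$gram])
            : $missingPenalty;
    }
    return $distance;
}

// Return the key of the sample whose profile is closest to the text.
// $samples maps a language label to a sample text in that language.
function guessLanguage(string $text, array $samples): string
{
    $document = trigramProfile($text);
    $best = '';
    $bestDistance = PHP_INT_MAX;
    foreach ($samples as $language => $sample) {
        $distance = outOfPlaceDistance($document, trigramProfile($sample));
        if ($distance < $bestDistance) {
            $best = $language;
            $bestDistance = $distance;
        }
    }
    return $best;
}
```

Calling guessLanguage($text, ['en' => $englishSample, 'pt' => $portugueseSample])
returns the label of the closest sample. No case folding is done here, so
lowercasing the text and the samples first would probably help, and accuracy
depends heavily on how large and representative the sample texts are.
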
>
> I have no real experience with this problem, just guessing ;)
>
> What if you build some arrays of language-specific markers and check for
> those?  I mean you could store rules like "if it contains 's, 've, 'm many
> times, it's probably English"... I don't really know how to store those
> rules, and I'm not sure they are good enough (or whether good enough
> rules even exist) to tell several languages apart...
>
> greets
> Zoltán Németh
>
> > Thanks a lot,
> >
> >
> > -William

Good tip!! =]

Brazilian Portuguese: ç, ã, õ, á, é, í, ó, ú, â, ê, ô, à, ü
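
One way to store such rules is a plain array of marker strings per language,
counting how often each marker occurs in the text. This is only a sketch:
the constant LANGUAGE_MARKERS, the function scoreByMarkers and the language
labels are hypothetical names, and the marker lists are illustrative rather
than a vetted rule set. It assumes the source file and the text are both
UTF-8 and that apostrophes are the plain ASCII kind.

```php
<?php
// Hypothetical marker lists taken from the examples in this thread.
const LANGUAGE_MARKERS = [
    'en'    => ["'s", "'ve", "'m"],
    'pt_BR' => ['ç', 'ã', 'õ', 'á', 'é', 'í', 'ó', 'ú'],
];

// Count marker occurrences per language; highest score comes first.
// Matching is byte-wise and case-sensitive, so upper-case accented
// letters are not counted in this sketch.
function scoreByMarkers(string $text, array $markers = LANGUAGE_MARKERS): array
{
    $scores = [];
    foreach ($markers as $language => $needles) {
        $scores[$language] = 0;
        foreach ($needles as $needle) {
            $scores[$language] += substr_count($text, $needle);
        }
    }
    arsort($scores); // sort by score, descending, keeping the language keys
    return $scores;
}
```

The first key of the returned array is the best guess. Short texts, or texts
written without accents, will often score zero for every language, so this
probably works best as a cheap first pass before a heavier detector.
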

-- 
Davi Vidal
davividal@xxxxxxxxxxxxxxxx
davividal@xxxxxxxxx
--

Now with fortune:
"Take a lesson from the whale; the only time he gets speared is when he
raises to spout."

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


