Using preg_match to find Japanese text

PHP list,

While I'm only just learning about regular expressions in another thread, I still seem to be finding exceptional situations which have me questioning how far preg expressions can be pushed.

(The following contains UTF-8 encoded Japanese text. Apologies if it comes out as ASCII gibberish.)

What I have are sentences that look like this:
気温 【きおん】 (n) atmospheric temperature; (P); EP
について (exp) concerning; along; under; per; KD

I want to divide the first line into three variables, $word, $reading, and $meaning. And I want to divide the second line into two variables, $word and $meaning.

If I can figure out how to extract the first variable, $word, then I can figure out the rest. But that first step seems to be a doozy.
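For illustration, here is a minimal sketch of how both line formats might be split in one pass. It assumes the reading, when present, is always wrapped in 【 and 】, that the fields are separated by whitespace, and that the script file itself is saved as UTF-8. The function name parseEntry is hypothetical, and the type hints assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper: splits one dictionary line into word, reading, meaning.
// Returns null if the line does not look like an entry at all.
function parseEntry(string $line): ?array
{
    // ^(\S+)               word: everything up to the first whitespace
    // (?:\s+【([^】]+)】)?   optional reading between 【 and 】
    // \s+(.+)$             meaning: the rest of the line
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/^(\S+)(?:\s+【([^】]+)】)?\s+(.+)$/u';

    if (preg_match($pattern, trim($line), $m) !== 1) {
        return null;
    }

    return [
        'word'    => $m[1],
        // An unmatched optional group comes back as an empty string.
        'reading' => $m[2] !== '' ? $m[2] : null,
        'meaning' => $m[3],
    ];
}

$first  = parseEntry('気温 【きおん】 (n) atmospheric temperature; (P); EP');
$second = parseEntry('について (exp) concerning; along; under; per; KD');
```

With the two sample lines, $first carries a reading of きおん, while $second has a reading of null and everything after the first space as its meaning.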

The way I see it, I could do it two ways. One is to pull out all the characters up to the first occurrence of a space, and assume that they're Japanese. I'm not sure how to write that expression, but maybe I could.
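A minimal sketch of the first approach, assuming the word is always the first thing on the line and is followed by ordinary whitespace. The variable names are only for illustration.

```php
<?php
$line = '気温 【きおん】 (n) atmospheric temperature; (P); EP';

// ^      anchor at the start of the line
// (\S+)  capture one or more non-whitespace characters
// /u     treat the pattern and the subject as UTF-8
if (preg_match('/^(\S+)/u', $line, $matches) === 1) {
    $word = $matches[1]; // 気温
} else {
    $word = null;        // the line was empty or started with whitespace
}
```

This makes no check that the captured text is actually Japanese; it only relies on position.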

But it seems like it would be a lot more sophisticated if I could determine whether a word was Japanese by testing its Unicode value or some similar method. That way I would be less vulnerable to slight variations in the positioning of words in the source material.
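One way to test by Unicode value might be PCRE's script properties, sketched below. This assumes PHP's PCRE library was built with Unicode property support, which is worth verifying on the target installation. The function name leadingJapanese is hypothetical, and the type hints assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the run of Japanese characters at the start
// of a line, or null if the line does not begin with one.
function leadingJapanese(string $line): ?string
{
    // \p{Han}       kanji (also matches Chinese ideographs)
    // \p{Hiragana}  hiragana
    // \p{Katakana}  katakana
    // ー            prolonged sound mark, listed explicitly because some
    //               PCRE versions may not count it as katakana
    $pattern = '/^[\p{Han}\p{Hiragana}\p{Katakana}ー]+/u';

    return preg_match($pattern, $line, $m) === 1 ? $m[0] : null;
}

$word = leadingJapanese('について (exp) concerning; along; under; per; KD');
```

Note that \p{Han} cannot tell Japanese kanji from Chinese text, so this identifies the script rather than the language.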

Looking at all the multibyte related functions in the PHP manual, it seems there are options for testing the type of encoding, but not for the type of language or character set.
http://jp2.php.net/manual/en/ref.mbstring.php
However, I could be wrong about this (and it would be nice if I was).

Searching the web, I came across this guy's script to test if characters were above the usual ASCII range in Unicode, and could therefore be assumed to be Japanese:
http://www.randomchaos.com/documents/?source=php_and_unicode

But this seems unwieldy: if I understand it correctly, I'd have to test each individual word. I could use it to test whether there was any Japanese at all in a string, but I'm not confident I could use it to extract words.
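Detection and extraction can share one pattern, as in this rough sketch. It uses explicit code point ranges rather than an "anything above ASCII" test: the Hiragana block (U+3040 to U+309F), the Katakana block (U+30A0 to U+30FF), and the main CJK Unified Ideographs block (U+4E00 to U+9FFF). Rarer kanji outside that block are deliberately ignored here, and the function names are hypothetical.

```php
<?php
// Character class covering the three blocks named above.
const JAPANESE_RUN = '[\x{3040}-\x{309F}\x{30A0}-\x{30FF}\x{4E00}-\x{9FFF}]+';

// Hypothetical helper: is there any Japanese text in the string at all?
function containsJapanese(string $text): bool
{
    return preg_match('/' . JAPANESE_RUN . '/u', $text) === 1;
}

// Hypothetical helper: every consecutive run of Japanese characters,
// in the order they appear.
function japaneseRuns(string $text): array
{
    preg_match_all('/' . JAPANESE_RUN . '/u', $text, $matches);
    return $matches[0];
}

$runs = japaneseRuns('気温 【きおん】 (n) atmospheric temperature; (P); EP');
// $runs is ['気温', 'きおん']; the brackets 【 】 fall outside the ranges.
```

Because preg_match_all scans the whole string, there is no need to split the line into words first and test each one.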

So I'm a little stuck. If anyone has any advice to help get me started, it would be much appreciated.

Thank you for your time and help.

--
Dave M G

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

