Re: return language of a word

shahrzad khorrami wrote:
hi all,

Is there any function that returns the language of a word in a sentence?

For example: My name is شهرزاد .

When it sees شهرزاد, it should notice that the word is Persian.


Thanks

How exactly would you expect this to work?
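I can't think of any way to be 100% sure what language a word belongs to. For example, the word "computer" has Latin roots and reached most languages through English, so it is present in _many_ of them and its spelling alone doesn't tell you which one you are looking at. In a very general way, though, you could figure out which script a word is written in by checking which Unicode block its characters fall in. (The blocks are defined on Unicode code points, so this works the same whether the text is stored as UTF-8 or in some other encoding, once it has been decoded.)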

Here's an overview:
U+0000 ... U+007F: Basic Latin
U+0080 ... U+00FF: Latin-1 Supplement
U+0100 ... U+017F: Latin Extended-A
U+0180 ... U+024F: Latin Extended-B
U+0250 ... U+02AF: IPA Extensions
U+02B0 ... U+02FF: Spacing Modifier Letters
U+0300 ... U+036F: Combining Diacritical Marks
U+0370 ... U+03FF: Greek and Coptic
U+0400 ... U+04FF: Cyrillic
U+0500 ... U+052F: Cyrillic Supplement
U+0530 ... U+058F: Armenian
U+0590 ... U+05FF: Hebrew
U+0600 ... U+06FF: Arabic
U+0700 ... U+074F: Syriac
U+0750 ... U+077F: Arabic Supplement
U+0780 ... U+07BF: Thaana
U+07C0 ... U+07FF: NKo
U+0900 ... U+097F: Devanagari
U+0980 ... U+09FF: Bengali
U+0A00 ... U+0A7F: Gurmukhi
U+0A80 ... U+0AFF: Gujarati
U+0B00 ... U+0B7F: Oriya
U+0B80 ... U+0BFF: Tamil
U+0C00 ... U+0C7F: Telugu
U+0C80 ... U+0CFF: Kannada
U+0D00 ... U+0D7F: Malayalam
U+0D80 ... U+0DFF: Sinhala
U+0E00 ... U+0E7F: Thai
U+0E80 ... U+0EFF: Lao
U+0F00 ... U+0FFF: Tibetan
U+1000 ... U+109F: Myanmar
U+10A0 ... U+10FF: Georgian
U+1100 ... U+11FF: Hangul Jamo
U+1200 ... U+137F: Ethiopic
U+1380 ... U+139F: Ethiopic Supplement
U+13A0 ... U+13FF: Cherokee
U+1400 ... U+167F: Unified Canadian Aboriginal Syllabics
U+1680 ... U+169F: Ogham
U+16A0 ... U+16FF: Runic
U+1700 ... U+171F: Tagalog
U+1720 ... U+173F: Hanunoo
U+1740 ... U+175F: Buhid
U+1760 ... U+177F: Tagbanwa
U+1780 ... U+17FF: Khmer
U+1800 ... U+18AF: Mongolian
U+1900 ... U+194F: Limbu
U+1950 ... U+197F: Tai Le
U+1980 ... U+19DF: New Tai Lue
U+19E0 ... U+19FF: Khmer Symbols
U+1A00 ... U+1A1F: Buginese
U+1B00 ... U+1B7F: Balinese
U+1B80 ... U+1BBF: Sundanese
U+1C00 ... U+1C4F: Lepcha
U+1C50 ... U+1C7F: Ol Chiki
U+1D00 ... U+1D7F: Phonetic Extensions
U+1D80 ... U+1DBF: Phonetic Extensions Supplement
U+1DC0 ... U+1DFF: Combining Diacritical Marks Supplement
U+1E00 ... U+1EFF: Latin Extended Additional
U+1F00 ... U+1FFF: Greek Extended
U+2000 ... U+206F: General Punctuation
U+2070 ... U+209F: Superscripts and Subscripts
U+20A0 ... U+20CF: Currency Symbols
U+20D0 ... U+20FF: Combining Diacritical Marks for Symbols
U+2100 ... U+214F: Letterlike Symbols
U+2150 ... U+218F: Number Forms
U+2190 ... U+21FF: Arrows
U+2200 ... U+22FF: Mathematical Operators
U+2300 ... U+23FF: Miscellaneous Technical
U+2400 ... U+243F: Control Pictures
U+2440 ... U+245F: Optical Character Recognition
U+2460 ... U+24FF: Enclosed Alphanumerics
U+2500 ... U+257F: Box Drawing
U+2580 ... U+259F: Block Elements
U+25A0 ... U+25FF: Geometric Shapes
U+2600 ... U+26FF: Miscellaneous Symbols
U+2700 ... U+27BF: Dingbats
U+27C0 ... U+27EF: Miscellaneous Mathematical Symbols-A
U+27F0 ... U+27FF: Supplemental Arrows-A
U+2800 ... U+28FF: Braille Patterns
U+2900 ... U+297F: Supplemental Arrows-B
U+2980 ... U+29FF: Miscellaneous Mathematical Symbols-B
U+2A00 ... U+2AFF: Supplemental Mathematical Operators
U+2B00 ... U+2BFF: Miscellaneous Symbols and Arrows
U+2C00 ... U+2C5F: Glagolitic
U+2C60 ... U+2C7F: Latin Extended-C
U+2C80 ... U+2CFF: Coptic
U+2D00 ... U+2D2F: Georgian Supplement
U+2D30 ... U+2D7F: Tifinagh
U+2D80 ... U+2DDF: Ethiopic Extended
U+2DE0 ... U+2DFF: Cyrillic Extended-A
U+2E00 ... U+2E7F: Supplemental Punctuation
U+2E80 ... U+2EFF: CJK Radicals Supplement
U+2F00 ... U+2FDF: Kangxi Radicals
U+2FF0 ... U+2FFF: Ideographic Description Characters
U+3000 ... U+303F: CJK Symbols and Punctuation
U+3040 ... U+309F: Hiragana
U+30A0 ... U+30FF: Katakana
U+3100 ... U+312F: Bopomofo
U+3130 ... U+318F: Hangul Compatibility Jamo
U+3190 ... U+319F: Kanbun
U+31A0 ... U+31BF: Bopomofo Extended
U+31C0 ... U+31EF: CJK Strokes
U+31F0 ... U+31FF: Katakana Phonetic Extensions
U+3200 ... U+32FF: Enclosed CJK Letters and Months
U+3300 ... U+33FF: CJK Compatibility
U+3400 ... U+4DBF: CJK Unified Ideographs Extension A
U+4DC0 ... U+4DFF: Yijing Hexagram Symbols
U+4E00 ... U+9FFF: CJK Unified Ideographs
U+A000 ... U+A48F: Yi Syllables
U+A490 ... U+A4CF: Yi Radicals
U+A500 ... U+A63F: Vai
U+A640 ... U+A69F: Cyrillic Extended-B
U+A700 ... U+A71F: Modifier Tone Letters
U+A720 ... U+A7FF: Latin Extended-D
U+A800 ... U+A82F: Syloti Nagri
U+A840 ... U+A87F: Phags-pa
U+A880 ... U+A8DF: Saurashtra
U+A900 ... U+A92F: Kayah Li
U+A930 ... U+A95F: Rejang
U+AA00 ... U+AA5F: Cham
U+AC00 ... U+D7AF: Hangul Syllables
U+D800 ... U+DB7F: High Surrogates
U+DB80 ... U+DBFF: High Private Use Surrogates
U+DC00 ... U+DFFF: Low Surrogates
U+E000 ... U+F8FF: Private Use Area
U+F900 ... U+FAFF: CJK Compatibility Ideographs
U+FB00 ... U+FB4F: Alphabetic Presentation Forms
U+FB50 ... U+FDFF: Arabic Presentation Forms-A
U+FE00 ... U+FE0F: Variation Selectors
U+FE10 ... U+FE1F: Vertical Forms
U+FE20 ... U+FE2F: Combining Half Marks
U+FE30 ... U+FE4F: CJK Compatibility Forms
U+FE50 ... U+FE6F: Small Form Variants
U+FE70 ... U+FEFF: Arabic Presentation Forms-B
U+FF00 ... U+FFEF: Halfwidth and Fullwidth Forms
U+FFF0 ... U+FFFF: Specials
U+10000 ... U+1007F: Linear B Syllabary
U+10080 ... U+100FF: Linear B Ideograms
U+10100 ... U+1013F: Aegean Numbers
U+10140 ... U+1018F: Ancient Greek Numbers
U+10190 ... U+101CF: Ancient Symbols
U+101D0 ... U+101FF: Phaistos Disc
U+10280 ... U+1029F: Lycian
U+102A0 ... U+102DF: Carian
U+10300 ... U+1032F: Old Italic
U+10330 ... U+1034F: Gothic
U+10380 ... U+1039F: Ugaritic
U+103A0 ... U+103DF: Old Persian
U+10400 ... U+1044F: Deseret
U+10450 ... U+1047F: Shavian
U+10480 ... U+104AF: Osmanya
U+10800 ... U+1083F: Cypriot Syllabary
U+10900 ... U+1091F: Phoenician
U+10920 ... U+1093F: Lydian
U+10A00 ... U+10A5F: Kharoshthi
U+12000 ... U+123FF: Cuneiform
U+12400 ... U+1247F: Cuneiform Numbers and Punctuation
U+1D000 ... U+1D0FF: Byzantine Musical Symbols
U+1D100 ... U+1D1FF: Musical Symbols
U+1D200 ... U+1D24F: Ancient Greek Musical Notation
U+1D300 ... U+1D35F: Tai Xuan Jing Symbols

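Check each character of the word against that table, and you'll be able to roughly figure out which script (the writing system, which is the usual word for this) the characters, and thus the word, belong to. Keep in mind that a script is not a language: Persian, Arabic and Urdu are all written with characters from the Arabic block, so for your example this tells you "Arabic script" and narrows the candidates, but it can't decide that the word is Persian.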
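
The lookup described above could be sketched roughly like this. It is written in Python purely for illustration and is not an existing library function; the names BLOCKS, block_of and word_blocks are made up for this example, and BLOCKS holds only a handful of rows copied from the table, so any character outside them comes back as None until you add more rows.

```python
import bisect

# A small excerpt of the block table above:
# (first code point, last code point, block name), sorted by first code point.
# Hypothetical helper data: extend it with more rows from the list as needed.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(char):
    """Return the block name for one character, or None if it is not in BLOCKS."""
    code = ord(char)
    # Find the last block whose start is <= code, then check its upper bound.
    i = bisect.bisect_right(STARTS, code) - 1
    if i >= 0 and BLOCKS[i][0] <= code <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None


def word_blocks(sentence):
    """Pair each whitespace-separated word with the set of blocks its letters fall in.

    Punctuation and digits are skipped, so a token like "." gets an empty set.
    """
    result = []
    for word in sentence.split():
        blocks = {block_of(c) for c in word if c.isalpha()}
        result.append((word, blocks))
    return result


if __name__ == "__main__":
    for word, blocks in word_blocks("My name is شهرزاد ."):
        print(word, blocks)
```

For the sentence from the question, the first three words land in "Basic Latin" and شهرزاد lands in "Arabic". Returning a set per word rather than a single name is deliberate: a word can mix blocks (accented Latin letters, for instance, sit in "Latin-1 Supplement"), and it is up to the caller to decide what to do with that.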

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

