(This is a cross-post from Stack Exchange, where it is not getting much traction.)

On my Mac install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
 to_tsvector
-------------
 'abcd':1
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols   | สวัสดี | {}             |              |
(2 rows)
```

On my Linux install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
    to_tsvector
-------------------
 'abcd':1 'สวัสดี':2
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |    description    | token |  dictionaries  |  dictionary  | lexemes
-----------+-------------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII   | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols     |       | {}             |              |
 word      | Word, all letters | สวัสดี | {english_stem} | english_stem | {สวัสดี}
(3 rows)
```

So something is clearly different about the way tokenisation is defined in the two installs. My question is: how do I figure out what is different, and how do I make my Mac install of PG behave like the Linux one?

On both installs:

```
# SHOW default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.english
(1 row)

# show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)
```

So somehow the Mac install thinks that Thai letters are spaces. How do I debug this and fix the "Space symbols" definition here?

Interestingly, this install works with Armenian but falls over when we reach Hebrew:

```
=# select * from ts_debug('ԵԵԵ');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | ԵԵԵ   | {english_stem} | english_stem | {եեե}
(1 row)

=# select * from ts_debug('אאא');
 alias |  description  | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
 blank | Space symbols | אאא   | {}           |            |
(1 row)
```
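
One possible way to narrow this down is to run the same diagnostic queries on both installs and compare the output. The following is only a sketch: it assumes both machines use the built-in `default` text search parser, and it does not fix anything. It shows the server build, the encoding and locale the current database was created with, and the raw parser classification before any dictionary is involved. The combined test string is made up for illustration from the samples above.

```sql
-- Sketch of a side-by-side comparison: run on both installs and diff the output.
-- Assumption: the built-in parser named 'default' is used on both machines.

-- 1. Server version and build platform.
SELECT version();

-- 2. Encoding and locale the current database was created with.
--    These are per-database settings stored in the catalog.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();

-- 3. Raw parser output, with each token id mapped to its token type alias.
--    No text search configuration or dictionary is consulted here.
SELECT p.tokid, t.alias, p.token
FROM ts_parse('default', 'hello สวัสดี אאא ԵԵԵ') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```

If the third query already reports `blank` for the Thai and Hebrew tokens on one machine and `word` on the other, the difference lies in how the parser classifies characters on that platform rather than in the `english` configuration or its dictionaries.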