postgres=# select to_tsvector('test text');
to_tsvector
---------------
'test text':1
(1 row)
OK, that's related to the
http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/tsearch2/wordparser/parser.c.diff?r1=1.11;r2=1.12;f=h
commit. Thomas pointed out that the character could be a non-breaking space (0xa0),
and that commit assumes that, with the C locale and a multibyte encoding, any
character above 0x7f is alpha.
To check this theory, please apply the attached patch.
If it does turn out to be the cause, I'm confused: we cannot assume that 0xa0 is a
space symbol in every multibyte encoding, even on Windows.
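
For illustration only, here is a minimal standalone sketch of the classification
rule described above, before and after the attached patch. The function names
is_alpha_before() and is_alpha_patched() are hypothetical and do not exist in
parser.c; the sketch also assumes the character has already been decoded into an
unsigned int, which the real parser does elsewhere.

```c
#include <ctype.h>

/*
 * Hypothetical stand-in for the C-locale, multibyte branch of the parser's
 * alpha test as it behaves after the commit in question: every code point
 * above 0x7f is reported as alpha, which includes 0xa0.
 */
static int
is_alpha_before(unsigned int c)
{
	if (c > 0x7f)
		return 1;
	return isalpha(0xff & c);
}

/*
 * The same test with the check from the attached patch: 0xa0 is singled out
 * and reported as non-alpha, so it can no longer glue two words together.
 */
static int
is_alpha_patched(unsigned int c)
{
	if (c > 0x7f)
	{
		if (c == 0xa0)
			return 0;
		return 1;
	}
	return isalpha(0xff & c);
}
```

If the theory is right, is_alpha_before(0xa0) being true is what lets 'test text'
come out as a single lexeme, while is_alpha_patched(0xa0) is false and the two
words would be split again.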
--
Teodor Sigaev E-mail: teodor@xxxxxxxxx
WWW: http://www.sigaev.ru/
*** ./contrib/tsearch2/wordparser/parser.c.orig Wed Mar 21 20:41:23 2007
--- ./contrib/tsearch2/wordparser/parser.c Wed Mar 21 21:10:39 2007
***************
*** 124,130 ****
--- 124,134 ----
* with C-locale is an alpha character
*/
if ( c > 0x7f )
+ {
+ if ( c == 0xa0 )
+ return 0;
return 1;
+ }
return isalnum(0xff & c);
}
***************
*** 157,163 ****
--- 161,171 ----
* with C-locale is an alpha character
*/
if ( c > 0x7f )
+ {
+ if ( c == 0xa0 )
+ return 0;
return 1;
+ }
return isalpha(0xff & c);
}