Search Postgresql Archives

Text-indexing UTF-8 bytea, convert_from() immutability, null bytes...

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Dear Experts,

I have a table that contains bytea data which sometimes is UTF-8 text. 
When it is - and that is indicated by another column - I want to 
text-search index it.  Something like this:

db=> create index ix on tbl using gin (to_tsvector('english',body)) where is_utf8_text;

(Note my client encoding is UTF-8.)

That doesn't work because to_tsvector doesn't take bytea.  So I tried 
to_tsvector('english',body::text) ; that almost works, but CRLFs in the 
data become \015\012 (i.e. 6 characters) in the text and the tsvector 
contains things like '12hello'.

Next I tried to_tsvector('english',convert_from(body::text,'UTF-8')).  That 
doesn't work because convert_from is not immutable.  (This is 9.6; maybe that 
has changed.)  Is there a good reason for that?  Maybe because I might change 
the client encoding? As a hack I tried ALTER FUNCTION to make it immutable, 
and now I get:

ERROR:  invalid byte sequence for encoding "UTF8": 0x00

Hmm, as far as I'm aware 0x00 is fine in UTF-8; what's that mean?  But 
actually I'd be more than happy to ignore invalid UTF-8 here, since I'm 
only using it for text search; there may be some truly invalid UTF-8 
in the data.  Is there a "permissive" mode for charset conversion?

(That error also suggests that the convert_from is not optimising the 
conversion from UTF-8 to UTF-8 to a no-op.)

Anyway: given the problem of creating a text search index over bytea data 
that contains UTF-8 text, which may include oddities like null bytes, what 
would you do?

Thanks for any suggestions!

Phil.






[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux