Sorry, one last detail.
All of my databases use UTF-8 encoding. My Windows XP locale is en_AU and
defaults to the ISO-8859-1 character set. My postgresql.conf leaves the
client_encoding setting at its default, which should then fall back to
the database's UTF-8 encoding.
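For what it's worth, this is roughly how I have been checking the encodings
from the client (the SET line would only be needed if the client session were
actually sending something other than UTF-8):

    -- check what the server and this session believe they are using
    SHOW server_encoding;    -- UTF8 for my databases
    SHOW client_encoding;    -- whatever the client defaulted to
    -- force the session to send/expect UTF-8 if the default differs
    SET client_encoding = 'UTF8';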
Andrew wrote:
One additional data point. I just ran the CREATE TEXT SEARCH DICTIONARY
command without the StopWords declaration, using the OO dictionaries,
and it worked fine: select ts_lexize('public.fr_ispell',
'catalogue'); executed with no problems. However, after
creating an associated configuration based on a copy of the
pg_catalog.french configuration, calls to ts_debug against my custom
French configuration result in the 0xc3 error. So it looks like the
problem is restricted to the parsing of the stop file.
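For reference, this is roughly what I ran (the mapping step is reconstructed
from memory, so the token types listed may not be exactly the ones I used):

    -- dictionary built from the OO files, with no StopWords declared
    CREATE TEXT SEARCH DICTIONARY public.fr_ispell (
        TEMPLATE = pg_catalog.ispell,
        DictFile = fr_FR,
        AffFile = fr_FR
    );
    SELECT ts_lexize('public.fr_ispell', 'catalogue');   -- works

    -- configuration copied from the built-in French one
    CREATE TEXT SEARCH CONFIGURATION public.fr (COPY = pg_catalog.french);
    ALTER TEXT SEARCH CONFIGURATION public.fr
        ALTER MAPPING FOR asciiword, word WITH public.fr_ispell, french_stem;

    SELECT * FROM ts_debug('public.fr', 'catalogue');    -- 0xc3 error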
I ran through the other out-of-the-box stemmers, which I have not
touched in any way, and the error also occurs with the Portuguese
configuration.
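Roughly the check I used for that (the sample words are arbitrary):

    SELECT * FROM ts_debug('portuguese', 'catalogo');   -- same 0xc3 error
    SELECT * FROM ts_debug('english', 'catalogue');     -- fine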
Cheers
Andy
Andrew wrote:
I have a feeling that an issue I'm running into is related to this:
http://archives.postgresql.org/pgsql-bugs/2008-06/msg00113.php
On Windows XP, running PgAdmin III 1.8.4 against either a PostgreSQL
8.3.0 or 8.3.3 database, when I attempt to run:
select * from ts_debug('french', 'catalogue');
I get the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0xc3
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: SQL function "ts_debug" statement 1
I have replaced the french.stop file with the one from the Snowball
web site
(http://snowball.tartarus.org/algorithms/french/stemmer.html) to see
if that would make any difference, but the issue remains the same. I
have also attempted to load the French Hunspell dictionary from the
OpenOffice.org web site
(http://wiki.services.openoffice.org/wiki/Dictionaries),
using the following command:
CREATE TEXT SEARCH DICTIONARY public.fr_ispell (
    TEMPLATE = pg_catalog.ispell,
    DictFile = fr_FR,
    AffFile = fr_FR,
    StopWords = french
);
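The minimal check I then run against it is roughly along these lines:

    SELECT ts_lexize('public.fr_ispell', 'catalogues');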
But I get the same error. I have successfully loaded the English
and Arabic dictionaries, along with an Arabic stop file I sourced from
elsewhere, and they work fine with the various text search function
calls, so the problem appears to be specifically related to a French
character occurring in the stop file and the dictionaries. To use the
French OO dictionaries, I had to convert them from ISO-8859-15
encoding to UTF-8. Since converting them on Windows still gave the
same result as with the packaged stop file, I downloaded them again and
converted the encoding on a Linux machine before copying them across
to Windows to see if that would help, but it didn't.
However, if I run ts_debug('french', 'catalogue') against a
Linux installation of PostgreSQL 8.3.1, it works fine. I have not
tried version 8.3.1 on Windows. While there are many more combinations
to exhaust before I can make a categorical statement, at this stage it
appears to point towards an issue with the UTF-8 parsing in
PostgreSQL on Windows.
Is this an outstanding defect, or is there something that I'm doing
wrong in my environment? I have tried to find anything related
on the Internet, but other than the reference above I have not found
anything, which surprises me given what I would imagine to be the size
of the French user base. Hence, I'm thinking that perhaps something in
my environment is causing the issue. If others could also reproduce
the error on their XP machines, that would indicate that the issue is
not specific to me.
At this stage it is not that important to me, as I'm just playing
around with text search out of my own curiosity, and French is just a
language I picked at random, along with Arabic (for which I'm
lacking a Snowball stemmer). I don't actually read, much less speak,
those languages. However, it would still be nice to have them working.
An additional related topic: for some languages, OO provides thesaurus
files which are not in the format supported by Pg full text
search. Are there any plans to support the OO thesaurus file
formats? They also have hyphenation files. Are there any plans to
extend the current dictionary files to include hyphenation rules as
captured in the OO hyphenation files? I'm not sure whether, or how,
hyphenation rules would improve indexing and searching, but since the
files exist, I thought I would pose the question.
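For context, my understanding is that the thesaurus dictionaries Pg currently
supports are declared roughly like this, pointing at a flat
"sample words : indexed words" .ths file rather than an OO-format one (the
file name here is made up purely for illustration):

    CREATE TEXT SEARCH DICTIONARY public.fr_thesaurus (
        TEMPLATE = pg_catalog.thesaurus,
        DictFile = thesaurus_fr,       -- $SHAREDIR/tsearch_data/thesaurus_fr.ths
        Dictionary = pg_catalog.french_stem
    );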
Thanks,
Andy