Re: Chars problem restoring to ps 8.4 (utf8) a dumped db from ps 8.1 (latin9)

Bianchi Quota Leonardo <leonardo.bianchiquota@xxxxxxxxx> · Wed, 19 Aug 2015 15:31:09 +0000

Hi, I will certainly upgrade to 9.4.4! I have already downloaded the RPMs for PostgreSQL 9.4.4, but I would rather not upgrade before resolving this issue, unless the upgrade is a prerequisite for the solution.

In answer to Tom's last post: I checked, and Bugzilla 3.2 (an old installation of Bugzilla) was set to "Use UTF-8 (Unicode) encoding for all text in Bugzilla".

Today I ran a test to provide more detail, and I hope it helps answer this question (which, if I understood correctly, is the point):
Does Bugzilla write data as UTF-8 regardless of the database's declared encoding?
(I ran the test on Bugzilla 5.0, the latest stable release, rather than on Bugzilla 3.2, my running application, because for now I don't want to test in the production environment.)
It would be very helpful to know whether this behavior confirms Tom's hypothesis.

---------------------TEST--------------------------------
On a new database, created via psql with: CREATE DATABASE bugsl9test WITH OWNER bugs ENCODING 'LATIN9' TEMPLATE template0 LC_COLLATE 'C' LC_CTYPE 'C';
I added two bugs: one with Bugzilla set to "utf8":"0" and the other with "utf8":"1" (1 means use UTF-8).
In both cases I typed the character "è" in the "Summary" field of the web form. In both cases the value stored in the short_desc column of the "bugs" table for that bug's row, viewed via pgAdmin III on Windows 7, is "Ãš".
However, with "utf8":"0" Bugzilla shows (I use Chrome) "Ã¨" for both bugs, while with "utf8":"1" it shows the stored "Ãš" CORRECTLY as "è".
-----------------------------------------------------------
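
One possible reading of these results, sketched in Python. It assumes the browser submitted "è" as UTF-8 bytes, that the bytes were stored unchanged in the LATIN9 database, and that Chrome fell back to Windows-1252 when Bugzilla declared no charset. None of these assumptions is confirmed in this thread; the codec names are Python's, used here only as stand-ins.

```python
# Assumption: the web form submitted "è" (U+00E8) encoded as UTF-8.
raw = "\u00e8".encode("utf-8")          # two bytes: C3 A8

# The same two bytes read under three different labels.
as_latin9 = raw.decode("iso8859_15")    # what a LATIN9-labelled database holds
as_cp1252 = raw.decode("cp1252")        # a browser guessing Windows-1252
as_utf8 = raw.decode("utf-8")           # a page that declares UTF-8

print(raw, as_latin9, as_cp1252, as_utf8)
```

Under those assumptions the three decodings correspond to what pgAdmin III showed, what Bugzilla showed with "utf8":"0", and what it showed with "utf8":"1".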

The full note on setting utf8 to "1" or "0" reads: "Use UTF-8 (Unicode) encoding for all text in Bugzilla. New installations should set this to true to avoid character encoding problems.
Existing databases should set this to true only after the data has been converted from existing legacy character encodings to UTF-8, using the contrib/recode.pl script."

recode.pl (https://github.com/bugzilla/bugzilla/blob/master/contrib/recode.pl) is a utility that converts a database from one encoding (or multiple encodings) to UTF-8. In a previous test I ran recode.pl to convert the data dumped as LATIN9 (editing "client_encoding" from latin9 to utf8, of course), and after restoring into the new UTF-8 database no "strange chars" were shown.
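
The relabeling step could be sketched roughly as follows. The helper name relabel_dump is hypothetical, and the sketch assumes a plain-text pg_dump file containing a line of the form SET client_encoding = '...';. It first checks that the dump really is valid UTF-8, since changing the label is only safe when the bytes were UTF-8 all along.

```python
import re

# Assumed form of the statement in a plain-text dump.
_SET_ENCODING = re.compile(rb"^SET client_encoding = '[^']*';", re.MULTILINE)


def relabel_dump(dump: bytes, new_encoding: str = "UTF8") -> bytes:
    """Hypothetical helper: change only the client_encoding label of a dump.

    Raises UnicodeDecodeError if the dump is not valid UTF-8, because in
    that case relabeling would just misdescribe the data.
    """
    dump.decode("utf-8")  # validity check only; the result is discarded
    replacement = b"SET client_encoding = '" + new_encoding.encode("ascii") + b"';"
    # A callable replacement avoids backslash processing in the new text.
    return _SET_ENCODING.sub(lambda match: replacement, dump, count=1)


sample = b"SET client_encoding = 'LATIN9';\nINSERT INTO bugs VALUES ('\xc3\xa8');\n"
print(relabel_dump(sample).decode("utf-8"))
```

The data bytes are left untouched; only the label changes. A dump that contains genuine LATIN9 bytes fails the check and would need a real conversion instead.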

Thank you very much for your attention and patience!

Bye,
Leonardo

-----Original Message-----
From: Tom Lane [mailto:tgl@xxxxxxxxxxxxx]
Sent: Thursday, 13 August 2015 16:39
To: Martín Marqués
Cc: Adrian Klaver; Bianchi Quota Leonardo; 'pgsql-general@xxxxxxxxxxxxxx'
Subject: Re:  Chars problem restoring to ps 8.4 (utf8) a dumped db from ps 8.1 (latin9)

Martín Marqués <martin.marques@xxxxxxxxx> writes:
> On 12/08/15 at 11:12, Tom Lane wrote:
>> It does not seem likely to me that this would work at all.  You're
>> taking a dump file that is full of LATIN9 data and simply asserting
>> that it's UTF8 data.  That doesn't make it so.  If it seemed to work,
>> maybe that's because your editor changed the encoding?  Not to be
>> relied on, for sure.

> Well, IIRC a LATIN9 encoding char which is interpreted as UTF8 will
> get inserted with no error on a UTF8 server (although the final data
> will be bogus).

I'd believe the other way around: if you tell the database that you're using LATIN9, but what you send is really UTF8, it will not reject it because the individual bytes are perfectly valid LATIN9 characters and there are no cross-byte checks to make in LATIN9.  But it seems highly unlikely that LATIN9-encoded data would get past the UTF8 validity checker with any consistency.
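
The asymmetry described above can be illustrated with a minimal Python sketch. Python's codecs are used here only as stand-ins for the server-side checks, which are not reproduced; the helper name accepted_as is hypothetical.

```python
def accepted_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode cleanly under the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "\u00e8".encode("utf-8")          # "è" as UTF-8: C3 A8
latin9_bytes = "\u00e8".encode("iso8859_15")   # "è" as LATIN9: E8

# UTF-8 data labelled as LATIN9: every byte is a valid single character.
print(accepted_as(utf8_bytes, "iso8859_15"))

# LATIN9 data labelled as UTF-8: E8 starts a multi-byte sequence that never completes.
print(accepted_as(latin9_bytes, "utf-8"))
```

In this sketch the first check passes and the second fails, matching the direction of the argument: a single-byte encoding has no cross-byte rules to violate, while UTF-8 does.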

It's possible that the problem is one of mislabeling, ie the database was claimed to use LATIN9 but what was actually sent was always UTF8.
If that was *always* the case then the OP's fix of changing the label in the dump file was actually the right thing to do.  But we haven't been given enough information to be sure of that --- and if that's what was happening, then some client software fixes would be in order anyway, because the client code was using the wrong client_encoding.

                        regards, tom lane