On Tuesday, 21 March 2006 at 21:14, Ben K. wrote:
> I just wanted to add that when I created the same database with -E
> SQL_ASCII on my linux box, the dump was loaded fine. I created another
> database without -E and observed the same invalid encoding problem.

This is not really surprising, since SQL_ASCII, unlike all other encodings, does not check the encoding of incoming data.

> On the face value this seems to solve the problem at least superficially.

The more interesting question is what your application does with the non-ASCII characters in your database. The answer to that question will tell you what the correct contents would be.

> I'd like to check the data validity, and the easiest way seems to be to
> dump the data again from the linux box and compare with the original.

Your application defines what is valid. Even if you knew that the dump was identical, that would not tell you anything about the validity of the data. A better check would be to connect to both servers with the application(s) and work with some records that do contain non-ASCII characters. If both servers give the same results with your application(s), you have most probably got the encoding right.

> Is there a way to compare between any two databases online? (like running
> a script checking row counts and schema) If I run crc on the concat of all
> fields in a row, and if the crc matches, would it be reasonably
> sufficient? Is there a stronger validation method?

Since any general tool for comparing database contents (I don't know of such a tool) would use its own drivers and setup, it would probably not get the same result as a test with your client applications.

The bottom line is that only an encoding set at the server level makes the meaning of non-ASCII characters unambiguous. The server can then convert between the server encoding and the client encoding, so that different clients can work even with different internal encodings.
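The per-row checksum idea from the question can be sketched client-side. This is a minimal Python illustration, not a complete comparison tool: it assumes rows arrive as tuples from whatever driver you use, and it also shows the validity check that SQL_ASCII skips (whether stored bytes decode as UTF-8). Note that a bare concatenation of fields can collide ("ab","c" vs. "a","bc"), so a separator is needed:

```python
import hashlib

def row_checksum(row):
    # Join fields with a separator unlikely to occur in the data;
    # plain concatenation would make ("ab", "c") and ("a", "bc") collide.
    joined = "\x1f".join("" if f is None else str(f) for f in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def is_valid_utf8(raw: bytes) -> bool:
    # SQL_ASCII stores bytes unchecked; this is the validation a UTF8
    # database would have applied on input.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Even if every checksum matches, this only proves the two servers hold the same bytes, not that those bytes mean what the applications think they mean, which is the point about testing with the real clients.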
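Why "only the client knows" under SQL_ASCII can be shown in a few lines of Python: the server stores bytes verbatim, so the same stored bytes read back as different text, or fail to read at all, depending on the encoding each client assumes. (The byte string below is just an illustrative example, not data from your database.)

```python
def decoded_views(raw: bytes) -> dict:
    """Show how the same stored bytes read under different client encodings."""
    views = {}
    for enc in ("latin-1", "utf-8"):
        try:
            views[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            views[enc] = None  # these bytes are invalid under this encoding
    return views

# "Grüße" as a Latin-1 client would have written it into an SQL_ASCII
# database (bytes stored verbatim, no server-side check):
latin1_bytes = "Grüße".encode("latin-1")
```

With a real server-side encoding, the server would instead convert between its encoding and each client's declared client encoding, so both clients would see the same characters.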
With SQL_ASCII, only the client application knows. This kind of setup needs a lot of care to get consistent data, especially when several different applications are used.

The drawback of selecting an encoding is a small performance penalty. However, in my databases I could not measure any difference. I have to say that my data does not contain many strings, so it is definitely not a good test case for this.

Since several different clients with different languages use my databases, I use Unicode (UTF8) as the encoding. This works without any problem for me.

Best regards
Ivo

> Thanks.
>
> Ben K.
> Developer
> http://benix.tamu.edu