Tom Lane <tgl@xxxxxxxxxxxxx> writes:

> Peter Eisentraut <peter_e@xxxxxxx> writes:
> > By the way, I have always been concerned about the feature of Unicode
> > that you can write logically equivalent strings using different
> > code-point sequences.  Namely, you often have the option of writing an
> > accented letter using the "legacy" single codepoint (like in ISO
> > 8859-something) or alternatively using accent plus "base letter" as two
> > code points.  Collating systems should treat them the same, so hashing
> > the byte values won't work anyway.  This is a more extreme case of
> > "tyty" vs. "tty" because using a proper rendering system, those Unicode
> > strings should look the same to the naked eye.  Therefore, I'm doubtful
> > that using a binary comparison as tie-breaker is proper behavior.
>
> Hm.  Would you expect that these sequences generate identical strxfrm
> output?

I think this is mixing up two different things.

Using iso-8859-1 to encode "é" as a single byte versus using UTF8, which takes two bytes to encode it, is an issue of using two *different* encodings. The actual string of characters being encoded is precisely the same string. That is, while the sequence of bytes in the encoded string is different, the sequence of characters being encoded is precisely the same.

Postgres doesn't really face this issue currently since it only supports one encoding at a time anyway. If Postgres supported multiple encodings and it was necessary to compare two strings in two different encodings, then they would probably both have to be converted to a common encoding (presumably UTFx for some value of x) before comparing.

There is a separate issue that some characters could theoretically have multiple representations even within the same encoding.
This doesn't really happen in the usual non-UTF encodings (like iso-8859-x) to my knowledge, but it can happen in UTF8 or UTF16 because, for example, you could use the variable-length form that takes 2 bytes or even 4 bytes for characters that are really just plain ASCII characters.

However, there are canonicalization rules that make everything but the shortest representation an invalid Unicode string. I assume these exist precisely to make it easier to compare or hash Unicode strings. I guess it's an open question whether the database should signal an error on such invalid strings or silently treat them as equivalent to a correct encoding of the same string.

On the original issue, I think the bottom line is that strings are sequences of characters, and two sequences of characters should only compare equal if they contain the same characters in the same order. The encodings can be different and still represent the same string, but they do have to represent the same sequence of characters. If they represent two different sequences of characters, even if the two sequences have the same significance in the language of the reader, they're still not actually the same sequence of characters.

As long as both strings are encoded in the same encoding (whether that be iso-8859-1 or UTF8 or whatever), sorting by strcoll and then strcmp will effectively give this set of semantics, with one exception: invalid UTF encodings that are not canonicalized, which it will silently treat as distinct from the correctly encoded string.

One day, when it's possible for the two strings to be in two different encodings, they will both have to be converted to an encoding that includes the union of the two character sets covered by the two encodings.

--
greg
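
To make the three cases above concrete, here is a minimal Python sketch. It is only an illustration, not how Postgres does anything: the names is_valid_utf8 and compare are made up for this example, compare merely imitates "strcoll, then strcmp as tie-breaker", and it uses Python's code-point ordering of str as a stand-in for strcmp on the encoded bytes (for UTF8 the two orderings agree, but that is an assumption about the encoding in use).

```python
import locale

# One character, U+00E9, stored under two different encodings.
e_acute = "\u00e9"
latin1_bytes = e_acute.encode("iso-8859-1")  # one byte
utf8_bytes = e_acute.encode("utf-8")         # two bytes

# The bytes differ, but both decode to the same character sequence.
bytes_differ = latin1_bytes != utf8_bytes
same_string = latin1_bytes.decode("iso-8859-1") == utf8_bytes.decode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if raw passes Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xC0 0xAF is a two-byte "overlong" form of "/" (U+002F); the only
# valid UTF-8 encoding of "/" is the single byte 0x2F.
overlong_slash = b"\xc0\xaf"
shortest_slash = b"\x2f"


def compare(a: str, b: str) -> int:
    """Hypothetical comparator: locale collation first, code points as tie-breaker."""
    c = locale.strcoll(a, b)
    if c != 0:
        return (c > 0) - (c < 0)
    # Collation says "equal"; fall back to a plain code-point comparison.
    return (a > b) - (a < b)


# Precomposed "é" versus "e" followed by a combining acute accent (U+0301).
composed = "\u00e9"
decomposed = "e\u0301"

if __name__ == "__main__":
    print(bytes_differ, same_string)
    print(is_valid_utf8(overlong_slash), is_valid_utf8(shortest_slash))
    print(compare(composed, composed), compare(composed, decomposed) != 0)
```

Here the decoder takes the "signal an error" side of the open question for the overlong form, and the tie-breaker means the precomposed and accent-plus-base-letter spellings can never compare equal, whatever the locale's collation says about them.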