Re: Duplicate Values or Not?!


Greg Stark <gsstark@xxxxxxx> writes:

> Tom Lane <tgl@xxxxxxxxxxxxx> writes:
> 
> > If that does change the results, it indicates you've got strings which
> > are bytewise different but compare equal according to strcoll().  We've
> > seen this and other misbehaviors from some locale definitions when faced
> > with data that is invalid per the encoding the locale expects.
> 
> There are plenty of non-bytewise-identical strings that do legitimately
> compare equal in various locales. Does the hash code hash strxfrm or the
> original bytes?

Hm. Some experimentation shows that, at least with glibc's locale definitions,
the strings that I thought compared equal don't actually compare equal.
Capitalization, punctuation, and white space are basically ignored in general
in non-C locales, but they do seem to compare non-equal when they're the only
differentiating factor.

Is this guaranteed by any spec? Or is counting on this behaviour unsafe?
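
A minimal sketch of the kind of probe such an experiment might use. The helper
name collates_equal_despite_bytes is hypothetical, and which string pairs (if
any) trigger it depends entirely on the locale definitions installed:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true when the two strings differ bytewise
 * but strcoll() in the current LC_COLLATE locale reports them as equal.
 */
bool
collates_equal_despite_bytes(const char *a, const char *b)
{
    return strcmp(a, b) != 0 && strcoll(a, b) == 0;
}
```

A caller would first select a non-C locale with setlocale(LC_COLLATE, "") and
then probe pairs that differ only in case, punctuation, or white space.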

If it's legal for strcoll to compare as equal two byte-wise different strings
then the hash function really ought to be calling strxfrm before hashing or
else it will be inconsistent. It doesn't seem to be doing so currently.
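
As an illustration of hashing the strxfrm output rather than the original
bytes, here is a rough sketch. The function name hash_collation_key and the
choice of FNV-1a are assumptions made for the example; this is not the hash
code Postgres actually uses:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch: hash the locale's collation key for a string, so
 * that strings whose strxfrm() keys are identical get the same hash value.
 * Aborts on out-of-memory to keep the example short.
 */
uint32_t
hash_collation_key(const char *s)
{
    /* With n == 0, strxfrm() only reports the key length (without NUL). */
    size_t      len = strxfrm(NULL, s, 0);
    char       *key = malloc(len + 1);
    uint32_t    h = 2166136261u;    /* FNV-1a 32-bit offset basis */
    size_t      i;

    if (key == NULL)
        abort();
    strxfrm(key, s, len + 1);

    for (i = 0; i < len; i++)
    {
        h ^= (unsigned char) key[i];
        h *= 16777619u;             /* FNV-1a 32-bit prime */
    }

    free(key);
    return h;
}
```

The cost is two strxfrm calls and an allocation for every string hashed.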

I find it interesting that Perl has faced this same dilemma and chose to
override the locale definition in this case. If the locale definition
compares two strings as equal, then Perl does a bytewise comparison and uses
that to break ties. This guarantees that non-bytewise-identical strings don't
compare equal. I suspect they did it for a similar reason too, namely keeping
the semantics in sync with Perl hashes.

Postgres could follow that model. I think it would solve any inconsistencies
just fine and not cause problems. However, it would be visible to users, which
may be considered a bug if the locale really does claim the strings are equal
but Postgres doesn't agree. On the other hand, I think it would perform better
than a lot of extra calls to strxfrm, since the extra memcmp would only rarely
kick in.
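
A minimal sketch of that tie-breaking comparison, assuming NUL-terminated
strings. The name collate_with_tiebreak is hypothetical, and strcmp stands in
for the memcmp that would be used on length-counted data:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, not an existing Postgres function: compare using
 * the current locale's collation, and fall back to a bytewise comparison
 * only when strcoll() reports a tie.  The result is 0 only for
 * byte-identical strings, which keeps equality consistent with a hash
 * computed over the raw bytes.
 */
int
collate_with_tiebreak(const char *a, const char *b)
{
    int         result;

    assert(a != NULL && b != NULL);

    result = strcoll(a, b);
    if (result == 0)
        result = strcmp(a, b);  /* rarely reached: the tie-break */
    return result;
}
```

The bytewise pass runs only when strcoll returns 0, so ordinary comparisons
pay nothing extra.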

-- 
greg

