Tom Lane wrote:
>> True. It's not so much the speed as the fragility when faced with small
>> changes to formatting. In addition to whitespace, some clients mangle
>> punctuation with features like automatic "curly"-quoting.
>
> Yeah. I was wondering whether encoding differences wouldn't be a huge
> problem in practice, as well.

I'm not *too* worried about text encoding issues. In general it's very
obvious when text has been mangled due to bad encoding handling, and
it's extremely rare to see anything subtle like an app that transforms
accented chars to their base variants. Demangling strings damaged by bad
encoding handling is way out of scope, and sometimes not possible anyway.

I guess that Unicode's delightful support for various composed and
decomposed forms of the same glyph might be a problem. It's something I may
face in some other work I'm doing too, so I might have to see how hard
it'd be to put together a DB function that normalizes a UTF-8 string to
its fully composed variant (NFC). I don't think the decomposed forms see
much use in the wild though; they mostly come up as a security issue for
path/URL matching and the like.

http://unicode.org/reports/tr15/
http://msdn2.microsoft.com/en-us/library/ms776393(VS.85).aspx
http://earthlingsoft.net/ssp/blog/2006/07/unicode_normalisation
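
As a rough illustration of the composed/decomposed issue, here is a minimal Python sketch using the standard library's unicodedata module. The helper name to_nfc is made up for this example, and this is not the DB function mentioned above, just a way to see what normalizing to the fully composed form (NFC) does before comparing strings:

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition).

    to_nfc is a hypothetical helper name used only for this sketch.
    """
    return unicodedata.normalize("NFC", text)

# The same visible word, stored two different ways.
composed = "caf\u00e9"      # 'e with acute' as the single code point U+00E9
decomposed = "cafe\u0301"   # plain 'e' followed by combining acute U+0301

# A naive comparison treats them as different strings.
assert composed != decomposed

# After normalizing both sides, they compare equal.
assert to_nfc(composed) == to_nfc(decomposed)
```

Both the stored text and the search string would need to go through the same normalization for this to help with matching.
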
I don't know much about the CJK text representations, though, either in
Unicode or in other encodings like Big5. I *hope* the Unicode
normalization rules will be enough there, but I'm not sure.

All strings must, of course, be converted from their original encoding to
UTF-8 for queries. That might be troublesome with something like a web
form, where it can be hard to know the encoding of the input text (and
where browser bugs are the rule rather than the exception), but
thankfully it's not necessary to cater to every weird and broken browser.
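
A minimal Python sketch of that conversion step, assuming the caller already knows (or has been told) the input encoding. The function name to_utf8 and the choice to reject undecodable input rather than guess are assumptions made for this example:

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Decode raw bytes using the declared encoding and re-encode as UTF-8.

    to_utf8 is a hypothetical helper for this sketch. Decoding is strict,
    so input that does not match the declared encoding raises
    UnicodeDecodeError instead of being silently mangled. An unknown
    encoding name raises LookupError.
    """
    text = raw.decode(declared_encoding, errors="strict")
    return text.encode("utf-8")

# Latin-1 input: 'e with acute' is the single byte 0xE9.
latin1_bytes = "caf\u00e9".encode("latin-1")
assert to_utf8(latin1_bytes, "latin-1") == b"caf\xc3\xa9"

# Input that is not valid in the declared encoding is rejected.
try:
    to_utf8(b"\xff", "utf-8")
except UnicodeDecodeError:
    pass  # caller decides what to do; demangling is out of scope
```

Failing loudly fits the point that demangling damaged strings is out of scope: bad input gets refused rather than repaired.
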

So in this case I don't think encodings will be *too* much trouble
unless alternate Unicode normalization forms turn out to be more common
than I think they are.

--
Craig Ringer