
Re: Initial ugly reverse-translator


Tom Lane wrote:

>> True. It's not so much the speed as the fragility when faced with small changes to formatting. In addition to whitespace, some clients mangle punctuation with features like automatic "curly"-quoting.
>
> Yeah.  I was wondering whether encoding differences wouldn't be a huge
> problem in practice, as well.
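
To make the formatting fragility concrete, here is a minimal sketch of folding whitespace and "curly" quotes before comparing message strings. The function name `fold_for_matching` and the small punctuation table are hypothetical, chosen only for illustration; a real client may mangle other characters as well.

```python
import re

# Hypothetical, deliberately small table: typographic quotes -> ASCII quotes.
_PUNCT = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def fold_for_matching(text: str) -> str:
    """Return text with curly quotes straightened and whitespace runs collapsed."""
    text = text.translate(_PUNCT)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()
```

Both the stored message and the pasted message would need to go through the same folding for the comparison to be meaningful.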

I'm not *too* worried about text encoding issues. In general it's very obvious when text has been mangled due to bad encoding handling, and it's extremely rare to see anything subtle like an app that transforms accented chars to their base variants. Demangling strings damaged by bad encoding handling is way out of scope, and sometimes not possible anyway.

I guess that Unicode's delightful support for various composed and decomposed forms of the same glyph might be a problem. It's something I may face in some other work I'm doing too, so I might have to see how hard it'd be to put together a DB function that normalizes a UTF-8 string to its fully composed form (NFC). I don't think the decomposed forms see much use in the wild, though; they mostly come up as a security issue for path/URL matching and the like.

http://unicode.org/reports/tr15/
http://msdn2.microsoft.com/en-us/library/ms776393(VS.85).aspx
http://earthlingsoft.net/ssp/blog/2006/07/unicode_normalisation
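
As a rough illustration of the composed/decomposed problem, the sketch below uses Python's standard `unicodedata` module rather than a DB function. The helper name `to_nfc` is hypothetical; the point is only that two strings that render identically compare unequal until both are normalized to the same form.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalize text to Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "caf\u00e9"      # precomposed U+00E9

# The raw strings differ, but their NFC forms are identical.
raw_equal = decomposed == composed
nfc_equal = to_nfc(decomposed) == to_nfc(composed)
```

Normalizing both the stored strings and the query string to the same form is what matters; NFC is used here only because it matches the "fully composed" variant mentioned above.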

I don't know much about the CJK text representations, though, either in Unicode or in other encodings like Big5. I *hope* the Unicode normalization rules will be enough there, but I'm not sure.

All strings must of course be converted from their original encoding to UTF-8 for queries. That might be troublesome with something like a web form, where it can be hard to know the encoding of the input text (and where browser bugs are the rule rather than the exception), but thankfully it's not necessary to cater to every weird and broken browser.
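
A minimal sketch of that conversion step, assuming the caller already knows (or has been told) the input encoding. The function name `to_utf8` and the `declared_encoding` parameter are hypothetical. Decoding is strict, so mangled input fails loudly instead of being silently "repaired".

```python
def to_utf8(raw: bytes, declared_encoding: str = "utf-8") -> bytes:
    """Re-encode raw bytes from declared_encoding to UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in the declared
    encoding, and LookupError if the encoding name is unknown.
    """
    text = raw.decode(declared_encoding)   # strict error handling by default
    return text.encode("utf-8")


def try_to_utf8(raw: bytes, declared_encoding: str = "utf-8"):
    """Return UTF-8 bytes, or None when the input cannot be decoded."""
    try:
        return to_utf8(raw, declared_encoding)
    except (UnicodeDecodeError, LookupError):
        return None
```

Rejecting undecodable input outright fits the view that demangling damaged strings is out of scope.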

So in this case I don't think encodings will be *too* much trouble unless alternate Unicode normalization forms turn out to be more common than I think they are.

--
Craig Ringer

