I'd like to store hostnames in a postgres database and I want to fully support IDNs (Internationalised Domain Names) I want to be able to recover the original representation of the hostname, so I can't just encode it with punycode and then store the ascii result. For example, these two are the same hostnames thanks to unicode case folding [1]: tesst.ëxämplé.com teßt.ëxämplé.com They both encode in punycode to the same thing: xn--tesst.xmpl.com-cib7f2a Don't believe me, then try visiting any domain with two s's in, whilst replacing the s's with ß's. E.g: ericßon.com nißan.com americanexpreß.com So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no idea which of those two hostnames was the original representation. The trouble is, if I store the unicode representation of a hostname instead, then when I run queries with conditions like: WHERE hostname='nißan.com' that wont pull out rows where hostname='nissan.com'. Also, if I create a unique index on the hostname field, or even on lower(hostname), that wont stop the same hostname being inserted more than once, with slightly different representations. So the system I've settled with is storing both the originally supplied representation, *and* the lower cased punycode encoded version in a separate column for indexing/search. This seems really hackish to me though. It seems to me that Postgres would benefit from a native hostname type and/or a pair of punycode encode/decode functions. And perhaps even a simple unicode case folding function. With the end result that these return TRUE: unicode_case_fold('ß') = 'ss' 'xn--tesst.xmpl.com-cib7f2a' = punycode_encode('teßt.ëxämplé.com') punycode_decode('xn--tesst.xmpl.com-cib7f2a') = 'tesst.ëxämplé.com' A native type would also be able to apply suitable constraints, e.g a maximum length of 253 octets on a punycode-encoded trailing-dot-excluded hostname, a limit of 1-63 octets on a punycode encoded label, no leading or trailing hyphens on a label, etc. I initially created a pair of functions for using punycode using plperl and the IDNA::Punycode Perl module, but the performance wasn't good enough, so I settled with duplicating the data to sacrifice disk space and complexity, for speed. I'm new to Postgres, and to this list, so if there is a better way for me to submit this suggestion or solve my problem, please point me in the right direction. [1] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt Regards, -- Mike Cardwell https://grepular.com https://emailprivacytester.com OpenPGP Key 35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F XMPP OTR Key 8924 B06A 7917 AAF3 DBB1 BF1B 295C 3C78 3EF1 46B4
Attachment:
signature.asc
Description: Digital signature