Hostnames, IDNs, Punycode and Unicode Case Folding

Mike Cardwell <pgsql@xxxxxxxxxxxxxxxxxx> · Mon, 29 Dec 2014 22:36:42 +0000

I'd like to store hostnames in a postgres database and I want to fully support
IDNs (Internationalised Domain Names)

I want to be able to recover the original representation of the hostname, so I
can't just encode it with punycode and then store the ascii result. For example,
these two are the same hostnames thanks to unicode case folding [1]:

  tesst.ëxämplé.com
  teßt.ëxämplé.com

They both encode in punycode to the same thing:

  xn--tesst.xmpl.com-cib7f2a

Don't believe me, then try visiting any domain with two s's in, whilst replacing
the s's with ß's. E.g:

  ericßon.com
  nißan.com
  americanexpreß.com

So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no idea
which of those two hostnames was the original representation.

The trouble is, if I store the unicode representation of a hostname instead,
then when I run queries with conditions like:

  WHERE hostname='nißan.com'

that wont pull out rows where hostname='nissan.com'. Also, if I create a unique
index on the hostname field, or even on lower(hostname), that wont stop the
same hostname being inserted more than once, with slightly different
representations.

So the system I've settled with is storing both the originally supplied
representation, *and* the lower cased punycode encoded version in a separate
column for indexing/search. This seems really hackish to me though.

It seems to me that Postgres would benefit from a native hostname type and/or
a pair of punycode encode/decode functions. And perhaps even a simple unicode
case folding function. With the end result that these return TRUE:

  unicode_case_fold('ß') = 'ss'

  'xn--tesst.xmpl.com-cib7f2a' = punycode_encode('teßt.ëxämplé.com')

  punycode_decode('xn--tesst.xmpl.com-cib7f2a') = 'tesst.ëxämplé.com'

A native type would also be able to apply suitable constraints, e.g a maximum
length of 253 octets on a punycode-encoded trailing-dot-excluded hostname, a
limit of 1-63 octets on a punycode encoded label, no leading or trailing hyphens
on a label, etc.

I initially created a pair of functions for using punycode using plperl and the
IDNA::Punycode Perl module, but the performance wasn't good enough, so I settled
with duplicating the data to sacrifice disk space and complexity, for speed.

I'm new to Postgres, and to this list, so if there is a better way for me to
submit this suggestion or solve my problem, please point me in the right
direction.

[1] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

Regards,

-- 
Mike Cardwell  https://grepular.com https://emailprivacytester.com
OpenPGP Key    35BC AF1D 3AA2 1F84 3DC3   B0CF 70A5 F512 0018 461F
XMPP OTR Key   8924 B06A 7917 AAF3 DBB1   BF1B 295C 3C78 3EF1 46B4
Attachment:
signature.asc

Description: Digital signature