Re: Hostnames, IDNs, Punycode and Unicode Case Folding

Andrew Sullivan <ajs@xxxxxxxxxxxxxxx> · Mon, 29 Dec 2014 19:22:21 -0500

[resending, because somehow this got routed through my work address the first time.]

Hi,

I didn't have time to write a short note, so I wrote a long one
instead.  Sorry.

On Mon, Dec 29, 2014 at 10:36:42PM +0000, Mike Cardwell wrote:
> can't just encode it with punycode and then store the ascii result. For example,
> these two are the same hostnames thanks to unicode case folding [1]:
> 
>   tesst.ëxämplé.com
>   teßt.ëxämplé.com

Well, in IDNA2003 they're the same.  In IDNA2008 (RFC 5890 and suite),
they're not the same.  In UTS46, they're kind of the same, because
pre-lookup processing maps one of them to the other.  (Which way the
mapping goes depends on which mode you're in, which is just fantastic
because you can't tell at the server which mode the client is in.
IDNA is an unholy mess.)  The lookup is still done using the IDNA2008
rules, approximately.

> They both encode in punycode to the same thing:
> 
>   xn--tesst.xmpl.com-cib7f2a

Under no circumstances should they encode to that.  IDNA, either 2003
or 2008, is label by label, not for entire domain names.  (The labels
are the things between the dots in a presentation-format FQDN.)

So, under IDNA2003, you should get tesst.xn--xmpl-loa7ai.com for both
of them.  Under IDNA2008, you should get that also for the first of
them, and xn--tet-6ka.xn--xmpl-loa7ai.com for the second.  Most
clients, however, are probably going to want to do something UTS46ish,
but there's no way to guarantee that the IDNA2003 and IDNA2008 labels
go together (this is called "variants", and you do not want to know
how much horror that has caused).
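
A minimal sketch of the label-by-label point, assuming Python 3 (chosen
only for illustration).  Python's built-in "idna" codec implements
IDNA2003; the second function is a rough stand-in for IDNA2008 that
does raw Punycode per label with no mapping and none of the IDNA2008
validity checks.  Both function names are hypothetical.

```python
import unicodedata

def idna2003_ascii(name):
    # Built-in "idna" codec: IDNA2003 (nameprep maps ß to ss), label by label.
    return name.encode("idna").decode("ascii")

def unmapped_alabels(name):
    # Assumption: NFC plus per-label Punycode approximates an IDNA2008
    # A-label for already-valid U-labels.  No validity checking is done here.
    out = []
    for label in unicodedata.normalize("NFC", name).split("."):
        if label.isascii():
            out.append(label)
        else:
            out.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(out)

for host in ("tesst.\u00ebx\u00e4mpl\u00e9.com", "te\u00dft.\u00ebx\u00e4mpl\u00e9.com"):
    print(idna2003_ascii(host), unmapped_alabels(host))
# tesst.xn--xmpl-loa7ai.com tesst.xn--xmpl-loa7ai.com
# tesst.xn--xmpl-loa7ai.com xn--tet-6ka.xn--xmpl-loa7ai.com
```

Note that only the labels containing non-ASCII characters get the xn--
prefix; "com" and "tesst" pass through untouched.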

Worse, of course, you also don't know that what you have isn't just
raw UTF-8 in the label, but let's boil one ocean at a time.

> Don't believe me, then try visiting any domain with two s's in, whilst replacing
> the s's with ß's. E.g:
> 
>   ericßon.com
>   nißan.com
>   americanexpreß.com

This depends entirely on which version of IDNA you're using.  Many
browsers right now officially do IDNA2003.  Unfortunately, they don't
_actually_ do that either, because IDNA2003 is nominally nailed to
Unicode version 3.2, and there's approximately zero chance that a
running computer on the Internet these days is using such an old
Unicode version.

IDNA2008, by the way, wasn't something we did (I am one of the people
you can blame for this) for fun.  The very problem you're noting in
IDNA2003 is one of the things we were trying to fix.  Under IDNA2003,
the Unicode-Punycode-Unicode round trip could lose data.  Under
IDNA2008, this is fixed: every A-label (the xn--Punycodehere version)
corresponds to exactly one U-label (the Unicode representation) and
conversely.  (It follows from this that lots of Unicode strings aren't
U-labels, because there are a lot of rules about what can be a
U-label.  For instance, capital letters aren't allowed, because
they're not stable under caseFold.  This is all in the IDNA2008 RFCs,
but they're not an easy read.  We tried.)
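
The round-trip difference can be seen in a few lines.  This is a
sketch assuming Python 3, whose built-in "idna" codec is IDNA2003; the
raw "punycode" codec is used here as an approximation of how an
IDNA2008 A-label is formed, without any validity checking.

```python
original = "te\u00dft"  # teßt

# IDNA2003: ß is mapped to ss before encoding, so the original is lost.
wire_2003 = original.encode("idna")
back_2003 = wire_2003.decode("idna")

# Unmapped Punycode (IDNA2008-style): the A-label decodes to the same U-label.
a_label = "xn--" + original.encode("punycode").decode("ascii")
u_label = a_label[4:].encode("ascii").decode("punycode")

print(wire_2003, back_2003, back_2003 == original)  # b'tesst' tesst False
print(a_label, u_label, u_label == original)        # xn--tet-6ka teßt True
```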

> So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no idea
> which of those two hostnames was the original representation.

None.  What you need to do is split the name on label boundaries.
(That is hard, because believe it or not "." is a valid character
inside a DNS label; splitting on unescaped "." characters is probably
as good as you can do here, so look out for escaped ones.)  Then you
can check each label for validity under IDNA2008 and IDNA2003, and
then you can run it through Punycode.
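
A sketch of the splitting step, assuming Python 3 and assuming the
input uses the RFC 1035 presentation-format escapes ("\X" for a literal
character and "\DDD" for an octet given as three decimal digits).  The
function is hypothetical, not an existing library call.

```python
def split_labels(name):
    """Split a presentation-format name on unescaped dots."""
    labels, current, i = [], [], 0
    while i < len(name):
        ch = name[i]
        if ch == "\\":
            digits = name[i + 1:i + 4]
            if len(digits) == 3 and digits.isascii() and digits.isdigit():
                value = int(digits)
                if value > 255:
                    raise ValueError("\\DDD escape out of octet range")
                current.append(chr(value))   # \DDD: octet by decimal value
                i += 4
            elif i + 1 < len(name):
                current.append(name[i + 1])  # \X: next character is literal
                i += 2
            else:
                raise ValueError("dangling backslash at end of name")
            continue
        if ch == ".":
            labels.append("".join(current))  # unescaped dot ends a label
            current = []
        else:
            current.append(ch)
        i += 1
    # A single trailing (root) dot does not produce an extra empty label.
    if current or not labels:
        labels.append("".join(current))
    return labels

print(split_labels(r"a\.b.example.com."))  # ['a.b', 'example', 'com']
```

Empty labels in the middle of a name (as in "a..b") are returned as
empty strings, so the caller can reject them during validation.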

ICANN, for whatever it's worth, is using IDNA2008 rules for top-level
domains and as part of its IDNA guidelines, so over time the actually
_registered_ names are going to be either IDNA2008 or, at worst,
UTS46.

> The trouble is, if I store the unicode representation of a hostname instead,
> then when I run queries with conditions like:
> 
>   WHERE hostname='nißan.com'
> 
> that won't pull out rows where hostname='nissan.com'.

Right.  If I were doing this, I think I'd probably create two
functions, one to do IDNA2003 and one to do IDNA2008.  Then I'd put a
functional index on the column for each of them.  Eventually, you'll
be able to drop the IDNA2003 lookup because everything will conform to
IDNA2008.  (Note that the WHATWG, which the W3C is going to listen to
but which is not a W3C WG, appears to be trying to undo that; but
IDNA2003 is irredeemably broken.  So there is a mess brewing here.)

> So the system I've settled with is storing both the originally supplied
> representation, *and* the lower cased punycode encoded version in a separate
> column for indexing/search. This seems really hackish to me though.

Well, see above.  The other way I'd do it is to store _both_ the
IDNA2003 punycode and the IDNA2008 A-label.  I hate this because the
lookup is insanely complicated.
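
To illustrate the two-key idea outside the database, here is a small
in-memory sketch assuming Python 3.  The list stands in for a table
with two indexed key columns; all names are hypothetical, and the
"unmapped" key is only an approximation of an IDNA2008 A-label (NFC
plus per-label Punycode, no validity checks).

```python
import unicodedata

def key_idna2003(name):
    # Built-in "idna" codec is IDNA2003: ß is mapped to ss before encoding.
    return name.encode("idna").decode("ascii").lower()

def key_unmapped(name):
    # Assumption: stand-in for an IDNA2008 A-label, with no mapping step.
    labels = []
    for label in unicodedata.normalize("NFC", name).split("."):
        if not label.isascii():
            label = "xn--" + label.encode("punycode").decode("ascii")
        labels.append(label.lower())
    return ".".join(labels)

rows = []  # each row: (original, idna2003 key, unmapped key)

def store(hostname):
    rows.append((hostname, key_idna2003(hostname), key_unmapped(hostname)))

def find(hostname):
    # Report every stored row that matches under at least one key.
    wanted = {"idna2003": key_idna2003(hostname),
              "unmapped": key_unmapped(hostname)}
    hits = []
    for original, k2003, k2008 in rows:
        stored = {"idna2003": k2003, "unmapped": k2008}
        matched = [k for k in ("idna2003", "unmapped") if stored[k] == wanted[k]]
        if matched:
            hits.append((original, matched))
    return hits

store("nissan.com")
store("ni\u00dfan.com")
print(find("nissan.com"))
```

Searching for "nissan.com" finds the ß row only through the IDNA2003
key, which is the behaviour that goes away once that lookup is dropped.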

> It seems to me that Postgres would benefit from a native hostname type and/or
> a pair of punycode encode/decode functions.

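A pair won't work.  Encoding isn't a single function (IDNA2003 and
IDNA2008 give different answers for the same input), and under
IDNA2003 decoding can't get you back what the user typed.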

> And perhaps even a simple unicode case folding function.

Unicode case folding is _way_ more complicated than you seem to be
thinking here, and importantly has some nasty edge conditions.  For
instance, the natural uppercase of the lowercase sharp s, ß, that
we've been talking about now turns out to be capital sharp S, ẞ
(that's U+1E9E in case you can't see it).  That is not, however, the
uppercase you get, because the uppercase mapping in earlier versions
of Unicode (which didn't have U+1E9E) was SS, and the stability rules
require that things not break across versions.  (There are other
problems like this.  For instance, the upper case of é in French is
officially E, and in Québecois is officially É.  And then there's the
Turkic dotless-i and dotted-i rules.)
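
Some of these edge cases are easy to reproduce.  The sketch below
assumes Python 3, whose str.upper() and str.lower() apply the default,
locale-independent Unicode case mappings; the helper name is
hypothetical.

```python
def round_trips(s):
    # True if upper-casing and then lower-casing returns the same string.
    return s.upper().lower() == s

sharp_s = "\u00df"          # ß
capital_sharp_s = "\u1e9e"  # ẞ

print(sharp_s.upper())          # SS, not the capital sharp S
print(capital_sharp_s.lower())  # ß
print(round_trips(sharp_s))     # False: ß -> SS -> ss
print(round_trips("\u00e9"))    # True:  é -> É -> é

# Turkic i: the default mappings know nothing about locale.
print("I".lower() == "i")                  # True, wrong for Turkish (wants ı)
print("\u0130".lower() == "i\u0307")       # True: İ becomes i + combining dot
print("\u0131".upper() == "I")             # True: dotless ı becomes plain I
```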

To do case folding really according to what people expect, you need to
be locale sensitive.  Since the DNS has no locale information, we
couldn't do that in IDNA, so the answer the first time was naïve case
folding (along the lines of the Unicode standard caseFold file), and
the second time to leave the case folding to the user agent, on the
principle that it has a hope of knowing the locale.

>  With the end result that these return TRUE:
> 
>   unicode_case_fold('ß') = 'ss'

But that's false.  What's really going on there is that the Unicode
case fold of ß is SS, and that case folded again is ss.  

> A native type would also be able to apply suitable constraints, e.g. a maximum
> length of 253 octets on a punycode-encoded trailing-dot-excluded hostname, a
> limit of 1-63 octets on a punycode encoded label, no leading or trailing hyphens
> on a label, etc.

You seem to want a bunch of label constraints, not all of which are
related to IDNA. I think it would be better to break these up into a
small number of functions.  As it happens, I have a colleague at Dyn
who I think has some need of some of this too, and so it might be
worth spinning up a small project to try to get generic functions:
to_idna2003, to_idna2008, check_ldh, split_labels, and so on.  If this
seems possibly interesting for collaboration, let me know & I'll try
to put together the relevant people.
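
As a starting point for discussion, here is a sketch of what a
check_ldh-style function might look like, assuming Python 3.  It
covers only the constraints quoted above (1-63 octets per label,
letters, digits and hyphens, no leading or trailing hyphen, at most
253 octets for the name without the trailing dot) and is applied to
the already-encoded ASCII form.  The check_hostname wrapper is a
hypothetical name.

```python
import re

# One LDH label: 1-63 characters, letters/digits/hyphens, and no hyphen
# at either end.
_LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def check_ldh(label):
    return _LDH.fullmatch(label) is not None

def check_hostname(ascii_name):
    # Ignore one trailing root dot, then enforce the overall length limit.
    name = ascii_name[:-1] if ascii_name.endswith(".") else ascii_name
    if not 1 <= len(name) <= 253:
        return False
    return all(check_ldh(label) for label in name.split("."))

print(check_hostname("xn--tet-6ka.xn--xmpl-loa7ai.com"))  # True
print(check_hostname("-bad.example.com"))                 # False
```

Keeping check_ldh separate from any IDNA step matches the point above
that not all of these constraints are related to IDNA.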

Best regards,

A

-- 
Andrew Sullivan
ajs@xxxxxxxxxxxxxxx
