[resending, because somehow this got routed through my work address the
first time.]

Hi,

I didn't have time to write a short note, so I wrote a long one instead.
Sorry.

On Mon, Dec 29, 2014 at 10:36:42PM +0000, Mike Cardwell wrote:
> can't just encode it with punycode and then store the ascii result. For example,
> these two are the same hostnames thanks to unicode case folding [1]:
>
> tesst.ëxämplé.com
> teßt.ëxämplé.com

Well, in IDNA2003 they're the same. In IDNA2008 (RFC 5890 and suite),
they're not the same. In UTS46, they're kind of the same, because
pre-lookup processing maps one of them to the other (it depends which
mode you're in which way the mapping goes, which is just fantastic
because you can't tell at the server which mode the client is in. IDNA
is an unholy mess); but the lookup is still done using the IDNA2008
rules, approximately.

> They both encode in punycode to the same thing:
>
> xn--tesst.xmpl.com-cib7f2a

Under no circumstances should they encode to that. IDNA, either 2003 or
2008, is label by label, not for entire domain names. (The labels are
the things between the dots in a presentation-format FQDN.) So, under
IDNA2003, you should get tesst.xn--xmpl-loa7ai.com for both of them.
Under IDNA2008, you should get that also for the first of them, and
xn--tet-6ka.xn--xmpl-loa7ai.com for the second.

Most clients, however, are probably going to want to do something
UTS46ish, but there's no way to guarantee that the IDNA2003 and IDNA2008
labels go together (this is called "variants", and you do not want to
know how much horror that has caused). Worse, of course, you also don't
know that what you have isn't just raw UTF-8 in the label, but let's
boil one ocean at a time.

> Don't believe me, then try visiting any domain with two s's in, whilst replacing
> the s's with ß's. E.g:
>
> ericßon.com
> nißan.com
> americanexpreß.com

This depends entirely on which version of IDNA you're using. Many
browsers right now officially do IDNA2003.
Unfortunately, they don't _actually_ do that either because IDNA2003 is
nominally nailed to Unicode version 3.2, and there's approximately zero
chance that a running computer on the Internet these days is using such
an old Unicode version.

IDNA2008, by the way, wasn't something we did (I am one of the people
you can blame for this) for fun. The very problem you're noting in
IDNA2003 is one of the things we were trying to fix. Under IDNA2003, the
Unicode-Punycode-Unicode round trip could lose data. Under IDNA2008,
this is fixed: every A-label (the xn--Punycodehere version) corresponds
to exactly one U-label (the Unicode representation) and conversely. (It
follows from this that lots of Unicode strings aren't U-labels, because
there are a lot of rules about what can be a U-label. For instance,
capital letters aren't allowed, because they're not stable under
caseFold. This is all in the IDNA2008 RFCs, but they're not an easy
read. We tried.)

> So if I pull out "xn--tesst.xmpl.com-cib7f2a" from the database, I've no idea
> which of those two hostnames was the original representation.

None. What you need to do is split the name on label boundaries (which
is hard, because believe it or not "." is a valid character in the DNS,
but splitting on the "." character is probably as good as you can do
here. But look for escaped ones). Then you can check for validity under
IDNA2008 and IDNA2003, and then you can run it through Punycode. ICANN,
for whatever it's worth, is using IDNA2008 rules for top-level domains
and as part of its IDNA guidelines, so over time the actually
_registered_ names are going to be either IDNA2008 or, at worst, UTS46.

> The trouble is, if I store the unicode representation of a hostname instead,
> then when I run queries with conditions like:
>
> WHERE hostname='nißan.com'
>
> that wont pull out rows where hostname='nissan.com'.

Right. If I were doing this, I think I'd probably create two functions,
one to do IDNA2003 and one to do IDNA2008.
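
Very roughly, and only as a sketch, the split-then-encode step might
look like the following. I'm using Python purely for illustration; the
function names are made up for this note. Python's standard library
"idna" codec implements IDNA2003 only, so the second function is NOT an
IDNA2008 implementation: it just Punycode-encodes a label as-is, with no
mapping and no validity checking, to show what the A-label looks like
when the ß is kept. Real IDNA2008 processing needs the validity rules
from the RFCs (or a library that implements them).

```python
def split_labels(name):
    """Split a presentation-format name on unescaped dots.

    A backslash escapes the next character, so "a\\.b.com" yields the
    two labels "a\\.b" and "com". A trailing dot yields a final empty
    label.
    """
    labels, current, i = [], [], 0
    while i < len(name):
        ch = name[i]
        if ch == "\\" and i + 1 < len(name):
            current.append(name[i:i + 2])  # keep the escape sequence intact
            i += 2
            continue
        if ch == ".":
            labels.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    labels.append("".join(current))
    return labels


def to_idna2003(name):
    """IDNA2003 ToASCII, label by label, via the stdlib 'idna' codec."""
    return ".".join(
        label.encode("idna").decode("ascii") if label else ""
        for label in split_labels(name)
    )


def raw_alabel(label):
    """Punycode one label unchanged (no mapping, no IDNA2008 validation)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


if __name__ == "__main__":
    print(to_idna2003("te\u00dft.\u00ebx\u00e4mpl\u00e9.com"))
    print(raw_alabel("te\u00dft"))
```

With the two names from the original message, to_idna2003 gives
tesst.xn--xmpl-loa7ai.com for both, and raw_alabel("teßt") gives
xn--tet-6ka, which is the first label of the IDNA2008 form above.
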
Then I'd put a functional index on the column for both cases.
Eventually, you'll be able to drop the IDNA2003 lookup because
everything will conform to IDNA2008. (Note that the WHATWG, which the
W3C is going to listen to but which is not a W3C WG, appears to be
trying to undo that; but IDNA2003 is irredeemably broken. So there is a
mess here brewing.)

> So the system I've settled with is storing both the originally supplied
> representation, *and* the lower cased punycode encoded version in a separate
> column for indexing/search. This seems really hackish to me though.

Well, see above. The other way I'd do it is to store _both_ the IDNA2003
punycode and also the IDNA2008 A-label. The reason I hate this is that
the lookup is insanely complicated.

> It seems to me that Postgres would benefit from a native hostname type and/or
> a pair of punycode encode/decode functions.

A pair won't work.

> And perhaps even a simple unicode case folding function.

Unicode case folding is _way_ more complicated than you seem to be
thinking here, and importantly has some nasty edge conditions. For
instance, the natural uppercase of the lowercase sharp s, ß, that we've
been talking about now turns out to be capital sharp S, ẞ (that's U+1E9E
in case you can't see it). That is not, however, the uppercase, because
the rule in earlier versions of Unicode (which didn't have U+1E9E)
mapped it to SS, and the stability rules require that things not break
across versions. (There are other problems like this. For instance, the
upper case of é in French is officially E, and in Québecois is
officially É. And then there's the Turkic dotless-i and dotted-i rules.)
To do case folding really according to what people expect, you need to
be locale sensitive.
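
To make those edge cases concrete, here is a small sketch using
Python's built-in string methods, which apply Unicode's default,
locale-independent case mappings. Python is just my choice for
illustration, and the helper name is made up. Note that none of this
knows anything about French, Québec or Turkish conventions, which is
exactly the problem.

```python
def upper_then_lower(s):
    """Apply the default uppercase mapping, then the default lowercase one."""
    return s.upper().lower()


examples = {
    "sharp s": "\u00df",           # ß
    "capital sharp S": "\u1e9e",   # ẞ
    "e acute": "\u00e9",           # é
    "dotless i": "\u0131",         # ı
    "capital dotted I": "\u0130",  # İ
}

if __name__ == "__main__":
    for name, ch in examples.items():
        print(
            f"{name}: {ch!r} upper={ch.upper()!r} lower={ch.lower()!r} "
            f"round-trip={upper_then_lower(ch)!r}"
        )
```

The sharp s uppercases to SS and comes back as ss, the capital sharp S
lowercases to ß, and the dotless i comes back from the round trip as a
plain i, so the round trip is lossy in more than one place.
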
Since the DNS has no locale information, we couldn't do that in IDNA, so
the answer the first time was naïve case folding (along the lines of the
Unicode standard caseFold file), and the second time to leave the case
folding to the user agent, on the principle that it has a hope of
knowing the locale.

> With the end result that these return TRUE:
>
> unicode_case_fold('ß') = 'ss'

But that's false. What's really going on there is that the Unicode case
fold of ß is SS, and that case folded again is ss.

> A native type would also be able to apply suitable constraints, e.g a maximum
> length of 253 octets on a punycode-encoded trailing-dot-excluded hostname, a
> limit of 1-63 octets on a punycode encoded label, no leading or trailing hyphens
> on a label, etc.

You seem to want a bunch of label constraints, not all of which are
related to IDNA. I think it would be better to break these up into a
small number of functions. As it happens, I have a colleague at Dyn who
I think has some need of some of this too, and so it might be worth
spinning up a small project to try to get generic functions:
to_idna2003, to_idna2008, check_ldh, split_labels, and so on. If this
seems possibly interesting for collaboration, let me know & I'll try to
put together the relevant people.

Best regards,

A

-- 
Andrew Sullivan
ajs@xxxxxxxxxxxxxxx

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
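
P.S. As a rough idea of what the non-IDNA label constraints quoted above
could look like as a separate function, here is a sketch in Python (for
illustration only). check_ldh is one of the names floated above;
check_hostname is a hypothetical wrapper I've added, and both assume the
name has already been converted to its ASCII form. The limits are the
ones from the quoted message: at most 253 octets excluding any trailing
dot, 1-63 octets per label, and no leading or trailing hyphen.

```python
import re

# One letter or digit, optionally followed by up to 61 letters, digits or
# hyphens and a final letter or digit: 1-63 octets in total.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def check_ldh(label):
    """True if label is a 1-63 octet letter-digit-hyphen label."""
    return _LDH_LABEL.fullmatch(label) is not None


def check_hostname(ascii_name):
    """True if every label passes check_ldh and the name is <= 253 octets.

    A single trailing dot is ignored for the length check.
    """
    name = ascii_name[:-1] if ascii_name.endswith(".") else ascii_name
    if not name or len(name) > 253:
        return False
    return all(check_ldh(label) for label in name.split("."))


if __name__ == "__main__":
    for candidate in ("tesst.xn--xmpl-loa7ai.com", "-bad.example", "te\u00dft.com"):
        print(candidate, check_hostname(candidate))
```

Keeping this apart from the IDNA conversion means it can be applied to
either the IDNA2003 or the IDNA2008 output without caring which one
produced it.
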