At Wed, 21 Dec 2011 17:43:23 -0800, Terry Manderson wrote:
>
> Apologies for my lack of attention to date on this topic, so speaking only
> for myself here.

Similar apologies for not having answered this more promptly. Somehow we missed seeing this until our AD asked us about it.

Please see draft-ietf-sidr-rpki-rtr-25, just posted, which we hope addresses most of your concerns (there are a few points on which I think we're just going to have to agree to disagree).

> Starting with the document structure, I see no reference to a set of
> requirements. The introduction is rather vague, and if anywhere that is
> where I would expect to see such a requirements description. This means for
> the rest of document I found myself asking "why" on many levels.

Motivation was discussed at some length in the SIDR WG. We hadn't thought it necessary to discuss this in the draft, but -25 adds a bit of text on this.

In brief, though, for ietf@xxxxxxxx readers who weren't tracking the WG: the primary goal of this protocol was to make it possible to run BGP origin authentication based on RPKI data on currently shipping router hardware, rather than having to wait for bigger processors, crypto accelerators, etcetera. So we wanted a simple protocol that would let us do all the heavy lifting (RPKI data collection, certificate checking, computation of deltas from previous data, etc) somewhere off the router, and feed the router only what it needs.

> When I got to the end of the document I felt that the protocol borders on a
> wheel re-invention exercise. When you think about a router simply being a
> client to a cache that is providing RIB access tokens for a route using a
> mechanism that is a secure, stable, scalable, known (by both vendors and
> operators), and is extensible, I'm more likely to swing to RADIUS in doing
> such a service with nicely structured AV-Pairs and sane timers for
> reauth/retry etc. Even the SME's know radius for their WPA enterprise kit.

RADIUS doesn't have a bulk transfer operation, and bulk transfer of data is the main task of this protocol, particularly at start-up.

You are certainly entitled to your opinion, but it comes a bit late. This work was done in the public view, with regular progress reports to the SIDR WG, and we have multiple interoperable implementations including several of the major router vendors. So, with all due respect, I don't think the folks who have put work into this will be all that interested in abandoning running code at this point.

> Glossary:
>
> Global RPKI:
> I disagree with this definition for two reasons. 1) I'm not aware of a
> unified definition for 'distributed system' so this is all rather vague.

The term has been used to describe DNS for decades. Also see:

http://en.wikipedia.org/wiki/Distributed_computing

> Perhaps you could say 'published at a disparate set of systems'.

I don't find that any clearer. Readers who can't understand the words "distributed set" aren't likely to understand "disparate set" either.

> 2) Limiting
> the servers to be "at" the "IANA, RIRs, NIRs, and ISPs" is also premature.
> It's not clear to me that these entities will run their own repositories,
> nor are they going to be the only repository operators in the lifecycle of
> the RPKI.

This is essentially the same list as appears in section 1.1 of draft-ietf-sidr-arch, with the term "LIR" replaced by "ISP". I suppose we could add "or other service providers".

> Cache:
> The words surrounding the fetch/refresh mechanisms of the RPKI is limiting.
> Both draft-ietf-sidr-repos-struct and draft-ietf-sidr-res-certs allow for
> other (future) retrieval mechanisms as defined by the repository operator
> beyond RSYNC (loosely documented in RFC5781).

Terry, you've made it quite clear that you disagree with the SIDR WG's decision to make rsync the mandatory-to-implement RPKI retrieval protocol, but you lost that argument a long time ago, and I fail to see the point of bringing it up here yet again.

> Last sentence. "Trusting this cache further is a matter between the provider
> of the cache and a relying party". In my mind the Relying Party was the one
> that did the RPKI validation - would this not be better stated as "Trusting
> this cache further is a matter between the provider of the cache and the
> router operator".

If a router is making decisions based on data given to it by a server, the router is the relying party in that relationship. That the server in question was itself the relying party in another relationship does not change this.

The picture here is not all that different from the way that some vendors have chosen to implement DNSSEC. It's a two-tier security relationship: an end-to-end relationship between the publisher of signed objects and the validator of those signed objects, then a separate security relationship between the entity that validated the signed objects and the end entity that actually uses the data.

> Deployment Structure:
>
> Why repeat the definition of "Global RPKI"? It's superfluous.

Because it's not a definition? I agree that the text here is similar to the definition, but this section is trying to describe the roles in the system.

> Local Cache: Again. 'Relying party' seems to be borrowed from the
> CA/identity world. Unless you redefine that term here it seems as if the
> "router" is making RPKI validation decisions. Which it is not. The router is
> acting more like a NAS (See Radius, 2865) when talking to a local cache.
>
> The definition of "routers" seems to get this right - eg "a client of the
> cache".

See above. "Relying party" is a security relationship term, not just a PKI term.

> Operational Overview
>
> when you first use "ROA", please expand the TLA, and provide a reference.

Done.

> Serial Query
>
> I don't remember seeing a recommendation for how often a client (router)
> sends a serial query. Is there a Min/Max? Surely doing it every second would
> be excessive..

Maximum is covered in section 6.2: the router must send a Serial or Reset Query no less frequently than once per hour.

Minimum is a good question. We had been assuming that, as this is an in-POP relationship with cache and router operated by the same party, there would likely be a knob in the router (router guys live for knobs) and setting it would be a matter of local policy. If you want your router to beat up your cache server every minute, who am I to stop you?

We needed to set a maximum because that affects the architecture of the cache (how long does it need to hold onto old data -- given the potential size of the data sets involved, one might implement the cache very differently if one needed to hold old data for a week rather than an hour).

> IPv4 Prefix:
>
> "and nothing prohibits the existence of two identical
> route: or route6: objects in the IRR."
>
> Why even mention the IRR here? It just doesn't seem at all relevant. (and
> isn't defined)

Good catch. Done.

> " IPvX PDUs" expand to IPv4 or IPv6. Globing into one is a misdirection
> under a heading of 'IPv4 Prefix'
>
> IPv6 Prefix
>
> Some text here to say that the IPv6 data structure follows the same
> semantics as the IPv4 data structure would be good.. or alternatively
> restructure the document to Semantics, then describe the IPv4 and IPv6 data
> structures as subheadings to Prefix PDUs.

Done.

> Error Report
>
> What is "excessive length" of a PDU? at what point do you say "o.k, now I
> can truncate".

Too long to be any valid PDU other than an Error Report. Done.

> Fields of a PDU
>
> For all types, instead of using "ordinal" can you use the exact description
> of the number? eg unsigned integer?
> For me I always relate ordinals to set theory.

Done.

> PDU type, the e,g is incomplete shouldn't it be "IPv4 Prefix = 4" with a
> forward reference to the IANA Considerations section?

I think this is a matter of stylistic preference.

> Serial Number. "for example via rcynic", Is not defined and implementation
> specific!

Please read the words "for example". I suppose we could add a reference, but the last time we did that somebody objected to having a reference pointing to the source code for a particular implementation.

> and there is a typo "completing an rigorously validated"..while
> there, consider why you use the term 'rigorously'..

Sigh. Next time, please be explicit about the typo you're seeing; our eyes repeatedly bounced off the "an" here until after we'd posted version -25. It's not worth yet another rev just to fix that.

> are there situations when a validation is less rigorous? If so
> explain.

I suspect that my co-author was trying to say that one can't just retrieve the data, pull the ASNs and prefixes out of the ROAs, and feed them into the router; one has to do the RPKI validation first. I guess we can remove the word if it offends you, but it seems harmless.

> Session ID
>
> What is the risk of a cache server starting/restarting with the same session
> ID and serial number as before, but with different cache contents? Is this
> an entropy concern? Just thinking of a potential scenario where a router is
> cache-wedged. Is this at all probable? and why not - some words here to
> cover this would be good.

We added several paragraphs on exactly this topic sometime around IETF Last Call; I suspect the version you reviewed did not have that text. I think we've addressed this point, but please check the current text and let us know if there's a further issue here.

> Flags
>
> Can you reword the binary choice here? Do you actually need to delve into
> 'right to announce'? This is really about RIB entry behaviors yeah?

The semantics here are closely related to ROAs, which, as you no doubt recall, are Route Origin Authorizations, so the text here follows that model. With all due respect, I do not think that a discussion of RIB entry behavior here would be simpler.

> Expand "IPvX".

Done.

> Start or Restart:
>
> I think the terms in when a router needs to send a serial query or a reset
> query need to be tighter. Saying MAY here is too loose. I would much prefer
> to see a structure where if the router does not have a recorded serial for a
> cache from a previous session, the router MUST send a reset query. Logically
> you assume that to be the case, so be specific.

I think this is a stylistic matter again. The router MAY do two things here, one of which is only applicable if it has data from a previous broken session. The only real difference I see here between the current formulation and the MUST formulation you prefer is that, as currently written, the router could choose not to send anything at all initially; this option doesn't seem particularly useful, so I don't mind removing it, but neither do I see the difference between the current text and your suggested change as a big deal.

> Thereafter the router MAY send a reset query, and SHOULD send a serial
> query. I suspect this is what the vendors (who have chimed in on the list)
> have coded.
>
> This then corroborates section 4 where you suggest the router only send
> serial queries for efficiency.

Section 6.2 already says that the typical exchange is for the cache to send a Serial Notify, in the expectation that the router will schedule an immediate Serial Query.
We didn't make it any stronger than that because the folks implementing the router side of this expressed concern at the notion that the cache could tell them to do something (read: they understand that the notification mechanism will help speed convergence, but they're worried that the dinky CPUs they're stuck with in some of the relevant hardware will be swamped if they try too hard, which is why routers are allowed to ignore notifications and caches are rate-limited in sending them).

> Transport:
>
> MiTM is Man in the middle as I and many others know it. 'Monkey/piggy/pickle
> in the middle' is a child's ball game.

Monkey-in-the-middle is a common non-sexist variant of this term. Welcome to the 21st century.

> " Therefore, as of this document, there is no mandatory to
> implement transport which provides authentication and integrity
> protection."
>
> if this is the case.. then why? what is the gain?

OK, this is the elephant in the living room.

The basic problem is that the implementers and the IETF live on different planets. As discussed in section 7, it is pretty much impossible to find any channel security technology which is implemented on conventional servers (Linux, BSD, ...), is implemented on routers, and is acceptable to the IETF security folks.

As further discussed in Section 7, the long term plan is TCP-AO, and there are people out there implementing that now, but it's going to be a few years before that's usable. In the meantime, we're stuck with ad-hoc pairings of what particular platforms support.

Some routers support SSH clients, some don't. Some server platforms support TCP-MD5, some don't, the IETF doesn't like TCP-MD5, and at least one that sort-of-supports TCP-MD5 only supports it for incoming connections. Some routers support IPsec transit but can't terminate it. And so forth. It's a horrible mess.

So, as discussed at some length in the SIDR WG, after talking both to people who knew the current router and server platforms and to the SIDR WG's security advisor, we came up with the compromise you see in the draft: the path forward is TCP-AO, but since we don't have that yet, there's this raft of other channel security mechanisms one is allowed to use for now. We expect to deprecate everything but TCP-AO once TCP-AO is readily available. Nobody is happy with this, but it's the least bad compromise we could find between what the IETF would prefer and reality in the field.

> why not then make the router fetch the signed objects and do the
> validation internal - this again seems to be the 'missing
> requirements' problem.

See "currently shipping router hardware", above.

> SSH Transport
>
> State up front that you MUST use SSHv2. (instead hinting in the third
> paragraph)

Done.

> TLS Transport
> "Man in The Middle (MiTM)" please.

Above.

> Router Cache setup
>
> "When a more preferred cache becomes available, if resources allow, it
> would be prudent for the client to start fetching from that cache."
>
> How does the client (I assume router) know when to do this as cache's are
> not synchronized?? How does a router tell if any particular cache has more
> current data over another cache? what if two caches contradict each other?

The document repeatedly states that the router has an ordered preference list of the caches it uses. The text you quote here doesn't say "has more current data", it says "becomes available", ie, it stops rejecting connection attempts, signalling errors, or otherwise failing to be useful.

> Error codes
>
> 6: Withdrawal of Unknown Record (fatal), why drop the session? (which
> presumably causes a restart) to a cache, assuming the cache is corrupt,
> which will then send another Unknown Record, which is fatal... (repeat)??
>
> Why not mark the cache as corrupt at the client?

This is one of several loss-of-synchronization problems.
The assumption is that the router may have (somehow) lost synchronization with the cache. We don't really know which party is confused at this point; all we know is that the session itself is no longer useful because the router and cache are not communicating clearly. So the router's data isn't necessarily corrupt.

The router won't necessarily restart with this cache right away either. It has several options: it might try another cache, it might switch to another set of data it has already loaded, or it might try a reset query to this cache.

> Security Considerations:
>
> Transport Security. There are multiple valid options for a root trust anchor
> including the structure from the IAB aligning it to the IANA. Perhaps
> instead of saying " the IANA root trust anchor" say "Global RPKI root trust
> anchor". Otherwise you might accidently find your validated cache only
> covers unallocated and reserved blocks.

I think you're saying that using the term IANA here is politically incorrect.

Thanks for the review!