John G. Scudder wrote (on Fri 31-Aug-2012 at 17:24 +0100):
> By the way, since this is now in *IETF* last call, if you want your
> comments to be considered you should send them to ietf@xxxxxxxx.
> Feel free to cc IDR and GROW if you like.

OK. Message previously posted to idr@xxxxxxxx follows:

------------------------------------------------------------------------------

In trying to classify Errors with BGP-4 UPDATE Messages, I think it would be useful to distinguish between the form of an error and its severity, that is, how BGP should respond to it.

It seems to me that there are four severities/responses:

1) "Critical Error" -> drop/restart the session or the AFI/SAFI.

   The overall response then depends on how gracefully the session drop and/or restart can be handled.

2) "Serious Error" -> do something with the NLRI, short of dropping the session.

   The "treat-as-withdraw" mechanism is mentioned. The requirements obviously do not wish to specify mechanisms, but I think they should address what outcome is expected if errors in an individual UPDATE message are to be confined to that message. I think that means:

   * it must be possible to identify all NLRI that the message could be carrying.

   * whatever is done with those NLRI must reflect the fact that the recipient has an incomplete, possibly empty, set of attributes for them.

3) "Ignorable Error" -> process the UPDATE message as if the ignored attributes had never existed.

   Some errors in some trivial attributes may be ignorable; the requirements could set out the criteria for deeming an attribute trivial. Some errors in Optional Transitive attributes may be dealt with by ignoring the attribute altogether. The requirements mention this, but do not specify criteria for being ignorable.

4) "Recoverable Error" -> process the UPDATE message after its errors have been "patched up".

   For example, draft-ietf-idr-error-handling suggests that invalid Attribute Flags may simply be overwritten with the expected value.
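The four severities can be made concrete with a toy model of what each one does to the routes learned from a peer. This is a minimal sketch, not a proposed mechanism, and it assumes: a per-peer table `rib` mapping prefix -> {attribute type: value}; that the caller has already classified the error; and that dropping the session discards all of the peer's routes (no graceful restart). `Severity`, `handle_update` and all its parameters are hypothetical names. Escalating a Serious error to Critical when not every NLRI can be identified is one reading of the first bullet under 2), not something the requirements state.

```python
from enum import Enum
from typing import Dict, Iterable, Optional


class Severity(Enum):
    CRITICAL = 1     # drop/restart the session (or the AFI/SAFI)
    SERIOUS = 2      # act on the NLRI, e.g. treat-as-withdraw
    IGNORABLE = 3    # process as if the bad attribute had never existed
    RECOVERABLE = 4  # process after "patching up" the bad attribute


def handle_update(rib: Dict[str, dict], nlri: Iterable[str], attrs: dict,
                  severity: Severity, bad_type: Optional[int] = None,
                  patched: object = None, all_nlri_known: bool = True) -> bool:
    """Apply one (possibly broken) UPDATE to `rib`, the hypothetical table of
    routes learned from one peer (prefix -> {attribute type: value}).

    Returns False if the session must be dropped, True otherwise."""
    if severity is Severity.SERIOUS and not all_nlri_known:
        # Assumption: confining the error to this UPDATE is only safe if
        # every NLRI it could carry has been identified; otherwise play safe.
        severity = Severity.CRITICAL
    if severity is Severity.CRITICAL:
        rib.clear()  # assumes no graceful restart: the peer's routes go too
        return False
    if severity is Severity.SERIOUS:
        # The attributes are incomplete, so withdraw rather than install.
        for prefix in nlri:
            rib.pop(prefix, None)
        return True
    attrs = dict(attrs)  # do not modify the caller's copy
    if severity is Severity.IGNORABLE:
        attrs.pop(bad_type, None)  # as if it had never been sent
    else:  # RECOVERABLE: substitute the patched-up value
        attrs[bad_type] = patched
    for prefix in nlri:
        rib[prefix] = attrs
    return True
```

For instance, an Ignorable error in attribute type 6 installs the route without that attribute, while a Serious error for the same prefix later removes it.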
I would then divide the forms of error into (1) "framing" and (2) "content" (or "semantic").

A BGP UPDATE message has three levels of framing:

  * Level 1 -- the 16-octet "Marker" + Message Length + Withdrawn Routes Length + Total Path Attributes Length.

    If the Message Length is broken, it is extremely likely that the "Marker" on the next message will be invalid.

  * Level 2(a) -- the Withdrawn Routes.

    Each prefix must have a valid prefix length, and the last must run exactly to the end of this part of the message.

  * Level 2(b) -- the Attributes.

    Each attribute must be correctly framed, and the last one must run exactly to the end of the attribute part of the message.

  * Level 2(c) -- the Network Layer Reachability Information.

    Same as 2(a).

  * Level 3 -- various Attributes.

    Some attributes have internal framing.

So far, so obvious.

To judge whether an individual attribute is properly framed, we need to consider the red tape:

  * the Flags octet has a limited set of valid values, depending on the Type.

  * the Type may be more or less anything, but repeats are not valid.

  * the Length is constrained for some Types.

There is some redundancy here, more for known types than for unknown ones, which helps.

The Total Path Attributes Length is, effectively, a checksum for all the Lengths of all the Attributes. It would be possible to specify that a set of attributes should be deemed correctly framed solely on the basis of passing that test. However, my feeling is that all the available redundancy (such as it is) should be used to minimise the possibility of accepting a broken set of attributes -- *particularly* where an error is going to be treated as Ignorable.

Once the attributes are correctly framed, one can consider their content. Wherever the line between framing and content is drawn, I think it helps to be clear about the distinction: "framing" errors affect the attribute and the attributes around it, whereas "content" errors affect only the attribute itself.
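The framing levels and the red tape can be sketched as a strict framing check that uses every redundant field rather than the Total Path Attributes Length alone. This is a minimal sketch under several assumptions: `msg` is exactly one message as delimited by the stream reader; only IPv4 prefixes are handled; only five RFC 4271 attribute types get Flags/Length checks (other types are checked only for framing and repeats); Level 3 internal framing (e.g. AS_PATH segments) and the 4096-octet maximum message size are not checked. All function names are hypothetical.

```python
import struct
from typing import Optional

MARKER = b"\xff" * 16
UPDATE = 2  # BGP message Type code for UPDATE

# Red tape for a few RFC 4271 attributes: type -> (expected Optional,
# Transitive and Partial bits, i.e. flags & 0xE0; fixed Length).
KNOWN = {
    1: (0x40, 1),  # ORIGIN: well-known
    3: (0x40, 4),  # NEXT_HOP: well-known
    4: (0x80, 4),  # MULTI_EXIT_DISC: optional non-transitive
    5: (0x40, 4),  # LOCAL_PREF: well-known
    6: (0x40, 0),  # ATOMIC_AGGREGATE: well-known
}


def check_prefixes(data: bytes) -> Optional[str]:
    """Levels 2(a)/2(c): valid IPv4 prefix lengths, last one ending exactly."""
    i = 0
    while i < len(data):
        bits = data[i]
        if bits > 32:
            return f"prefix length {bits} at offset {i}"
        i += 1 + (bits + 7) // 8  # length octet + significant prefix octets
    return None if i == len(data) else "last prefix overruns its field"


def check_attributes(data: bytes) -> Optional[str]:
    """Level 2(b): each attribute framed, no repeats, last one ending exactly."""
    seen = set()
    i = 0
    while i < len(data):
        flags = data[i]
        hdr = 4 if flags & 0x10 else 3  # Extended Length bit -> 2-octet Length
        if i + hdr > len(data):
            return f"truncated attribute header at offset {i}"
        typ = data[i + 1]
        if hdr == 4:
            (length,) = struct.unpack_from("!H", data, i + 2)
        else:
            length = data[i + 2]
        if typ in seen:
            return f"repeated attribute type {typ}"
        seen.add(typ)
        if typ in KNOWN:  # use the extra redundancy available for known types
            want_flags, want_len = KNOWN[typ]
            if flags & 0xE0 != want_flags:
                return f"bad flags 0x{flags:02x} for type {typ}"
            if length != want_len:
                return f"bad length {length} for type {typ}"
        i += hdr + length
        if i > len(data):
            return f"attribute type {typ} overruns the attributes"
    return None


def check_update(msg: bytes) -> Optional[str]:
    """Level 1, then Level 2; returns None if the framing looks sound."""
    if len(msg) < 23 or msg[:16] != MARKER:
        return "bad Marker or short message"
    length, mtype = struct.unpack_from("!HB", msg, 16)
    if length != len(msg) or mtype != UPDATE:
        return "bad Message Length or Type"
    (wlen,) = struct.unpack_from("!H", msg, 19)
    if 23 + wlen > length:
        return "Withdrawn Routes Length overruns message"
    (alen,) = struct.unpack_from("!H", msg, 21 + wlen)
    start = 23 + wlen
    if start + alen > length:
        return "Total Path Attributes Length overruns message"
    return (check_prefixes(msg[21:21 + wlen])
            or check_attributes(msg[start:start + alen])
            or check_prefixes(msg[start + alen:]))


def build_update(withdrawn=b"", attrs=b"", nlri=b"") -> bytes:
    """Test helper: frame the three parts into an UPDATE message."""
    body = (struct.pack("!H", len(withdrawn)) + withdrawn
            + struct.pack("!H", len(attrs)) + attrs + nlri)
    return MARKER + struct.pack("!HB", 19 + len(body), UPDATE) + body


if __name__ == "__main__":
    good = build_update(attrs=bytes([0x40, 1, 1, 0,              # ORIGIN = IGP
                                     0x40, 3, 4, 10, 0, 0, 1]),  # NEXT_HOP
                        nlri=bytes([8, 10]))                     # 10.0.0.0/8
    print(check_update(good))  # None
    bad = build_update(attrs=bytes([0x40, 6, 1, 0]))  # ATOMIC_AGGREGATE, Length 1
    print(check_update(bad))   # bad length 1 for type 6
```

Note that `bad` passes the Total Path Attributes Length test; only the per-type Length constraint catches it.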
The framing of an Optional Transitive attribute is a special case. If the parser recognises an Optional Transitive attribute, but its Length is not valid, what should the receiver do? If the sender did not understand the attribute, then the broken Length is a "content" issue. If the sender did understand it, then the broken Length is a "framing" issue. (It is a serious disappointment to me that the Partial bit does not help here. But even if it did, what if the sender made a mess of setting/clearing it!?)

In section 2.1.2 the draft specifies a number of "Semantic BGP Errors", which include many things that I would class as "framing" errors.

This is all pretty low-level stuff. I can hear an argument that the requirements document is not the place for this level of detail. However, without a more precise understanding of how broken attributes may be parsed, requirements for dealing with them are hard to specify and to interpret.

If NLRI were explicitly separate from the attributes, then whenever a set of attributes failed a strict "framing" check, "treat-as-withdraw" (or equivalent) could be applied reliably. This seems to me to be as safe as possible, short of dropping the session (which has its own safety issues).

With NLRI mixed up in the attributes (in MP_REACH_NLRI and MP_UNREACH_NLRI), either one plays safe and treats all attribute errors as Critical, or a much more detailed analysis of attribute parsing is required. What is the cost of missing some NLRI which were sent, but were obscured by some other broken attribute? What is the risk? What degree of brokenness in an attribute can be deemed not to invalidate the parsing of the attributes before and/or after it? Is that different for different attributes?

In order to contemplate classifying some attribute errors as "Ignorable" or "Recoverable", a more detailed analysis of attribute parsing is also required. An ATOMIC_AGGREGATE attribute is arguably trivial and Ignorable.
But is an ATOMIC_AGGREGATE attribute with a length of 421 (say), where 0 is expected, likely to be a momentary lapse of concentration at the sending end, or is it more likely to be a symptom of a badly broken set of attributes?

Chris