Rob Shakir wrote (on Fri 31-Aug-2012 at 19:00 +0100):
...
> Thanks for this detailed analysis. It is akin to something that Alton Lo and I worked through whilst defining the critical and semantic error types (and suggested inclusions).
>
> If you'll forgive me for responding to some particular points, I feel this might aid the discussion and positioning. I have added comments in-line marked [rjs].
>
> On 31 Aug 2012, at 09:02, Chris Hall wrote:
>
> > This is all pretty low level stuff. I can hear an argument that the requirements document is not the place for this level of detail. However, without a more precise understanding of how broken attributes may be parsed, requirements for how to deal with them are hard to specify and to interpret.
> [rjs]: What this draft intends to do is provide expectations, requirements and context for error handling in BGP-4, based on current deployments (and operator's experience). It also puts forwards requirements for how each type of error is reacted to in a broad sense. Essentially, where it came from is defining why amending the error handling behaviour is required, and providing a framework against which we can hang the different developments that are being discussed in IDR, such that they meet the operational challenges that come from amending this behaviour and form a complete set of solutions to meet the problem space.
>
> [rjs]: I think the error handling solutions draft (draft-ietf-idr-error-handling) should take the work that we have done within IDR and GROW in this draft and build the next level of detail, which I think that you've made a great start to. I would like to try and keep the requirements draft such that it can be referred to by both existing attributes, and future ones.

I agree that the Requirements want to be at as high a level as possible. Requirements based on experience are good, too.

The issue I have is that the draft seems to go into too much detail in some areas, and not enough in others.

Looking at Section 2.1.1: at a high level, a Critical error is one for which tearing down the session is unavoidable. That is deemed to be when the receiver cannot be sure that they have extracted from a broken UPDATE all the NLRI to which it refers (yes ?).

If you don't have the NLRI, you are stuck. If you have the NLRI, it may be possible to contain the error so that it affects only those NLRI (for example, by "treat-as-withdraw"). If you have the NLRI and enough (however defined) valid (however defined) attributes, it may be possible to update the NLRI, perhaps partially or temporarily.

[Actually, I'm not sure that is complete. As observed elsewhere in the draft, a given BGP session may be carrying a number of AFI/SAFI and possibly a number of separate VPNs. So for some purposes perhaps it's not necessary to be able to identify all the NLRI to which the UPDATE applies, only the AFI/SAFI or VPN to which the UPDATE applies. If so, then perhaps there is a Requirement for Critical errors to be handled on a per AFI/SAFI and per VPN basis -- that is, a "semi-Critical" error which tears down a self-contained part of the session. I'm not sure whether the definition of a Critical BGP Error allows for this or not, depending on how one interprets "Errors Parsing the NLRI".]

Staying with the high level requirements, if an error is to not be Critical, it appears one needs:

  (a) to be able to extract the NLRI (or AFI/SAFI etc ?) with some degree of certainty.

  (b) to have ways of dealing with that NLRI in ways that do not affect other NLRI learned in the session unnecessarily, and which do not cause unacceptable side effects.

  (c) to be able to extract some attributes with some degree of certainty, and be able to judge when proceeding to process an UPDATE with an incomplete or damaged set of attributes will yield sufficiently valid routes.

  (d) to have mechanisms to signal the problem so that the root cause(s) can be addressed and possibly to trigger other (e.g. operational) responses.

for which there could usefully be some discussion of "degree of certainty", "affect...unnecessarily", "unacceptable side effects", "sufficiently valid" and so on -- from a routing information and an operational perspective.

For example, if "treat as withdraw" is performed, but the receiver has not (for whatever reason) been able to extract all the NLRI sent, the receiver is left with some stale (possibly invalid) routes; that may be acceptable because the alternative (tearing down the session) is worse, or because other (operational/protocol) mechanisms will kick in to clean up... and so on.

The draft appears to say that anything which is not a Critical Error is a Semantic one -- or vice versa. This appears to assume that in the parsing of attributes, an error in one attribute does not affect any other attribute -- in particular, that an error in a not-NLRI attribute does not affect the ability to reliably (enough ?) extract the NLRI. To support that, I think the document would need to go down into the nuts and bolts of the parsing mechanics. (Section 2.1.2 starts with "Where a BGP message is correctly formed"... I assume that means that the Message, Withdrawn Routes and Total Path Attribute Lengths are consistent, and the Marker is 'all-ones' ?)

This is what I mean by both too much and too little detail. The high level requirement is to be able to extract NLRI (etc); whatever the issues in doing so are, they are perhaps at too great a level of detail for this document. On the other hand, the discussion of Semantic Errors does not go into sufficient detail to support the requirements which flow from (the apparently assumed) ability to parse attributes separately.

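To make the framing point concrete, here is a minimal sketch in Python of what "correctly formed" might be taken to mean, assuming the RFC 4271 UPDATE layout (all-ones Marker, then Length, Type, Withdrawn Routes Length, Total Path Attribute Length). The function names and the make_update helper are hypothetical, chosen only for illustration; they are not from the draft or from any implementation.

```python
import struct

MARKER = b"\xff" * 16   # RFC 4271: the Marker must be all ones
HEADER_LEN = 19         # Marker (16) + Length (2) + Type (1)
TYPE_UPDATE = 2
EXTENDED_LENGTH = 0x10  # attribute flag bit: length field is two octets


def split_update(msg: bytes):
    """Check outer framing; return (withdrawn, path_attrs, nlri) as bytes."""
    if len(msg) < HEADER_LEN + 4:
        raise ValueError("too short to be an UPDATE")
    length, msg_type = struct.unpack("!HB", msg[16:19])
    if msg[:16] != MARKER:
        raise ValueError("Marker is not all-ones")
    if msg_type != TYPE_UPDATE:
        raise ValueError("not an UPDATE message")
    if length != len(msg):
        raise ValueError("Length field disagrees with octets received")
    (wd_len,) = struct.unpack("!H", msg[19:21])
    pa_len_at = 21 + wd_len
    if pa_len_at + 2 > length:
        raise ValueError("Withdrawn Routes Length overruns the message")
    (pa_len,) = struct.unpack("!H", msg[pa_len_at:pa_len_at + 2])
    pa_at = pa_len_at + 2
    if pa_at + pa_len > length:
        raise ValueError("Total Path Attribute Length overruns the message")
    # Whatever follows the Path Attributes is taken to be the NLRI.
    return msg[21:pa_len_at], msg[pa_at:pa_at + pa_len], msg[pa_at + pa_len:]


def walk_attributes(attrs: bytes):
    """Return [(flags, type_code, value)]; raise ValueError if framing breaks."""
    found, pos = [], 0
    while pos < len(attrs):
        # Attribute header is flags, type, then a one- or two-octet length.
        header = 4 if attrs[pos] & EXTENDED_LENGTH else 3
        if pos + header > len(attrs):
            raise ValueError("truncated attribute header")
        flags, type_code = attrs[pos], attrs[pos + 1]
        if header == 4:
            (alen,) = struct.unpack("!H", attrs[pos + 2:pos + 4])
        else:
            alen = attrs[pos + 2]
        pos += header
        if pos + alen > len(attrs):
            raise ValueError(f"attribute type {type_code}: length {alen} overruns")
        found.append((flags, type_code, attrs[pos:pos + alen]))
        pos += alen
    return found


def make_update(attrs: bytes, nlri: bytes) -> bytes:
    """Hypothetical helper: wrap attrs and NLRI in an UPDATE with no withdrawals."""
    body = struct.pack("!H", 0) + struct.pack("!H", len(attrs)) + attrs + nlri
    return MARKER + struct.pack("!HB", HEADER_LEN + len(body), TYPE_UPDATE) + body
```

Note what the sketch does not catch: an attribute whose length is wrong but still fits inside the Path Attributes passes the walk, and the parser simply swallows or misreads whatever follows it. That is the sense in which an error in one attribute can affect the others.
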
Where an UPDATE does not contain a Critical error, the receiver has (by definition) the NLRI (which it believes it has received correctly) and perhaps some Attributes. What the receiver then does may depend on its confidence in what it has managed to extract from the broken message. All of that can be left as an exercise for implementers. The requirements should focus on the implications from a routing/operational perspective and offer some criteria for acceptable behaviour.

The current standard requires (for safety) that any error invalidates everything learned in the Session. One step from there is that some errors only invalidate all the NLRI referred to in the erroneous UPDATE message -- which (for safety) discards all attributes in the message. A further step is that some errors do not invalidate the NLRI in the erroneous UPDATE message, but processing proceeds with some subset of the attributes. For my money, that further step is a giant step, and deserves to be covered at the Requirements level. [Another step is that some errors invalidate everything learned in the Session about a given AFI/SAFI or (possibly) VPN.]

...
> [rjs]: Please note that the requirements draft does not present distinctions such as recoverable and ignorable. We went around this loop previously. I think that in some cases, some specific errors may be handled by 'patching' or 'ignoring' specific errors. But generically, these are exceptions - the requirements try and define broader categories, if a particular attribute needs something else (e.g., AS4_PATH may have information it can recover from other attributes) then this can be handled in error handling solution considerations of these attributes or as it is defined going forward.

I'm sorry to have missed the previous discussion.
Suffice it to say that, as above, I think that Ignoring or Recovering (patching up) some errors is materially different to examining each attribute carefully and dumping the whole lot on the floor if any one is invalid.

But this touches on the incompleteness (IMHO) of the classification. For me, one can consider a semantic error in an attribute only after establishing that it is (in my terms) correctly "framed". Once one has parsed a set of attributes, and concluded there is no reason to believe that some invalid attribute length has thrown the parser off track, then one can get into what to do with the contents of each attribute.

I think the difficulty is (repeating myself, sorry) exactly illustrated by the question of what to do with an ATOMIC_AGGREGATE attribute which is apparently 421 octets long.

At a requirements level, you may not wish to get into the detail of this. But as it stands, the draft classifies pretty much every way in which an attribute can be broken as a Semantic error. This seems to me to miss the important fact that an attribute is only an attribute once the Path Attributes part of the message has been parsed satisfactorily -- up to that point, the entire Path Attributes are a pretty random looking collection of octets.

...
> [rjs]: The requirement the document makes is explicitly that not all errors are defined as critical (if they were, the requirement specified by section 3 would not be met, and we would stick with the behaviour we have right now). The reason for a distinction between critical and semantic is that there are certain errors that mean that cannot be localised to certain NLRI.

OK. Sure. We have Critical and Not-Critical errors. Not-Critical errors are those for which we can extract the NLRI. A key reason for not being able to extract the NLRI is encountering an error when parsing the Attributes.
Some errors may suggest that the sender has gone barking mad, and it is not possible to say whether there are NLRI there to extract or not. Other errors may be less alarming. The given definition of Semantic errors does not distinguish.

As above, the requirements could step back from the parsing issue, and specify only the need to reliably extract NLRI. And, if it is a requirement to proceed to process (as opposed to just invalidate) NLRI from a broken message, the requirements should specify the need to reliably extract a good enough subset of attributes to proceed with.

At the very least, I suggest that Critical (severity of error) and Semantic (form of error) are orthogonal notions.

> [rjs]: I hope you do not see these comments as dismissive of what you have put together - I think that this is where operational and implementation views diverge. My view is that I need to understand what the impact to a service, the device and the network is during these error conditions (and balance the risk of incorrectness against the correctness of the protocol). From an implementation perspective, clearly, one needs to understand exactly which circumstances one can extract the NLRI, and the particulars of how this is achieved. I would encourage discussion that falls into the latter category such that we define the solutions draft to have the relevant guidance where required. Comments on the former should absolutely live in the requirements draft

Sure. I am trying to make the case that where the Requirements touch on the implementation issues, it is going too deep. And, in lumping all kinds of errors together and deeming them to be Semantic errors, the issues there are obscured rather than brought into focus.

If the Requirements were written without reference to the internal organisation of the BGP UPDATE message, that would be fine. As you say, what really matters is the operational impact of changes to the protocol which may include, inter alia:

1.
   some routes will be treated as "good" while others from the same source have been deemed invalid.

   This is the effect of, for example, "treat-as-withdraw". How much confidence can one have in the "good" routes, if the peer is sending a mixture of apparently valid and invalid stuff ? If a peer who sent a bunch of valid routes last week now sends a number of invalid ones, what do we think about the ones which remain "good" ? Should there be a mechanism to de-preference remaining "good" routes ?

   Is a response that has this effect required to be just the first step in a longer process, in which the cause of the error is dealt with ? In which case, can more risks be taken when selecting such a response ?

2. some routes will be treated as "good" which should be treated as invalid.

   This is the effect of not treating an error as Critical, but not identifying all affected NLRI. If this is not acceptable, then an implementation must take some care when deciding whether it has extracted all NLRI from a broken message. If there are degrees of acceptability, then an implementation would need to take a view... presumably based on some understanding of the likely operational impact ? Or there should be configuration knobs to twiddle ?

3. some routes will be treated as valid which would previously have been treated as invalid.

   If the rules for validating attributes are changed, then some routes might be accepted with a variety of issues with their attributes. If processing proceeds with a partial set of attributes, routing may be affected. If this is not acceptable, then an implementation must take great care here. While the question of which attributes and under what conditions this might happen is clearly an implementation issue, the acceptability of the result must be judged by its operational impact.

Also important from an operational perspective may be how any new features to support better error handling are deployed. Clearly new code is involved.
But if new capabilities and new code at both ends of an eBGP conversation are required, is that an issue ? Should new behaviour in BGP be required to be enabled by configuration, or enabled by default but with suitable override configuration options ?

I am in violent agreement with you. The high level requirements are essentially operational ones. My suggestion is that the requirements would be improved by backing out of the discussion of the internal structure of BGP UPDATE messages. On the other hand, classifying errors in terms of what information is lost/preserved, would improve things. Such analysis might lead to requirements which can only be met by changes at the protocol/implementation level, and if so, more power to it, say I.

[Those changes might be: (a) a greater separation of NLRI from Attributes, (b) more redundancy in the framing of attributes so that the parser can have greater confidence that it has identified all attributes sent, (c) a means to identify AFI/SAFI or VPN independent of the NLRI, even ?, (d) etc... in addition to the various other features mentioned in the draft for tearing down parts of a session, recovering parts of the RIB, signalling errors, monitoring etc. etc.]

Thanks,

Chris

PS: section 2 refers to "analysis of incidents". Is there a collection somewhere one could take into consideration when making implementation decisions ? Is there evidence, for example, that well known attributes with invalid flags and/or lengths have been a problem in practice ?