Re: [Tsv-art] Tsvart early review of draft-ietf-trill-over-ip-10 - ECN & DSCP considerations

Donald Eastlake <d3e3e3@xxxxxxxxx> · Fri, 30 Jun 2017 21:43:13 -0400

Hi David,

On Mon, Jun 26, 2017 at 3:04 PM, Black, David <David.Black@xxxxxxxx> wrote:
> Adding some comments on ECN and DSCP ...
>
>> > Section 4.3:
>> >
>> >    TRILL over IP implementations MUST support setting the DSCP value in
>> >    the outer IP Header of TRILL packets they send by mapping the TRILL
>> >    priority and DEI to the DSCP. They MAY support, for a TRILL Data
>> >    packet where the native frame payload is an IP packet, mapping the
>> >    DSCP in this inner IP packet to the outer IP Header with the default
>> >    for that mapping being to copy the DSCP without change.
>> >
>> > I think it is fine to require that implementations are capable of setting
>> > DSCP values on the outer IP header. However, I fail to see any discussion of
>> > the potential issues with actually setting the DSCP values. It is one thing to
>> > do this in an IP backbone use case where one can know and have control
>> > over the PHB that the DSCP values map to. But otherwise, over the general
>> > Internet, the behavior is not that predictable. One can easily be subject to
>> > policers or remapping. Also, as the actual DSCP code point usage is domain
>> > specific, this is difficult. Priority reversal is likely the least of the
>> > problems that this can run into over the general Internet.
>>
>> It sounds like appropriate discussion and warnings about these issues
>> would resolve the above comment.
>
> For ECN, see RFC 6040 and draft-ietf-tsvwg-rfc6040update-shim.  In particular,
> copying the inner ECN codepoint to the outer IP header encapsulation without
> requiring decapsulation processing as specified in RFC 6040 or the 6040update-shim
> draft can lose congestion indications from the network and hence is wrong
> (it's also wrong wrt RFC 3168, but RFC 6040 and the 6040update-shim drafts are
> better and more current references).

That's a good point.
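
To make the failure mode concrete, here is a minimal Python sketch of the
default tunnel egress (decapsulation) behaviour tabulated in RFC 6040. The
function and constant names are illustrative only and do not come from the
draft; `naive_decap` is a hypothetical "just keep the inner field" egress
used purely for contrast.

```python
# ECN codepoints: the two low-order bits of the IPv4 TOS / IPv6 Traffic Class octet.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def decap_ecn(inner, outer):
    """Combine inner and outer ECN fields at tunnel egress (RFC 6040 default).

    Returns the ECN codepoint for the forwarded inner packet, or None when
    the packet is to be dropped.
    """
    if outer == CE:
        # Congestion was signalled on the tunnel path. A Not-ECT inner packet
        # cannot carry the mark onward, so it is dropped instead.
        return None if inner == NOT_ECT else CE
    if inner == NOT_ECT or inner == CE:
        # Not-ECT stays Not-ECT; an inner CE mark is never cleared.
        return inner
    if outer == ECT1:
        # Inner is ECT(0) or ECT(1): an outer ECT(1) is propagated.
        return ECT1
    return inner


def naive_decap(inner, outer):
    """Hypothetical egress that ignores the outer header entirely."""
    return inner


# An ECT(0) packet that was CE-marked inside the tunnel:
print(decap_ecn(ECT0, CE) == CE)       # the congestion indication survives
print(naive_decap(ECT0, CE) == ECT0)   # the congestion indication is lost
```

The second print is the case David describes: copying ECN to the outer
header at ingress without this kind of processing at egress silently
discards marks applied along the tunnel path.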

> For DSCPs, start with RFC 2983 - thinking about the validity (or likely validity)
> of the outer DSCP at the decapsulator may help in choosing whether to
> recommend a uniform model (e.g., copy DSCP out at ingress, copy back in at
> egress) or a pipe model (e.g., do something reasonable for outer DSCP at
> ingress, ignore it on egress) as the implementation default.

I believe the default behavior in the current draft is the best
default. That sets DSCP based on the same TRILL Header indicia that
control default QoS on non-IP links.
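
For reference, the default mapping in Section 4.3 of the draft (quoted
further down in this thread) can be sketched in Python as follows. This is
only an illustration of that table, assuming the per-port configurable
mapping is a simple priority-indexed lookup; the names are hypothetical.

```python
# Default TRILL priority -> DSCP mapping as given in the draft's Section 4.3
# table. Priority 0 maps to 001000 and priority 1 to 000000, so that
# priority 1 is treated as lower than the default priority 0.
DEFAULT_PRIORITY_TO_DSCP = {
    0: 0b001000,
    1: 0b000000,
    2: 0b010000,
    3: 0b011000,
    4: 0b100000,
    5: 0b101000,
    6: 0b110000,
    7: 0b111000,
}


def outer_tos_octet(priority, dei=0, ecn=0, mapping=DEFAULT_PRIORITY_TO_DSCP):
    """Build the outer TOS / Traffic Class octet for a TRILL over IP packet.

    The DEI bit is accepted but does not affect the default mapping, as the
    draft states. The DSCP occupies the upper six bits of the octet and the
    ECN field the lower two.
    """
    dscp = mapping[priority]
    return (dscp << 2) | (ecn & 0b11)


# Priorities sorted by their default DSCP value, lowest first.
order = sorted(DEFAULT_PRIORITY_TO_DSCP, key=DEFAULT_PRIORITY_TO_DSCP.get)
print(order)
```

Sorting by DSCP value reproduces the draft's intended sequence
1, 0, 2, 3, 4, 5, 6, 7, but only under the assumption that a larger class
selector value gets better treatment, which is exactly what the CS1
discussion below calls into question.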

> -- DSCP mapping to/from TRILL/Ethernet priorities
>
>> The intent in the draft is to reflect the default relative priority of
>> the different priority code points in IEEE Std 802.1Q where priority 1
>> is lower than priority 0. At a quick look, it appears to me that RFC
>> 2474 requires that 001000 (binary) be handled as being of a priority not
>> lower than the priority with which 000000 is handled. Yet RFC 3662,
>> which you point to, seems to suggest using 001000 as a lower
>> priority code point than 000000. Given that 3662 not only does not
>> update 2474 but is only Informational while 2474 is Standards Track, I
>> would say that 2474 dominates and that this draft makes the best
>> assumptions it can about default behavior...
>
> Well ... that's a discussion about text in RFCs that are well over a decade
> old, and in an area (less-than-best-effort service) where the aspirations
> of at least RFC 3662 weren't realized ... but that RFC is not safe to ignore,
> either.
>
> In practice, the specification of CS1 for less-than-best-effort service has
> been promulgated by RFC 4594 rather than RFC 3662, and RFC 4594 has
> had significant "running code" impact on network design and operation.
>
> As Magnus mentioned RFC 7657, I strongly suggest starting from the
> RFC 7657 discussion of this topic in order to figure out what to do.  I'm
> not sure what to recommend, but I do think that starting from
> RFC 7657 (rather than RFC 2474 and RFC 3662) is the better approach.

OK.

> FWIW, the TSVWG WG is in the process of figuring out which DSCP
> to recommend for less-than-best-effort-service in place of CS1 - that's
> likely to be an active topic of discussion in Prague.

I'll try to attend that session.

Thanks,
Donald
===============================
 Donald E. Eastlake 3rd   +1-508-333-2270 (cell)
 155 Beaver Street, Milford, MA 01757 USA
 d3e3e3@xxxxxxxxx

> Thanks, --David
>
>> -----Original Message-----
>> From: Tsv-art [mailto:tsv-art-bounces@xxxxxxxx] On Behalf Of Donald
>> Eastlake
>> Sent: Sunday, June 25, 2017 8:07 PM
>> To: Magnus Westerlund <magnus.westerlund@xxxxxxxxxxxx>
>> Cc: tsv-art@xxxxxxxx; draft-ietf-trill-over-ip.all@xxxxxxxx; IETF Discussion
>> <ietf@xxxxxxxx>; trill@xxxxxxxx
>> Subject: Re: [Tsv-art] Tsvart early review of draft-ietf-trill-over-ip-10
>>
>> Hi Magnus,
>>
>> Thanks for the extensive review. See my responses below.
>>
>> On Thu, Jun 15, 2017 at 1:32 PM, Magnus Westerlund
>> <magnus.westerlund@xxxxxxxxxxxx> wrote:
>> >
>> > Reviewer: Magnus Westerlund
>> > Review result: Not Ready
>> >
>> > Early review of draft-ietf-trill-over-ip-10
>> >
>> > TSV-ART review comments:
>> >
>> > I have set this to not ready as there are several issues, some significant that
>> > could affect the protocol realization significantly. Some may be me missing
>> > things in TRILL, I was not that familiar with it before this review and I have
>> > only tried looking up things, not reading the whole earlier specifications. So
>> > don't hesitate to push back and provide pointers to things that can resolve
>> > issues. The authors and the WG clearly have thought about a lot of issues
>> > and dealt with much already.
>>
>> OK. Hopefully we can resolve these one way or the other.
>>
>> > Diffserv usage
>> > --------------
>> >
>> > Section 4.3:
>> >
>> >    TRILL over IP implementations MUST support setting the DSCP value in
>> >    the outer IP Header of TRILL packets they send by mapping the TRILL
>> >    priority and DEI to the DSCP. They MAY support, for a TRILL Data
>> >    packet where the native frame payload is an IP packet, mapping the
>> >    DSCP in this inner IP packet to the outer IP Header with the default
>> >    for that mapping being to copy the DSCP without change.
>> >
>> > I think it is fine to require that implementations are capable of setting
>> > DSCP values on the outer IP header. However, I fail to see any discussion of
>> > the potential issues with actually setting the DSCP values. It is one thing to
>> > do this in an IP backbone use case where one can know and have control
>> > over the PHB that the DSCP values map to. But otherwise, over the general
>> > Internet, the behavior is not that predictable. One can easily be subject to
>> > policers or remapping. Also, as the actual DSCP code point usage is domain
>> > specific, this is difficult. Priority reversal is likely the least of the
>> > problems that this can run into over the general Internet.
>>
>> It sounds like appropriate discussion and warnings about these issues
>> would resolve the above comment.
>>
>> > Section 4.3:
>> >
>> >    The default TRILL priority and DEI to DSCP mapping, which may be
>> >    configured per TRILL over IP port, is as follows. Note that the DEI
>> >    value does not affect the default mapping and, to provide a
>> >    potentially lower priority service than the default priority 0,
>> >    priority 1 is considered lower priority than 0. So the priority
>> >    sequence from lower to higher priority is 1, 0, 2, 3, 4, 5, 6, 7.
>> >
>> >       TRILL Priority  DEI  DSCP Field (Binary/decimal)
>> >       --------------  ---  -----------------------------
>> >                   0   0/1  001000 / 8
>> >                   1   0/1  000000 / 0
>> >                   2   0/1  010000 / 16
>> >                   3   0/1  011000 / 24
>> >                   4   0/1  100000 / 32
>> >                   5   0/1  101000 / 40
>> >                   6   0/1  110000 / 48
>> >                   7   0/1  111000 / 56
>> >
>> > This appears to be a problematic mapping, at least for priorities 0 and 1. As
>> > priority 0 appears to be intended to be higher than priority 1, it is
>> > interesting that it is mapped to CS1, which, to quote
>> > https://datatracker.ietf.org/doc/rfc7657/:
>> >
>> > CS1 ('001000') was subsequently designated as the recommended
>> >       codepoint for the Lower Effort (LE) PHB [RFC3662].
>> >
>> > So what is proposed can, in a network using the default mapping, result in
>> > priority 0 being treated as lower priority than 1. In addition, in some
>> > networks this can also result in strange remapping that results in a
>> > different PHB for CS1 than.
>>
>> The intent in the draft is to reflect the default relative priority of
>> the different priority code points in IEEE Std 802.1Q where priority 1
>> is lower than priority 0. At a quick look, it appears to me that RFC
>> 2474 requires that 001000 (binary) be handled as being of a priority not
>> lower than the priority with which 000000 is handled. Yet RFC 3662,
>> which you point to, seems to suggest using 001000 as a lower
>> priority code point than 000000. Given that 3662 not only does not
>> update 2474 but is only Informational while 2474 is Standards Track, I
>> would say that 2474 dominates and that this draft makes the best
>> assumptions it can about default behavior...
>>
>> > MTU and Fragmentation
>> > ---------------------
>> >
>> > I think there are two main issues here. The first one is path MTU discovery
>> > of the actual IP path MTU between the ports. That will be needed to prevent
>> > a lot of traffic going into MTU black holes, especially as TRILL requires
>> > 1470 byte support, which is likely above the MTU of a lot of paths.
>>
>> Seems like it would depend on the environments where TRILL was used.
>> For example, I do not think 1470 would be a problem in most Data
>> Center or Internet Exchange Point uses. Data Centers
>> sometimes support 9K jumbo frames and the like.
>>
>> In fact, it is probably bad to focus too much on 1470 -- that is a
>> required minimum to be sure that reasonable size link state PDUs can
>> be successfully flooded through the TRILL campus so that routing will
>> work. However, it would commonly be the case that, for the TRILL
>> campus to be useful in a particular case, links need to be able to
>> carry the expected size TRILL Data packets. For example, if there were
>> two parts of a TRILL campus connected by one or a few TRILL over IP
>> links and the end stations in each part were assuming they could use
>> 1500 byte Ethernet packets, then the TRILL over IP links would need to
>> support an MTU based on 1500 + TRILL Header + IP and TRILL over IP
>> encapsulation. And more if security was being used or there were any
>> other reasons for additional headers/encapsulation...
>>
>> > Section 8.4:
>> >
>> >    Path MTU discovery [RFC4821] should be useful
>> >    in determining the IP MTU between a pair of RBridge ports with IP
>> >    connectivity.
>> >
>> > The issue with RFC4821 is that it has requirements on the packetization
>> > layer. TRILL appears to have several components that are useful. However, it
>> > will require a specification of the procedure to result in a useful tool.
>>
>> See below.
>>
>> > Section 8.4:
>> >
>> >    TRILL IS-IS MTU PDUs, as specified in Section 5 of [RFC6325] and in
>> >    [RFC7177], can be used to obtain added assurance of the MTU of a
>> >    link.
>> >
>> > Yes, that can confirm working MTUs that are at 1470 or above, but appears
>> > prevented from working below 1470?
>>
>> While there is a minimum size for TRILL IS-IS MTU PDUs, determined by
>> header size, it is well below 1470, probably (depending on whether
>> security is in use, etc.) below 150 bytes.
>>
>> > Thus, it appears that there is a lack of mechanism here to actually get a valid
>> > and functional MTU from TRILL in the cases where the Path MTU is below 1470.
>> > If I am wrong, good, but I think this is an important piece for how to handle
>> > the next main issue.
>>
>> How about referencing Section 3 of
>> https://tools.ietf.org/html/draft-ietf-trill-mtu-negotiation-05
>> which is currently in IETF Last Call? (The wording of that section is
>> probably going to be improved based on an OPS review by Brian
>> Carpenter.)
>>
>> > UDP encapsulation and IP fragments.
>> > ----------------------------------
>> > I see it as a big issue that UDP encapsulation is the native one, and that
>> > it relies on IP fragmentation despite the need for reliable fragmentation.
>> > With the setup of having to support 1470 MTU on TRILL level, some packets
>> > will be fragmented in many environments. That will lead to a lot of losses,
>> > and, as discussed below, a very big problem with middleboxes. The main problem
>> > here is that if one tries to rely on IP fragments one will have issues with
>> > packets ending up in black holes, and different problems depending on IPv4 or
>> > IPv6. IPv6 is likely the lesser problem, assuming that one has working PMTUD.
>> >
>> > There are several ways out of this.
>> >
>> > 1. Detect issues and use TCP encapsulation with correctly set MSS to not
>> >    get IP fragments.
>> > 2. Determine MTU and implement a fragmentation mechanism on top of
>> >    UDP.
>>
>> So, I don't see that much problem with UDP being the general default
>> consistent with the TRILL philosophy of defaulting to need zero or
>> minimal configuration. The default should be to use multicast Hellos
>> for discovery of neighbors which sure points at UDP to me. Having to
>> traverse a NAT should be a rare case. Since, in the NAT case, you have
>> to configure things related to the static binding and the IP
>> address(es) of peer(s) anyway you can also configure to use a
>> different encapsulation than UDP, such as TCP, at the same time. I
>> don't see it as much of a problem if, by default, TRILL won't operate
>> through a NAT. If you are using UDP and it fragments and fragments are
>> dropped at a NAT, probably you can't exchange Hellos so you will not
>> form an adjacency and anything on the other side of the NAT will not
>> be visible.
>>
>> > Zero Checksum:
>> > --------------
>> >
>> > Section 5.4:
>> >
>> > UDP Checksum - as specified in [RFC0768]
>> >
>> > Considering the fast path encapsulation desire, I am surprised to not see any
>> > mention of the use of zero checksum here. Raising the zero checksum and a
>> > forward reference would be good I think.
>> >
>> > And then Section 8.5:
>> >
>> >    The requirements for the usage of the zero UDP Checksum in a UDP
>> >    tunnel protocol are detailed in [RFC6936]. These requirements apply
>> >    to the UDP based TRILL over IP encapsulations specified herein
>> >    (native and VXLAN), which are applications of UDP tunnel.
>> >
>> > If you actually intended to allow zero checksum, then you should
>> > document that TRILL fulfills the requirements that the applicability statement
>> > raises. I have not analyzed how well it meets these requirements.
>> >
>> > Please review Section 6.2 of RFC 8086 for example how that can be done.
>>
>> OK. We'll look into it.
>>
>> > TCP Encapsulation issue
>> > -----------------------
>> >
>> > Section 5.6:
>> >
>> > The TCP encapsulation appears to be missing a delimiter format allowing each
>> > individual TRILL packet/payload to be read out of the TCP byte stream. In
>> > other words, a normal implementation has no way of ensuring that the TCP
>> > payload starts with the start of a new TRILL payload. Multiple small TRILL
>> > payloads may be included in the same TCP payload, and also only parts, as TCP
>> > is one way of dealing with TRILL packets that are larger than the
>> > IP+Encapsulation MTU that actually will work.
>> >
>> > This comment is based on the observation that there appear to be no length
>> > fields included in the TRILL header. The most straightforward delimiter is a
>> > 2-byte length field for the TRILL payload to be encapsulated.
>>
>> Right. It might also be useful to include some sort of check field, as
>> is done in BGP, to detect if you are out of sync in parsing the TCP
>> stream.
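
As an illustration of the 2-byte length delimiter suggested above, here is
a minimal Python sketch. It is not a format defined by the draft: the
network byte order, the 65535-byte limit, and the function names are all
assumptions, and no check field is included.

```python
import struct


def frame(payload: bytes) -> bytes:
    """Prefix one TRILL payload with a 2-byte big-endian length."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length field")
    return struct.pack("!H", len(payload)) + payload


def deframe(buf: bytes):
    """Split a received byte stream into complete payloads.

    Returns (payloads, leftover), where leftover holds the bytes of a
    partially received frame that must be kept until more data arrives.
    """
    payloads = []
    pos = 0
    while len(buf) - pos >= 2:
        (length,) = struct.unpack_from("!H", buf, pos)
        if len(buf) - pos - 2 < length:
            break  # the rest of this frame has not arrived yet
        payloads.append(buf[pos + 2:pos + 2 + length])
        pos += 2 + length
    return payloads, buf[pos:]


# Two complete frames followed by the first three bytes of a third.
stream = frame(b"abc") + frame(b"") + frame(b"xy")[:3]
print(deframe(stream))
```

Because TCP segment boundaries carry no meaning, the receiver has to keep
the leftover bytes and prepend them to the next read; that is the state a
length field makes manageable.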
>>
>> Another point is that, while with UDP it seems fine to send packets
>> with assorted QoS, you don't want to encourage re-ordering of TCP
>> packets in a stream. So if TCP encapsulation is being used, you want
>> to use the same DSCP value for the packets in a particular TCP stream.
>> So, generally, you need to have a TCP connection per priority handling
>> category. Mapping the 8 priority levels into a smaller number of
>> handling categories is a normal thing to do so you certainly don't
>> necessarily need 8 TCP connections. Adding material on this should not
>> be too hard.
>>
>> > Section 5.6:
>> >
>> > TCP endpoint requirements. I do wonder if an application like TRILL actually
>> > would need to discuss performance impacting implementation choices or
>> > limitations. For example, use of Nagle, or the requirements on buffer sizes in
>> > relation to bandwidth-delay products, as buffer memory in an RBridge will
>> > impact performance.
>>
>> Well, I'm not sure how deeply this document should get into such
>> performance issues. What about just saying something about
>> consideration being given to tuning TCP for performance and pointing
>> to one or a few other RFCs that talk about this?
>>
>> > Congestion Control
>> > ------------------
>> > First thanks for the effort here.
>>
>> You're welcome.
>>
>> > 8.1.2 In Other Environments
>> >
>> >    Where UDP based encapsulation headers are used in TRILL over IP in
>> >    environments other than those discussed in Section 8.1.1, specific
>> >    congestion control mechanisms are commonly needed.  However, if the
>> >    traffic being carried by the TRILL over IP link is already congestion
>> >    controlled and the size and volatility of the TRILL IS-IS link state
>> >    database is limited, then specific congestion control may not be
>> >    needed. See [RFC8085] Section 3.1.11 for further guidance.
>> >
>> > This is correct; however, my question is whether the RBridges have any way of
>> > knowing which traffic is actually congestion controlled, considering that
>> > TRILL provides a layer 2 abstraction. I wonder if there should be any type of
>> > white list of the types of layer 2 payloads that can be assumed to be
>> > congestion controlled, and thus okay to forward over IP paths? I am worried
>> > that the lack of any recommendation to prevent uncontrolled traffic from
>> > being forwarded can lead to congestion issues.
>> >
>> > The other issue I think may exist is the one that serial unicast emulation of
>> > broadcast/multicast creates. As this amplifies the outgoing packet rate by
>> > a factor of how many addresses are configured for serial unicast, this can
>> > be significant traffic expansion. Thus, I think additional considerations are
>> > needed here, and maybe rate limiting of the amount of traffic to be
>> > multicast.
>>
>> OK. We can think about those issues.
>>
>> > Flow and ECMP
>> > -------------
>> >
>> > Section 8.3:
>> >
>> > For example, for TRILL
>> >    Data, this entropy field could be based on some hash of the
>> >    Inner.MacDA, Inner.MacSA, and Inner.VLAN or Inner.FGL.
>> >
>> > I would appreciate clearer references to what these fields are.
>>
>> In a TRILL Data packet, the payload after the TRILL Header looks like
>> an Ethernet frame except that there is always either a VLAN tag or,
>> alternatively, where the VLAN tag would be, a Fine Grained Label
>> [RFC7172]. (The preceding is the view in the TRILL RFCs, but there is
>> an equivalent and equally valid view in which all the fields through
>> and including the VLAN or FGL tag are part of the TRILL Header.) The
>> TRILL base protocol specification focuses on Ethernet as a link
>> technology between TRILL switches, in which case there will be a link
>> header including an Outer.MacDA and Outer.MacSA fields and possibly an
>> Outer.VLAN, all before the TRILL Header. See Figure 1 and Figure 2 in
>> RFC 7172.
>>
>> Some of the above could be added to the draft for clarity.
>>
>> > If I understand this correctly, the idea here is to look into the inner
>> > layer 2 frames, and use the flow equivalents that exists on that level and
>> > hash that into value that maps the flows onto the source port range.
>>
>> Yes.
>>
>> > I think this text should include a summary of the principle and ensure to
>> > note the important requirement that what is considered a flow in the inner
>> > frames must not be striped over multiple source ports, as this may lead to
>> > reordering issues due to packets taking different paths.
>>
>> Well, we can add some text. But when would the relative ordering
>> matter for two TRILL Data packets where the two inner native payloads
>> have different values for any one or more of these three fields
>> (Inner.MacDA, Inner.MacSA, and inner VLAN/FGL tag) ? If any of those
>> fields are different, you are talking about different streams.
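
A minimal Python sketch of the entropy idea, assuming the hash is CRC-32
and the result is mapped into the 49152-65535 port range; neither choice is
specified by the draft, and the function name is hypothetical.

```python
import zlib


def entropy_source_port(mac_da: bytes, mac_sa: bytes, label: bytes,
                        low: int = 49152, high: int = 65535) -> int:
    """Derive a UDP source port from the inner flow-identifying fields.

    mac_da and mac_sa are the Inner.MacDA and Inner.MacSA; label is the
    Inner.VLAN or Inner.FGL value as bytes. Packets with identical fields
    always get the same port, so one inner flow is never striped over
    several source ports.
    """
    digest = zlib.crc32(mac_da + mac_sa + label)
    return low + digest % (high - low + 1)


flow = (bytes.fromhex("0200000000aa"), bytes.fromhex("0200000000bb"), b"\x00\x64")
print(entropy_source_port(*flow) == entropy_source_port(*flow))
```

Two packets that differ in any of the three fields may or may not land on
the same port, which is harmless: they belong to different streams.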
>>
>> > NAT and TRILL over IP:
>> > Section 8.5:
>> >
>> > If one would like to use TRILL over IP through a NAT, then there are some very
>> > important considerations that are missing. First, the need for static binding
>> > configurations or the need for determining one's external address(es) and
>> > being able to communicate that to the peer RBridges, and in addition ensuring
>> > that one has keep-alives so that the NAT binding never times out.
>>
>> I think those are good points. There is an additional problem that
>> TRILL Hellos detect neighbors with which they have 2-way connectivity
>> by indicating, inside the Hellos that are sent, from what neighbors
>> Hellos have been received on that port. If a NAT is involved, these
>> neighbor addresses inside Hellos need to be mapped.
>>
>> > Next is the issue that there is almost zero chance of getting an IP/UDP
>> > encapsulated TRILL payload through the NAT if it results in IP fragmentation,
>> > as NATs don't defragment and refragment on the internal side, and a
>> > non-initial IP fragment lacks the UDP ports and thus can't be matched to a
>> > binding.
>>
>> So perhaps the recommendation should be to configure the port to use
>> TCP if there will be fragmentation.
>>
>> > Also, if you would like to run IP/ESP through a NAT, then you most likely need
>> > the IP/UDP/ESP encapsulation (https://tools.ietf.org/html/rfc3948). Note that
>> > this will restrict the MTU even further and thus ensure that the 1470
>> > requirement cannot be fulfilled, even without additional tunnels, over a 1500
>> > byte MTU Ethernet infrastructure.
>> >
>> > I would note that also firewalls likely have issues with IP fragments for the
>> > same reason, they require significant amount of state to be verified if they
>> > should be let through.
>> >
>> > In general I think you should create a configuration that has a chance to work
>> > through most middleboxes, but I think you should require static bindings. I
>> > think that configuration is, and don't laugh now,
>> > IP/UDP/ESP/TCP/TRILL; otherwise you will not be able to have both security
>> > and reliable fragmentation of TRILL packets.
>>
>> OK. Thanks again for this review. It has pointed out a number of
>> problems and in thinking about those, I believe a couple of further
>> problems have come to mind that I mentioned above. We'll work on a
>> revised draft.
>>
>> Thanks,
>> Donald
>> ===============================
>>  Donald E. Eastlake 3rd   +1-508-333-2270 (cell)
>>  155 Beaver Street, Milford, MA 01757 USA
>>  d3e3e3@xxxxxxxxx
>>
>> > Cheers
>> >
>> > Magnus Westerlund
>>