On 15/06/2019 22:19, Johannes Berg wrote: > Hi Alexei, > > Sorry for messing up your address btw, not sure where I dug that one > up... > >>> 1) Make the bridge code use skb->mac_len instead of ETH_HLEN. This >>> works for this particular case, but breaks some other cases; >>> evidently some places exist where skb->mac_len isn't even set to >>> ETH_HLEN when a packet gets to the bridge. I don't know right now >>> what that was, I think probably somebody who's CC'ed reported that. >>> >>> 2) Let tc_act_vlan() just pull ETH_HLEN instead of skb->mac_len, but >>> this is rather asymmetric and strange, and while it works for this >>> case it may cause confusion elsewhere. >>> >>> 2b) Toshiaki said it might be better to make that code *remember* >>> mac_len and then use it to push and pull (so not caring about the >>> change made by skb_vlan_push()), but that ultimately leads to >>> confusion and if you have TC push/pop combinations things just get >>> completely out of sync and die >>> >>> 3) Make skb_vlan_push()/_pop() just not change mac_len at all. So far >>> this also addresses the issue, but it's likely that this will break >>> OVS, and I don't know how it'd affect BPF. Quite possibly like TC >>> does and is broken, but perhaps not. >>> >>> >>> But now we're stuck. Depending on how you look at it, all of these seem >>> sort of reasonable, or not. >>> >>> Ultimately, the issue seems to be that we couldn't really decide whether >>> VLAN tags (and probably MPLS tags, for that matter) are covered by >>> mac_len or not. At least not consistently on ingress and egress. >>> eth_type_trans() doesn't take them into account, so of course on simple >>> ingress mac_len will only cover the ETH_HLEN. >>> >>> If you have an accelerated tag and then push it into the SKB, it will >>> *not* be taken into account in mac_len. OTOH, if you have a new tag and >>> use skb_vlan_push() then it *will* be taken into account. >>> >>> >>> I'm trending towards solution (3), because if we consider other >>> combinations of VLAN push/pop in TC, I think we can end up in a very >>> messy situation today. For example, POP/PUSH seems like it should be a >>> no-op, but it isn't due to the mac_len, *unless* it can use the HW accel >>> only (i.e. only a single tag). >>> >>> I think then to propose such a patch though we'd have to figure out >>> where the BPF case is, and to keep OVS working probably either add an >>> argument ("bool adjust_mac_len") to the function signatures, or just do >>> the adjustments in OVS code after calling them? >>> >>> >>> Any other thoughts? >> >> imo skb_vlan_push() should still change mac_len. >> tc, ovs, bpf use it and expect vlan to be part of L2. > > I'm not sure tc really cares, but it *is* a reasonable argument to make. > > Like I said, whichever way I look at the problem, a different solution > looks more reasonable ;-) > >> There is nothing between L2 and L3 :) >> Hence we cannot say that vlan is not part of L2. >> Hence push/pop vlan must change mac_len, since skb->mac_len >> is kernel's definition of the length of L2 header. > > I think we're getting to something here now. I actually thought about > this some more last night, and basically asked myself how I would design > it without all the legacy baggage. > > I'm certainly not suggesting we should change anything here, but to me > it was a bit of a clarification to do this and then see where we differ > in our handling today. > > Thinking along those lines, I sort of ended up with the following scheme > (just for the skb head, not the frags/fraglist): > > +------------------+----------------+---------------+ > headroom | eth | vlan | ... | IP | TCP | payload | tailroom > +------------------+----------------+---------------+ > ^ skb->head_ptr > ^ skb->l2_ptr > ^ skb->l3_ptr == skb->l2_ptr + skb->l2_len > ... > ^ skb->payload_ptr > ^ skb->tail > > Now, I deliberately didn't put any "skb->data" here, because what we do > today is sort of confusing. > > By getting rid of the "multi-use" skb->data in this scheme I think it > becomes clearer to think about. > > > > On *egress*, all we really care about is this: > > +------------------+----------------+---------------+ > headroom | eth | vlan | ... | IP / TCP | payload | tailroom > +------------------+----------------+---------------+ > > ^ skb->data > skb->data + skb->len > ^ > > On *ingress*, however, we hide some of the data (temporarily): > > |--------- headroom ---------| > +------------------+----------------+---------------+ > | eth | vlan | ... | IP / TCP | payload | tailroom > +------------------+----------------+---------------+ > ^ skb->data - skb->mac_len > ^ skb->data > skb->data + skb->len ^ > > which is somewhat confusing to me, and sort of causes all these > problems. > > (It also makes it harder to reason about what data is actually valid in > the skb, although if mac_len is non-zero then it must be, but it means > you actually have less headroom and all). > > If instead we just made it like the hypothetical scheme I outlined > above, then on traversing a layer we'd set the next layer pointer > appropriately, and then each layer would just use the right pointer: > > * bridge/ethernet driver/... would use l2_ptr > * IP would use the l3_ptr > * TCP would use the l4_ptr (didn't put that into the picture) > * ... > > Now we wouldn't have a problem with the VLAN tags, because we'd just > appropriate set/keep all the pointers - bridge doesn't even care where > l3_ptr is pointing, but for IP it would of course point to beyond the > VLAN tags. > > (Now, if you wanted to implement this, you probably wouldn't have l2_ptr > but l2_offset etc. but that's an implementation detail.) > I do like the scheme outlined above, it makes it easier to reason about all of this, but obviously it'd require quite some changes. > Now, why am I writing all this? Because I think it points out that > you're absolutely right - we should treat mac_len as part of the frame > if we're in anything that cares about L2 like bridge. > >> Now as far as bridge... I think it's unfortunate that linux >> adopted 'vlan' as a netdevice model and that's where I think >> the problem is. > > I'm not sure. I don't exactly know where the problem is if we fix bridge > according to the patch (1) above, which, btw, was discussed before: > https://lore.kernel.org/netdev/20190113135939.8970-1-zahari.doychev@xxxxxxxxx/ > > Back then, Nikolay (whom I forgot to CC, fixed now) said: > >> It breaks connectivity between bridge and >> members when vlans are used. The host generated packets going out of the bridge >> have mac_len = 0. > > Which probably indicates that we're not even consistent with the egress > scheme I pointed out above, probably because we *also* have > hard_header_len? > IIRC, mac_len is only set on Rx, while on Tx it usually isn't. More below. > Maybe somewhere early in the egress path we should set skb->mac_len to > dev->hard_header_len, and then use skb->mac_len consistently, and > consider that part of the skb (and not arbitrarily consider ETH_HLEN to > be part of the skb in bridge). > > (This almost tempts me to actually try to implement the hypothetical SKB > scheme I described above, just so it's easier to understand what part > does what ... and to find where the issues like this occur) > Interesting idea. >> Typical bridge in the networking industry is a device that >> does forwarding based on L2. Which includes vlans. >> And imo that's the most appropriate way of configuring and thinking >> about bridge functionality. >> Whereas in the kernel there is a 'vlan' netdevice that 'eats' >> vlan tag and pretends that the rest is the same. >> So linux bridge kinda doesn't need to be vlan aware. >> CONFIG_BRIDGE_VLAN_FILTERING was the right step, but I haven't >> seen it being used and I'm not sure about state of things there. > > I think that ends up being a question of semantics. You can consider an > "industry bridge" that you describe to be a combination of VLAN and > bridge netdevs, and so it's just a question of what exactly you consider > a "bridge" - does it have to be a single netdev or not. > >> So your option 1 above is imo the best. The bridge needs to deal >> with skb->mac_len and full L2 header. > > Yeah, I guess. We're back to square 1 ;-) > > I'm not even sure I understand the bug that Nikolay described, because > br_dev_xmit() does: > > skb_reset_mac_header(skb); > eth = eth_hdr(skb); > skb_pull(skb, ETH_HLEN); > > so after this we *do* end up with an SKB that has mac_len == ETH_HLEN, > if it was transmitted out the bridge netdev itself, and thus how would > the bug happen? > I said *mac_len*. :) The above sets mac_header, at that point you'll have the following values: mac_len = 0, mac_header_len = 14 (skb_mac_header_len uses network_header - mac_header which is set there), but that is easy to overcome and if you do go down the path of consistently using and updating mac_len it should work. Cheers, Nik > Thanks, > johannes >