Hi all, Sorry for the long CC list - it's a bit unclear where the issue is. Zahari has been looking at this forever, and we keep talking about it, but every time we squint and look at it in a different way, the issue appears in a different place. Ultimately, the issue appears to be that we cannot make up our mind whether we consider VLAN tags to be covered by mac_len or not. Some history first. For OVS datapath, Simon evidently considered the VLAN tags to be part of the mac_len, since the OVS VLAN push/pop actions manipulate it, see commit 25cd9ba0abc07 ("openvswitch: Add basic MPLS support to kernel"). It's not clear to me, however, why it previously wasn't considered and only MPLS needed to consider it this way - perhaps nobody ever tried double-VLAN tagging or something. Anyway, then in commit 93515d53b133 ("net: move vlan pop/push functions into common code") Jiri moved the code as is into common code, which still kinda made sense since OVS was the only user. Maybe at this point we should've asked whether or not the mac_len manipulations make sense or not though. Now clearly Jiri had an agenda (btw, sorry for misspelling your name all the time), and followed up with commit c7e2b9689ef8 ("sched: introduce vlan action"). I assume this was tested, but probably not in the scenario we have now. There's also commit 4e10df9a60d9 ("bpf: introduce bpf_skb_vlan_push/pop() helpers") which then makes all of this get used in BPF. The bridge code, on the other hand, has basically always (I stopped looking when I hit the end of current git history) assumed that VLAN tags are not part of mac_len, and does skb_push(ETH_HLEN). Next, the scenario. Basically, what Zahari is trying to do is to use TC to push *two* VLAN tags on ingress, before the packet then goes to the bridge. This results in the following flow: We start with an ingress skb, somewhere: -> eth_type_trans() => skb->mac_len = 14, skb->data = ethhdr + 14 Push the first tag with TC: -> tc vlan push (tag1) -> skb_push(mac_len) => skb->data = ethhdr -> skb_vlan_push(tag1) -> __vlan_hwaccel_put_tag(tag1) -> skb_pull(mac_len) => skb->data = ethhdr + 14 => no changes in SKB other than recorded VLAN acceleration Push the second tag with TC -> tc vlan push (tag2) -> skb_push(mac_len) * skb->data = ethhdr -> skb_vlan_push(tag2) -> __vlan_insert_tag() -> skb->mac_len += 4 => skb->mac_len = 18 -> __vlan_hwaccel_put_tag(tag2) -> skb_pull(mac_len) => skb->data = ethhdr + 18 -> bridge -> br_dev_queue_push_xmit() -> skb_push(14) => skb->data = ethhdr + 4 (!!!) -> dev_queue_xmit() => as a result, now the first 4 bytes of the frame are lost, as on xmit drivers expect to start at skb->data (There's some more complication here if we actually don't have HW offload for the VLAN tag, but the end result is basically the same) Possible solutions? So far, Zahari tried three different ways of fixing this: 1) Make the bridge code use skb->mac_len instead of ETH_HLEN. This works for this particular case, but breaks some other cases; evidently some places exist where skb->mac_len isn't even set to ETH_HLEN when a packet gets to the bridge. I don't know right now what that was, I think probably somebody who's CC'ed reported that. 2) Let tc_act_vlan() just pull ETH_HLEN instead of skb->mac_len, but this is rather asymmetric and strange, and while it works for this case it may cause confusion elsewhere. 2b) Toshiaki said it might be better to make that code *remember* mac_len and then use it to push and pull (so not caring about the change made by skb_vlan_push()), but that ultimately leads to confusion and if you have TC push/pop combinations things just get completely out of sync and die 3) Make skb_vlan_push()/_pop() just not change mac_len at all. So far this also addresses the issue, but it's likely that this will break OVS, and I don't know how it'd affect BPF. Quite possibly like TC does and is broken, but perhaps not. But now we're stuck. Depending on how you look at it, all of these seem sort of reasonable, or not. Ultimately, the issue seems to be that we couldn't really decide whether VLAN tags (and probably MPLS tags, for that matter) are covered by mac_len or not. At least not consistently on ingress and egress. eth_type_trans() doesn't take them into account, so of course on simple ingress mac_len will only cover the ETH_HLEN. If you have an accelerated tag and then push it into the SKB, it will *not* be taken into account in mac_len. OTOH, if you have a new tag and use skb_vlan_push() then it *will* be taken into account. I'm trending towards solution (3), because if we consider other combinations of VLAN push/pop in TC, I think we can end up in a very messy situation today. For example, POP/PUSH seems like it should be a no-op, but it isn't due to the mac_len, *unless* it can use the HW accel only (i.e. only a single tag). I think then to propose such a patch though we'd have to figure out where the BPF case is, and to keep OVS working probably either add an argument ("bool adjust_mac_len") to the function signatures, or just do the adjustments in OVS code after calling them? Any other thoughts? Thanks, Zahari & Johannes