Re: VLAN tags in mac_len

Toshiaki Makita <toshiaki.makita1@xxxxxxxxx> · Mon, 17 Jun 2019 20:15:30 +0900

On 2019/06/16 4:19, Johannes Berg wrote:
Hi Alexei,

Sorry for messing up your address btw, not sure where I dug that one
up...

  1) Make the bridge code use skb->mac_len instead of ETH_HLEN. This
     works for this particular case, but breaks some other cases;
     evidently some places exist where skb->mac_len isn't even set to
     ETH_HLEN when a packet gets to the bridge. I don't know right now
     what that was, I think probably somebody who's CC'ed reported that.

  2) Let tc_act_vlan() just pull ETH_HLEN instead of skb->mac_len, but
     this is rather asymmetric and strange, and while it works for this
     case it may cause confusion elsewhere.

  2b) Toshiaki said it might be better to make that code *remember*
      mac_len and then use it to push and pull (so not caring about the
      change made by skb_vlan_push()), but that ultimately leads to
      confusion and if you have TC push/pop combinations things just get
      completely out of sync and die

  3) Make skb_vlan_push()/_pop() just not change mac_len at all. So far
     this also addresses the issue, but it's likely that this will break
     OVS, and I don't know how it'd affect BPF. Quite possibly like TC
     does and is broken, but perhaps not.

But now we're stuck. Depending on how you look at it, all of these seem
sort of reasonable, or not.

Ultimately, the issue seems to be that we couldn't really decide whether
VLAN tags (and probably MPLS tags, for that matter) are covered by
mac_len or not. At least not consistently on ingress and egress.
eth_type_trans() doesn't take them into account, so of course on simple
ingress mac_len will only cover the ETH_HLEN.

If you have an accelerated tag and then push it into the SKB, it will
*not* be taken into account in mac_len. OTOH, if you have a new tag and
use skb_vlan_push() then it *will* be taken into account.

I'm trending towards solution (3), because if we consider other
combinations of VLAN push/pop in TC, I think we can end up in a very
messy situation today. For example, POP/PUSH seems like it should be a
no-op, but it isn't due to the mac_len, *unless* it can use the HW accel
only (i.e. only a single tag).

I think then to propose such a patch though we'd have to figure out
where the BPF case is, and to keep OVS working probably either add an
argument ("bool adjust_mac_len") to the function signatures, or just do
the adjustments in OVS code after calling them?

Any other thoughts?

imo skb_vlan_push() should still change mac_len.
tc, ovs, bpf use it and expect vlan to be part of L2.

I'm not sure tc really cares, but it *is* a reasonable argument to make.

Like I said, whichever way I look at the problem, a different solution
looks more reasonable ;-)

There is nothing between L2 and L3 :)
Hence we cannot say that vlan is not part of L2.
Hence push/pop vlan must change mac_len, since skb->mac_len
is kernel's definition of the length of L2 header.

I think we're getting to something here now. I actually thought about
this some more last night, and basically asked myself how I would design
it without all the legacy baggage.

I'm certainly not suggesting we should change anything here, but to me
it was a bit of a clarification to do this and then see where we differ
in our handling today.

Thinking along those lines, I sort of ended up with the following scheme
(just for the skb head, not the frags/fraglist):

           +------------------+----------------+---------------+
  headroom | eth | vlan | ... | IP  | TCP      | payload       | tailroom
           +------------------+----------------+---------------+
^ skb->head_ptr
           ^ skb->l2_ptr
                              ^ skb->l3_ptr == skb->l2_ptr + skb->l2_len
                                     ...
                                               ^ skb->payload_ptr
                                                               ^ skb->tail

Now, I deliberately didn't put any "skb->data" here, because what we do
today is sort of confusing.

By getting rid of the "multi-use" skb->data in this scheme I think it
becomes clearer to think about.

On *egress*, all we really care about is this:

           +------------------+----------------+---------------+
  headroom | eth | vlan | ... | IP / TCP       | payload       | tailroom
           +------------------+----------------+---------------+

^ skb->data
                                          skb->data + skb->len
^

On *ingress*, however, we hide some of the data (temporarily):

|--------- headroom ---------|
           +------------------+----------------+---------------+
           | eth | vlan | ... | IP / TCP       | payload       | tailroom
           +------------------+----------------+---------------+
           ^ skb->data - skb->mac_len
                              ^ skb->data
                                          skb->data + skb->len ^

which is somewhat confusing to me, and sort of causes all these
problems.

(It also makes it harder to reason about what data is actually valid in
the skb, although if mac_len is non-zero then it must be, but it means
you actually have less headroom and all).

If instead we just made it like the hypothetical scheme I outlined
above, then on traversing a layer we'd set the next layer pointer
appropriately, and then each layer would just use the right pointer:

  * bridge/ethernet driver/... would use l2_ptr
  * IP would use the l3_ptr
  * TCP would use the l4_ptr (didn't put that into the picture)
  * ...

Now we wouldn't have a problem with the VLAN tags, because we'd just
appropriate set/keep all the pointers - bridge doesn't even care where
l3_ptr is pointing, but for IP it would of course point to beyond the
VLAN tags.

(Now, if you wanted to implement this, you probably wouldn't have l2_ptr
but l2_offset etc. but that's an implementation detail.)

Now, why am I writing all this? Because I think it points out that
you're absolutely right - we should treat mac_len as part of the frame
if we're in anything that cares about L2 like bridge.

Now as far as bridge... I think it's unfortunate that linux
adopted 'vlan' as a netdevice model and that's where I think
the problem is.

I'm not sure. I don't exactly know where the problem is if we fix bridge
according to the patch (1) above, which, btw, was discussed before:
https://lore.kernel.org/netdev/20190113135939.8970-1-zahari.doychev@xxxxxxxxx/

Back then, Nikolay (whom I forgot to CC, fixed now) said:

It breaks connectivity between bridge and
members when vlans are used. The host generated packets going out of the bridge
have mac_len = 0.

Which probably indicates that we're not even consistent with the egress
scheme I pointed out above, probably because we *also* have
hard_header_len?

Maybe somewhere early in the egress path we should set skb->mac_len to
dev->hard_header_len, and then use skb->mac_len consistently, and
consider that part of the skb (and not arbitrarily consider ETH_HLEN to
be part of the skb in bridge).

(This almost tempts me to actually try to implement the hypothetical SKB
scheme I described above, just so it's easier to understand what part
does what ... and to find where the issues like this occur)

Typical bridge in the networking industry is a device that
does forwarding based on L2. Which includes vlans.
And imo that's the most appropriate way of configuring and thinking
about bridge functionality.
Whereas in the kernel there is a 'vlan' netdevice that 'eats'
vlan tag and pretends that the rest is the same.
So linux bridge kinda doesn't need to be vlan aware.
CONFIG_BRIDGE_VLAN_FILTERING was the right step, but I haven't
seen it being used and I'm not sure about state of things there.

I think that ends up being a question of semantics. You can consider an
"industry bridge" that you describe to be a combination of VLAN and
bridge netdevs, and so it's just a question of what exactly you consider
a "bridge" - does it have to be a single netdev or not.

So your option 1 above is imo the best. The bridge needs to deal
with skb->mac_len and full L2 header.

Yeah, I guess. We're back to square 1 ;-)

I'll try to explain the problem I see, which cannot be fixed by option 1...
The bug is in tcf_vlan_act(), and mainly in skb->data, not in mac_len.

Consider about vlan packets from NIC, but non-hw-accelerated, where
vlan devices are configured to receive them.

When __netif_receive_skb_core() is called, skb is like this.

+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
      ^
     data

skb->data is at the beginning of the vlan header.
This is reasonable because we did not process the vlan tag at this point.

Then after vlan_do_receive() (receive the skb on a vlan device), the skb is like this.

+-----+--------
| eth | TCP/IP
+-----+--------
      ^
     data

Or if reorder_hdr is off (which does not remove vlan tags when receiving on vlan devices),

+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
             ^
            data

Relying on this mechanism, we are currently able to handle multiple vlan tags.

For example if we have 2 tags,

- On __netif_receive_skb_core() invocation

+-----+------+------+--------
| eth | vlan | vlan | TCP/IP
+-----+------+------+--------
      ^
     data

- After first vlan_do_receive()

+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
      ^
     data

Or if reorder_hdr is off,

+-----+------+------+--------
| eth | vlan | vlan | TCP/IP
+-----+------+------+--------
             ^
            data

When we process one tag, the data goes forward by one tag.

Now looking at TC vlan case...

After it inserts two tags, the skb looks like:

(The first tag is in vlan_tci)
+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
             ^
            data

The data pointer went forward before we process it.
This is apparently wrong. I think we don't want to (or cannot?) handle cases like this
after tcf_vlan_act(). This is why I said we should remember mac_len there.

So, my opinion is:
On ingress, data pointer can be at the end of vlan header and mac_len probably should
include vlan tag length, but only after the vlan tag is processed.

Bridge may need to handle mac_len that is not equal to ETH_HLEN but to me it's a
different problem.

Toshiaki Makita