On 03/28, Willem de Bruijn wrote:
> > > > > > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > > > > > and move on.
> > > > > > > > > >
> > > > > > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > > > > > >
> > > > > > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > > > > > progs/bpf_flow.c is relying on that or not.
> > > > > > > > > So far I think you're saying that in all three cases:
> > > > > > > > > no-skb, skb before rfs, skb after rfs ctx->data points to L2, right?
> > > > > > > > > This has to be preserved.
> > > > > > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > > > > > intention to change that.
> > > > > > > >
> > > > > > > > Just to make sure, we are on the same page, here is what
> > > > > > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > > > > > >
> > > > > > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > > > > > +----+----+-----+--+
> > > > > > > > |DMAC|SMAC|PROTO|L3|
> > > > > > > > +----+----+-----+--+
> > > > > > > >                 ^
> > > > > > > >                 +-- nhoff
> > > > > > > >                     proto = PROTO
> > > > > > > >
> > > > > > > > VLAN no-skb (eth_get_headlen):
> > > > > > > > +----+----+----+---+-----+--+
> > > > > > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > > > > > +----+----+----+---+-----+--+
> > > > > > > >                ^
> > > > > > > >                +-- nhoff
> > > > > > > >                    proto = TPID
> > > > > > >
> > > > > > > where ctx->data will point to ?
> > > > > > > These nhoff differences are fine.
> > > > > > > I want to make sure that ctx->data is the same for all.
> > > > > > For with-skb, nhoff would be zero, and ctx->data would point to
> > > > > > TCI/L3.
> > > > > > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > > > > > non-zero (TCI/L3 offset).
> > > > > >
> > > > > > If you want, for skb-less case, when calling BPF program we can do the math
> > > > > > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > > > > > But I'm not sure whether we need to do that; flow dissector is supposed
> > > > > > to look at ctx->data + nhoff, it should not matter what each individual
> > > > > > value is, they only make sense together.
> > > > >
> > > > > My strong preference is to have data to point to L2 in all cases.
> > > > > Semantics of requiring bpf prog to start processing from a tuple
> > > > > (data + nhoff) where both point to random places is very confusing.
> > > >
> > > > Since flow dissection starts at the network layer, I would then
> > > > suggest data always at L3 and nhoff 0.
> > For eth_get_headlen we need to manually parse 802.1q header. And for RFS
> > case as well (unless I'm missing something).
> >
> > > > This can be derived in the same manner as __skb_flow_dissect
> > > > already does if !data, using only skb_network_offset.
> > > >
> > > > From a quick scan, skb_mac_offset should also be valid in all cases
> > > > where the flow dissector is called today, so the other can be computed, too.
> > > >
> > > > But this is less obvious. For instance, tun_get_user calls into the flow
> > > > dissector up to three times (wow) and IFF_TUN has no link layer
> > > > (ARPHRD_NONE). And then there are also fun variable length link layer
> > > > protocols to deal with..
> > >
> > > ahh. ok. Can we guarantee some stable position?
> > I don't think so. Pre RFS ctx->data+nhoff can point to 802.1q header,
> > post RFS it will point to L3. The only thing we can do is to have
> > nhoff=0 (and adjust ctx->data accordingly) when the main bpf
> > flow dissector procedure is called. But that would require bringing
> > this new kernel context (bpf_flow_dissector) into bpf/stable.
> > (And it's not clear what's the benefit, since tail calls would still
> > have to look at that offset).
>
> The flow dissector can be called also before and after tunneling, in
> which case skb_network_offset points to an inner header. Or after
> MPLS, which stumps a flow dissector called earlier as that has no
> information about the encapsulated protocol.
>
> I don't think that there should be a goal that flow dissection starts
> at the same point in the packet for all callsites along the datapath.
> As long as it always starts at a known ETH_P_.. type protocol header
> the program should be able to parse that. That is how the non-BPF
> flow dissector works.
>
> > > Current bpf_flow_dissect_get_header assumes that
> > > ctx->data + ctx->flow_keys->thoff point to IP, right?
> > Yes, mostly, except that if skb->protocol is 802.1q/ad, it's 802.1q header.
> > And it's only for the "main" call; bpf program adjusts this thoff
> > to make sure that tail calls preserve some sense of progress (so it
> > eventually points to L4 and that's what we export back).
> >
> > > Based on what Stanislav is saying above even that is not a guarantee?
> > > I'm struggling to see how users can wrap their heads around this.
> > > It seems bpf_flow.c will become the only prog that can deal with
> > > this range of possible inputs.
> > >
> > > I propose to start with the doc that describes all cases, where
> > > things point to and how prog is supposed to parse that.
> > Yeah, that is what I was going to propose - add a doc along with the
> > patch series. I don't see how we can make it simple(r) at this point :-(
>
> Does it have to be simpler? A flow dissector should be ready to
> dissect VLAN tags. That's the only complication here?
I don't see how it can be made simpler. That's the context from which
the existing __skb_flow_dissect is called, and that's what we have to
dissect from BPF as well.
We can try to make nhoff 0 when the dissector is called; that's
probably the only simplification we can attempt (but, as I said
previously, it requires bringing new kernel context to bpf/stable and
seems more complicated than necessary).

Let me prepare a series for bpf/stable with a small doc describing the
BPF flow dissector environment. We can continue the discussion from
there :-)

> > I can try to document everything so users don't have to read the
> > kernel code to understand how to write the bpf flow dissector programs.
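
To make the (data, nhoff, proto) cases discussed in this thread concrete,
here is a minimal userspace sketch, not kernel or BPF code. The helper name
l3_offset and its signature are hypothetical and exist only for this
illustration; the constants are the standard EtherType values. It assumes
that, as described in the thread, when proto is 802.1q/ad the bytes at
data + nhoff are the TCI followed by the encapsulated EtherType.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard EtherType values, defined locally for this sketch. */
#define ETH_P_IP     0x0800
#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8

/* Read a big-endian 16-bit field. */
static uint16_t rd_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical helper (illustration only): starting at data + nhoff,
 * skip any 802.1Q/802.1ad tags and return the offset of the L3 header.
 * *proto is updated to the encapsulated EtherType.
 * Returns -1 if the buffer is too short to hold a tag.
 */
static int l3_offset(const uint8_t *data, size_t len, size_t nhoff,
		     uint16_t *proto)
{
	assert(data != NULL && proto != NULL);

	while (*proto == ETH_P_8021Q || *proto == ETH_P_8021AD) {
		/* 2 bytes of TCI, then 2 bytes of encapsulated EtherType. */
		if (nhoff + 4 > len)
			return -1;
		*proto = rd_be16(data + nhoff + 2);
		nhoff += 4;
	}
	return (int)nhoff;
}
```

The same loop covers the skb-less case (data at DMAC, nhoff = 14,
proto = TPID) and the with-skb case (nhoff = 0, data at TCI or L3),
which is the sense in which only data + nhoff together matter.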