On 04/02, Petar Penkov wrote:
> On Mon, Apr 1, 2019 at 1:57 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >
> > Short doc on what BPF flow dissector should expect in the input
> > __sk_buff and flow_keys.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> > ---
> >  .../networking/bpf_flow_dissector.txt | 115 ++++++++++++++++++
> >  1 file changed, 115 insertions(+)
> >  create mode 100644 Documentation/networking/bpf_flow_dissector.txt
> >
> > diff --git a/Documentation/networking/bpf_flow_dissector.txt b/Documentation/networking/bpf_flow_dissector.txt
> > new file mode 100644
> > index 000000000000..513be8e20afb
> > --- /dev/null
> > +++ b/Documentation/networking/bpf_flow_dissector.txt
> > @@ -0,0 +1,115 @@
> > +==================
> > +BPF Flow Dissector
> > +==================
> > +
> > +Overview
> > +========
> > +
> > +Flow dissector is a routine that parses metadata out of the packets. It's
> > +used in the various places in the networking subsystem (RFS, flow hash, etc).
> > +
> > +BPF flow dissector is an attempt to reimplement C-based flow dissector logic
> > +in BPF to gain all the benefits of BPF verifier (namely, limits on the
> > +number of instructions and tail calls).
> > +
> > +API
> > +===
> > +
> > +BPF flow dissector programs operate on an __sk_buff. However, only the
> > +limited set of fields is allowed: data, data_end and flow_keys. flow_keys
> > +is 'struct bpf_flow_keys' and contains flow dissector input and
> > +output arguments.
> > +
> > +The inputs are:
> > + * nhoff - initial offset of the networking header
> > + * thoff - initial offset of the transport header, initialized to nhoff
> > + * n_proto - L3 protocol type, parsed out of L2 header
> > +
> > +Flow dissector BPF program should fill out the rest of the 'struct
> > +bpf_flow_keys' fields. Input arguments nhoff/thoff/n_proto should be also
> > +adjusted accordingly.
> > +
> > +The return code of the BPF program is either BPF_OK to indicate successful
> > +dissection, or BPF_DROP to indicate parsing error.
> I don't think this is actually enforced. I believe the current code
> just checks if the status is BPF_OK or not, rather than BPF_OK,
> BPF_DROP, or neither.

It's not universally enforced, but some codepaths in the kernel look at
the returned value (e.g. skb_get_poff and eth_get_headlen), so it's
better to set the expectations :-)

> > +
> > +__sk_buff->data
> > +===============
> > +
> > +In the VLAN-less case, this is what the initial state of the BPF flow
> > +dissector looks like:
> > ++------+------+------------+-----------+
> > +| DMAC | SMAC | ETHER_TYPE | L3_HEADER |
> > ++------+------+------------+-----------+
> > +                            ^
> > +                            |
> > +                            +-- flow dissector starts here
> > +
> > +skb->data + flow_keys->nhoff point to the first byte of L3_HEADER.
> > +flow_keys->thoff = nhoff
> > +flow_keys->n_proto = ETHER_TYPE
> > +
> > +
> > +In case of VLAN, flow dissector can be called with the two different states.
> > +
> > +Pre-VLAN parsing:
> > ++------+------+------+-----+-----------+-----------+
> > +| DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
> > ++------+------+------+-----+-----------+-----------+
> > +                      ^
> > +                      |
> > +                      +-- flow dissector starts here
> > +
> > +skb->data + flow_keys->nhoff point to the first byte of TCI.
> > +flow_keys->thoff = nhoff
> > +flow_keys->n_proto = TPID
> > +
> > +Please note that TPID can be 802.1AD and, hence, BPF program would
> > +have to parse VLAN information twice for double tagged packets.
> > +
> > +
> > +Post-VLAN parsing:
> > ++------+------+------+-----+-----------+-----------+
> > +| DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
> > ++------+------+------+-----+-----------+-----------+
> > +                                        ^
> > +                                        |
> > +                                        +-- flow dissector starts here
> > +
> > +skb->data + flow_keys->nhoff point to the first byte of L3_HEADER.
> > +flow_keys->thoff = nhoff
> > +flow_keys->n_proto = ETHER_TYPE
> > +
> > +In this case VLAN information has been processed before the flow dissector
> > +and BPF flow dissector is not required to handle it.
> > +
> > +
> > +The takeaway here is as follows: BPF flow dissector program can be called with
> > +the optional VLAN header and should gracefully handle both cases: when single
> > +or double VLAN is present and when it is not present. The same program
> > +can be called for both cases and would have to be written carefully to
> > +handle both cases.
> > +
> > +
> > +Reference Implementation
> > +========================
> > +
> > +See tools/testing/selftests/bpf/progs/bpf_flow.c for the reference
> > +implementation and tools/testing/selftests/bpf/flow_dissector_load.[hc] for
> > +the loader. bpftool can be used to load BPF flow dissector program as well.
> > +
> > +The reference implementation is organized as follows:
> > +* jmp_table map that contains sub-programs for each supported L3 protocol
> > +* _dissect routine - entry point; it does input n_proto parsing and does
> > +  bpf_tail_call to the appropriate L3 handler
> > +
> > +Since BPF at this point doesn't support looping (or any jumping back),
> > +jmp_table is used instead to handle multiple levels of encapsulation (and
> > +IPv6 options).
> > +
> > +
> > +Current Limitations
> > +===================
> > +BPF flow dissector doesn't support exporting all the metadata that in-kernel
> > +C-based implementation can export. Notable example is single VLAN (802.1Q)
> > +and double VLAN (802.1AD) tags. Please refer to the 'struct bpf_flow_keys'
> > +for a set of information that can currently be exported from the BPF context.
> > --
> > 2.21.0.392.gf8f6787159e-goog
> >
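
To illustrate the "handle both cases" takeaway from the doc, here is a
rough userspace C sketch of the VLAN-skipping step. It is not the code
from bpf_flow.c: struct flow_inputs, skip_vlan_tags and MAX_VLAN_TAGS
are hypothetical names made up for this example, n_proto is kept in
host byte order for simplicity (the real 'struct bpf_flow_keys' stores
it in network byte order), and the bounded loop would likely have to be
unrolled in an actual BPF program.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP      0x0800
#define ETH_P_8021Q   0x8100
#define ETH_P_8021AD  0x88A8
#define MAX_VLAN_TAGS 2 /* assumption: at most a double-tagged frame */

/* Hypothetical stand-in for the input fields of struct bpf_flow_keys. */
struct flow_inputs {
	uint16_t nhoff;   /* offset of the next header to parse */
	uint16_t thoff;   /* transport header offset, starts equal to nhoff */
	uint16_t n_proto; /* host byte order here, for simplicity only */
};

static int is_vlan(uint16_t proto)
{
	return proto == ETH_P_8021Q || proto == ETH_P_8021AD;
}

/*
 * Skip up to MAX_VLAN_TAGS VLAN tags, if any are present.
 * Returns 0 on success (think BPF_OK) and -1 on a parse error
 * (think BPF_DROP).
 */
static int skip_vlan_tags(const uint8_t *data, size_t len,
			  struct flow_inputs *keys)
{
	for (int i = 0; i < MAX_VLAN_TAGS; i++) {
		/* Post-VLAN or VLAN-less case: nothing to skip. */
		if (!is_vlan(keys->n_proto))
			break;

		/* Pre-VLAN case: nhoff points at TCI (2 bytes), which is
		 * followed by the encapsulated protocol (2 bytes). */
		if ((size_t)keys->nhoff + 4 > len)
			return -1;

		keys->n_proto = (uint16_t)(data[keys->nhoff + 2] << 8 |
					   data[keys->nhoff + 3]);
		keys->nhoff += 4;
	}

	/* More tags than this sketch is willing to handle. */
	if (is_vlan(keys->n_proto))
		return -1;

	/* nhoff now points at the L3 header in every case. */
	keys->thoff = keys->nhoff;
	return 0;
}
```

With a single 802.1Q tag and nhoff = 14, the sketch advances nhoff to
18 and sets n_proto to the encapsulated EtherType. When called in the
post-VLAN state it leaves nhoff and n_proto untouched.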