On Wed, Oct 02, 2019 at 09:43:49AM -0700, John Fastabend wrote: > Toke Høiland-Jørgensen wrote: > > This series adds support for executing multiple XDP programs on a single > > interface in sequence, through the use of chain calls, as discussed at the Linux > > Plumbers Conference last month: > > > > https://linuxplumbersconf.org/event/4/contributions/460/ > > > > # HIGH-LEVEL IDEA > > > > The basic idea is to express the chain call sequence through a special map type, > > which contains a mapping from a (program, return code) tuple to another program > > to run in next in the sequence. Userspace can populate this map to express > > arbitrary call sequences, and update the sequence by updating or replacing the > > map. > > > > The actual execution of the program sequence is done in bpf_prog_run_xdp(), > > which will lookup the chain sequence map, and if found, will loop through calls > > to BPF_PROG_RUN, looking up the next XDP program in the sequence based on the > > previous program ID and return code. > > > > An XDP chain call map can be installed on an interface by means of a new netlink > > attribute containing an fd pointing to a chain call map. This can be supplied > > along with the XDP prog fd, so that a chain map is always installed together > > with an XDP program. > > > > # PERFORMANCE > > > > I performed a simple performance test to get an initial feel for the overhead of > > the chain call mechanism. This test consists of running only two programs in > > sequence: One that returns XDP_PASS and another that returns XDP_DROP. I then > > measure the drop PPS performance and compare it to a baseline of just a single > > program that only returns XDP_DROP. > > > > For comparison, a test case that uses regular eBPF tail calls to sequence two > > programs together is also included. Finally, because 'perf' showed that the > > hashmap lookup was the largest single source of overhead, I also added a test > > case where I removed the jhash() call from the hashmap code, and just use the > > u32 key directly as an index into the hash bucket structure. > > > > The performance for these different cases is as follows (with retpolines disabled): > > retpolines enabled would also be interesting. > > > > > | Test case | Perf | Add. overhead | Total overhead | > > |---------------------------------+-----------+---------------+----------------| > > | Before patch (XDP DROP program) | 31.0 Mpps | | | > > | After patch (XDP DROP program) | 28.9 Mpps | 2.3 ns | 2.3 ns | > > IMO even 1 Mpps overhead is too much for a feature that is primarily about > ease of use. Sacrificing performance to make userland a bit easier is hard > to justify for me when XDP _is_ singularly about performance. Also that is > nearly 10% overhead which is fairly large. So I think going forward the > performance gab needs to be removed. Fully agree, for the case where this facility is not used, it must have *zero* overhead. This is /one/ map flavor, in future there will be other facilities with different use-cases, but we cannot place them all into the critical fast-path. Given this is BPF, we have the flexibility that this can be hidden behind the scenes by rewriting and therefore only add overhead when used. What I also see as a red flag with this proposal is the fact that it's tied to XDP only because you need to go and hack bpf_prog_run_xdp() all the way to fetch xdp->rxq->dev->xdp_chain_map even though the map/concept itself is rather generic and could be used in various other program types as well. I'm very sure that once there, people would request it. Therefore, better to explore a way where this has no changes to BPF_PROG_RUN() similar to the original tail call work. Thanks, Daniel