Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes: > On 6/15/21 1:54 PM, Toke Høiland-Jørgensen wrote: >> Cong Wang <xiyou.wangcong@xxxxxxxxx> writes: > [...] >>>>> I offer two different views here: >>>>> >>>>> 1. If you view a TC filter as an instance as a netdev/qdisc/action, they >>>>> are no different from this perspective. Maybe the fact that a TC filter >>>>> resides in a qdisc makes a slight difference here, but like I mentioned, it >>>>> actually makes sense to let TC filters be standalone, qdisc's just have to >>>>> bind with them, like how we bind TC filters with standalone TC actions. >>>> >>>> You propose something different below IIUC, but I explained why I'm wary of >>>> these unbound filters. They seem to add a step to classifier setup for no real >>>> benefit to the user (except keeping track of one more object and cleaning it >>>> up with the link when done). >>> >>> I am not even sure if unbound filters help your case at all, making >>> them unbound merely changes their residence, not ownership. >>> You are trying to pass the ownership from TC to bpf_link, which >>> is what I am against. >> >> So what do you propose instead? >> >> bpf_link is solving a specific problem: ensuring automatic cleanup of >> kernel resources held by a userspace application with a BPF component. >> Not all applications work this way, but for the ones that do it's very >> useful. But if the TC filter stays around after bpf_link detaches, that >> kinda defeats the point of the automatic cleanup. >> >> So I don't really see any way around transferring ownership somehow. >> Unless you have some other idea that I'm missing? > > Just to keep on brainstorming here, I wanted to bring back Alexei's earlier quote: > > > I think it makes sense to create these objects as part of establishing bpf_link. > > ingress qdisc is a fake qdisc anyway. > > If we could go back in time I would argue that its existence doesn't > > need to be shown in iproute2. It's an object that serves no purpose > > other than attaching filters to it. It doesn't do any queuing unlike > > real qdiscs. > > It's an artifact of old choices. Old doesn't mean good. > > The kernel is full of such quirks and oddities. New api-s shouldn't > > blindly follow them. > > tc qdisc add dev eth0 clsact > > is a useless command with nop effect. > > The whole bpf_link in this context feels somewhat awkward because both are two > different worlds, one accessible via netlink with its own lifetime etc, the other > one tied to fds and bpf syscall. Back in the days we did the cls_bpf integration > since it felt the most natural at that time and it had support for both the ingress > and egress side, along with the direct action support which was added later to have > a proper fast path for BPF. One thing that I personally never liked is that later > on tc sadly became a complex, quirky dumping ground for all the nic hw offloads (I > guess mainly driven from ovs side) for which I have a hard time convincing myself > that this is used at scale in production. Stuff like af699626ee26 just to pick one > which annoyingly also adds to the fast path given distros will just compile in most > of these things (like NET_TC_SKB_EXT)... what if such bpf_link object is not tied > at all to cls_bpf or cls_act qdisc, and instead would implement the tcf_classify_ > {egress,ingress}() as-is in that sense, similar like the bpf_lsm hooks. Meaning, > you could run existing tc BPF prog without any modifications and without additional > extra overhead (no need to walk the clsact qdisc and then again into the cls_bpf > one). These tc BPF programs would be managed only from bpf() via tc bpf_link api, > and are otherwise not bothering to classic tc command (though they could be dumped > there as well for sake of visibility, though bpftool would be fitting too). However, > if there is something attached from classic tc side, it would also go into the old > style tcf_classify_ingress() implementation and walk whatever is there so that nothing > existing breaks (same as when no bpf_link would be present so that there is no extra > overhead). This would also allow for a migration path of multi prog from cls_bpf to > this new implementation. Details still tbd, but I would much rather like such an > approach than the current discussed one, and it would also fit better given we don't > run into this current mismatch of both worlds. So this would entail adding a separate list of BPF programs and run through those at the start of sch_handle_{egress,ingress}() I suppose? And that list of filters would only contain bpf_link-attached BPF programs, sorted by priority like TC filters? And return codes of TC_ACT_OK or TC_ACT_RECLASSIFY would continue through to tcf_classify_{egress,ingress}()? I suppose that could work; we could even stick the second filter list in struct mini_Qdisc and have clsact and bpf_link cooperate on managing that, no? That way it would also be easy to dump the BPF filters via netlink: I do think that will be the least surprising thing to do (so people can at least see there's something there with existing tools). The overhead would be a single extra branch when only one of clsact or bpf_link is in use (to check if the other list of filters is set); that's probably acceptable at this level... -Toke