Hi Jamal, On 10/5/22 9:04 PM, Jamal Hadi Salim wrote: [...]
Let me see if i can summarize the issue of ownership.. It seems there were two users each with root access and one decided they want to be prio 1 and basically deleted the others programs and added themselves to the top? And of course both want to be prio 1. Am i correct? And this feature basically avoids this problem by virtue of fd ownership.
Yes and no ;) In the specific example I gave there was an application bug that led to this race of one evicting the other, so it was not intentional and also not triggered on all the nodes in the cluster, but aside from the example, the issue is generic one for tc BPF users. Not fd ownership, but ownership of BPF link solves this as it does similarly for other existing BPF infra which is one of the motivations as outlined in patch 2 to align this for tc BPF, too.
IIUC, this is an issue of resource contention. Both users who have root access think they should be prio 1. Kubernetes has no controls for this? For debugging, wouldnt listening to netlink events have caught this? I may be misunderstanding - but if both users took advantage of this feature seems the root cause is still unresolved i.e whoever gets there first becomes the owner of the highest prio?
This is independent of K8s core; system applications for observability, runtime enforcement, networking, etc can be deployed as Pods via kube-system namespace into the cluster and live in the host netns. These are typically developed independently by different groups of people. So it all depends on the use cases these applications solve, e.g. if you try to deploy two /main/ CNI plugins which both want to provide cluster networking, it won't fly and this is also generally understood by cluster operators, but there can be other applications also attaching to tc BPF for more specialized functions (f.e. observing traffic flows, setting EDT tstamp for subsequent fq, etc) and interoperability can be provided to a certain degree with prio settings & unspec combo to continue the pipeline. Netlink events would at best only allow to observe the rig being pulled from underneath us, but not prevent it compared to tc BPF links, and given the rise of BPF projects we see in K8s space, it's becoming more crucial to avoid accidental outage just from deploying a new Pod into a running cluster given tc BPF layer becomes more occupied.
Other comments on just this patch (I will pay attention in detail later): My two qualms: 1) Was bastardizing all things TC_ACT_XXX necessary? Maybe you could create #define somewhere visible which refers to the TC_ACT_XXX?
Optional as mentioned in the other thread. It was suggested having enums which become visible via vmlinux BTF as opposed to defines, so my thought was to lower barrier for new developers by making the naming and supported subset more obvious similar/closer to XDP case. I didn't want to pull in new header, but I can move it to pkt_cls.h.
2) Why is xtc_run before tc_run()?
It needs to be first in the list because its the only hook point that has an 'ownership' model in tc BPF layer. If its first we can unequivocally know its owner and ensure its never skipped/bypassed/removed by another BPF program either intentionally or due to users bugs/errors. If we put it after other hooks like cls_bpf we loose the statement because those hooks might 'steal', remove, alter the skb before the BPF link ones are executed. Other option is to make this completely flexible, to the point that Stan made, that is, tcf_classify() is just callback from the array at a fixed position and it's completely up to the user where to add from this layer, but we went with former approach. Thanks, Daniel