Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes: > On 3/5/20 11:50 PM, Alexei Starovoitov wrote: >> On Thu, Mar 05, 2020 at 11:34:18PM +0100, Daniel Borkmann wrote: >>> On 3/5/20 5:34 PM, Alexei Starovoitov wrote: >>>> On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote: >>>>> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes: >>>>>> On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote: >>> [...] >>>>> Anyway, what I was trying to express: >>>>> >>>>>> Still that doesn't mean that pinned link is 'immutable'. >>>>> >>>>> I don't mean 'immutable' in the sense that it cannot be removed ever. >>>>> Just that we may end up in a situation where an application can see a >>>>> netdev with an XDP program attached, has the right privileges to modify >>>>> it, but can't because it can't find the pinned bpf_link. Right? Or am I >>>>> misunderstanding your proposal? >>>>> >>>>> Amending my example from before, this could happen by: >>>>> >>>>> 1. Someone attaches a program to eth0, and pins the bpf_link to >>>>> /sys/fs/bpf/myprog >>>>> >>>>> 2. eth0 is moved to a different namespace which mounts a new sysfs at >>>>> /sys >>>>> >>>>> 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so >>>>> xdp-loader can't get access to the original bpf_link; but the XDP >>>>> program is still attached to eth0. >>>> >>>> The key to decide is whether moving netdev across netns should be allowed >>>> when xdp attached. I think it should be denied. Even when legacy xdp >>>> program is attached, since it will confuse user space managing part. >>> >>> There are perfectly valid use cases where this is done already today (minus >>> bpf_link), for example, consider an orchestrator that is setting up the BPF >>> program on the device, moving to the newly created application pod during >>> the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/ >>> mount instance and if unprivileged cannot remove the BPF prog from the dev >>> either. We do something like this in case of ipvlan, meaning, we attach a >>> rootlet prog that calls into single slot of a tail call map, move it to the >>> application pod, and only out of Cilium's own pod and it's pod-local bpf fs >>> instance we manage the pinned tail call map to update the main programs in >>> that single slot w/o having to switch any netns later on. >> >> Right. You mentioned this use case before, but I managed to forget about it. >> Totally makes sense for prog to stay attached to netdev when it's moved. >> I think pod manager would also prefer that pod is not able to replace >> xdp prog from inside the container. It sounds to me that steps 1,2,3 above >> is exactly the desired behavior. Otherwise what stops some application >> that started in a pod to override it? > > Generally, yes, and it shouldn't need to care nor see what is happening in > /sys/fs/bpf/ from the orchestrator at least (or could potentially have its > own private mount under /sys/fs/bpf/ or elsewhere). Ideally, the behavior > should be that orchestrator does all the setup out of its own namespace, > then moves everything over to the newly created target namespace and e.g. > only if the pod has the capable(cap_sys_admin) permissions, it could mess > around with anything attached there, or via similar model as done in [0] > when there is a master device. Yup, I can see how this can be a reasonable use case where you *would* want the locking. However, my concern is that there should be a way for an admin to recover from this (say, if it happens by mistake, or a misbehaving application). Otherwise, I fear we'll end up with support cases where the only answer is "try rebooting", which is obviously not ideal. > Last time I looked, there is a down/up cycle on the dev upon netns > migration and it flushes e.g. attached qdiscs afaik, so there are > limitations that still need to be addressed. Not sure atm if same is > happening to XDP right now. XDP programs will stay attached. devmaps (for redirect) have a notifier that will remove devices when they move out of a namespace. Not sure if there are any other issues with netns moves somewhere. > In this regards veth devices are a bit nicer to work with since > everything can be attached on hostns ingress w/o needing to worry on > the peer dev in the pod's netns. Presumably the XDP EGRESS hook that David Ahern is working on will make this doable for XDP on veth as well? -Toke