Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:

> On 3/6/20 9:31 AM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:
>>> On 3/5/20 11:50 PM, Alexei Starovoitov wrote:
>>>> On Thu, Mar 05, 2020 at 11:34:18PM +0100, Daniel Borkmann wrote:
>>>>> On 3/5/20 5:34 PM, Alexei Starovoitov wrote:
>>>>>> On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:
>>>>>>> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
>>>>>>>> On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
>>>>> [...]
>>>>>>> Anyway, what I was trying to express:
>>>>>>>
>>>>>>>> Still that doesn't mean that pinned link is 'immutable'.
>>>>>>>
>>>>>>> I don't mean 'immutable' in the sense that it cannot be removed ever.
>>>>>>> Just that we may end up in a situation where an application can see a
>>>>>>> netdev with an XDP program attached, has the right privileges to modify
>>>>>>> it, but can't because it can't find the pinned bpf_link. Right? Or am I
>>>>>>> misunderstanding your proposal?
>>>>>>>
>>>>>>> Amending my example from before, this could happen by:
>>>>>>>
>>>>>>> 1. Someone attaches a program to eth0, and pins the bpf_link to
>>>>>>>    /sys/fs/bpf/myprog
>>>>>>>
>>>>>>> 2. eth0 is moved to a different namespace which mounts a new sysfs at
>>>>>>>    /sys
>>>>>>>
>>>>>>> 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
>>>>>>>    xdp-loader can't get access to the original bpf_link; but the XDP
>>>>>>>    program is still attached to eth0.
>>>>>>
>>>>>> The key to decide is whether moving netdev across netns should be allowed
>>>>>> when xdp attached. I think it should be denied. Even when legacy xdp
>>>>>> program is attached, since it will confuse user space managing part.
>>>>>
>>>>> There are perfectly valid use cases where this is done already today (minus
>>>>> bpf_link), for example, consider an orchestrator that is setting up the BPF
>>>>> program on the device, moving to the newly created application pod during
>>>>> the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/
>>>>> mount instance and if unprivileged cannot remove the BPF prog from the dev
>>>>> either. We do something like this in case of ipvlan, meaning, we attach a
>>>>> rootlet prog that calls into single slot of a tail call map, move it to the
>>>>> application pod, and only out of Cilium's own pod and it's pod-local bpf fs
>>>>> instance we manage the pinned tail call map to update the main programs in
>>>>> that single slot w/o having to switch any netns later on.
>>>>
>>>> Right. You mentioned this use case before, but I managed to forget about it.
>>>> Totally makes sense for prog to stay attached to netdev when it's moved.
>>>> I think pod manager would also prefer that pod is not able to replace
>>>> xdp prog from inside the container. It sounds to me that steps 1,2,3 above
>>>> is exactly the desired behavior. Otherwise what stops some application
>>>> that started in a pod to override it?
>>>
>>> Generally, yes, and it shouldn't need to care nor see what is happening in
>>> /sys/fs/bpf/ from the orchestrator at least (or could potentially have its
>>> own private mount under /sys/fs/bpf/ or elsewhere). Ideally, the behavior
>>> should be that orchestrator does all the setup out of its own namespace,
>>> then moves everything over to the newly created target namespace and e.g.
>>> only if the pod has the capable(cap_sys_admin) permissions, it could mess
>>> around with anything attached there, or via similar model as done in [0]
>>> when there is a master device.
>>
>> Yup, I can see how this can be a reasonable use case where you *would*
>> want the locking. However, my concern is that there should be a way for
>> an admin to recover from this (say, if it happens by mistake, or a
>> misbehaving application). Otherwise, I fear we'll end up with support
>> cases where the only answer is "try rebooting", which is obviously not
>> ideal.
>
> I'm not quite sure I follow the concern, if you're an admin and have
> the right permissions, then you should be able to introspect and
> change settings like with anything else there is today.

Well, that's what I want to make sure of :) However, I don't think such
introspection is possible today? Or at least there's no API exposed to
do this, you'll have to go write drgn scripts or something.

But I expect an admin will want a command like 'xdp unload eth0
--yes-i-really-really-mean-this', which would override any locking done
by bpf_link. So how to implement that? It's not enough to learn 'this
bpf_link is pinned at links/id-123 on bpffs', you'll also need to learn
on *which* bpffs, and where to find that mountpoint. So how do you do
that?

Whereas an API that says 'return the bpf_link currently attached to
ifindex X' would sidestep this issue; but then, that is basically the
netlink API we have already, except it doesn't have the bpf_link
abstraction... So why do we need bpf_link?
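To make that concrete, here is a rough, untested sketch of what such a
forced 'xdp unload' would have to build on with the netlink-based
libbpf API we have today (the little tool itself is made up; the libbpf
calls are the existing ones). It can discover and detach the attached
*program* by id, but it gives you no handle at all on a pinned
bpf_link:

/* Hypothetical 'xdp unload <dev>' sketch -- untested, only meant to
 * illustrate the point above. */
#include <stdio.h>
#include <net/if.h>
#include <bpf/libbpf.h>

int main(int argc, char **argv)
{
	__u32 prog_id = 0;
	int ifindex, err;

	if (argc != 2)
		return 1;

	ifindex = if_nametoindex(argv[1]);
	if (!ifindex)
		return 1;

	/* Netlink-based query: which XDP program id (if any) is attached? */
	err = bpf_get_link_xdp_id(ifindex, &prog_id, 0);
	if (err)
		return 1;
	if (!prog_id) {
		printf("no XDP program attached to %s\n", argv[1]);
		return 0;
	}

	/* We can see the prog id, but if the attachment is held by a pinned
	 * bpf_link we have no way of learning which bpffs the pin lives on. */
	printf("prog id %u attached to %s\n", prog_id, argv[1]);

	/* Legacy detach through the same netlink API; with a bpf_link-based
	 * attachment this is exactly the operation that would be refused
	 * unless some kind of admin override exists. */
	err = bpf_set_link_xdp_fd(ifindex, -1, 0);
	return err ? 1 : 0;
}

I.e., the prog id tells the admin *what* is attached, but not who owns
the attachment or where to find the pinned link that holds it.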
>>> Last time I looked, there is a down/up cycle on the dev upon netns
>>> migration and it flushes e.g. attached qdiscs afaik, so there are
>>> limitations that still need to be addressed. Not sure atm if same is
>>> happening to XDP right now.
>>
>> XDP programs will stay attached. devmaps (for redirect) have a notifier
>> that will remove devices when they move out of a namespace. Not sure if
>> there are any other issues with netns moves somewhere.
>>
>>> In this regards veth devices are a bit nicer to work with since
>>> everything can be attached on hostns ingress w/o needing to worry on
>>> the peer dev in the pod's netns.
>>
>> Presumably the XDP EGRESS hook that David Ahern is working on will make
>> this doable for XDP on veth as well?
>
> I'm not sure I see a use-case for XDP egress for Cilium yet, but maybe
> I'm still lacking a clear picture on why one should use it. We
> currently use various layers where we orchestrate our BPF programs
> from the agent. XDP/rx on the phys nic on the one end, BPF sock progs
> attached to cgroups on the other end of the spectrum. The processing
> in between on virtual devices is mainly tc/BPF-based since everything
> is skb based anyway and more flexible also with interaction with the
> rest of the stack. There is also not this pain of having to linearize
> all the skbs, but at least there's a path to tackle that.

I agree that there's not really any reason to use XDP on veth as long
as you'll end up with an skb eventually anyway. The only reason to do
something different I can think of is if you have an application inside
a container using AF_XDP, and you want to carry XDP frames all the way
through to that without ever building an skb. Not really sure this is a
good deployment model, but I kinda suspect that the NFV guys will want
to do this eventually...

-Toke