On 8 Feb 2022, at 8:45, Trond Myklebust wrote:

> On Tue, 2022-02-08 at 06:32 -0500, Benjamin Coddington wrote:
>> On 7 Feb 2022, at 18:59, Chuck Lever III wrote:
>>
>>>> On Feb 7, 2022, at 2:38 PM, Trond Myklebust
>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, 2022-02-07 at 15:49 +0000, Chuck Lever III wrote:
>>>>>
>>>>>> On Feb 7, 2022, at 9:05 AM, Benjamin Coddington
>>>>>> <bcodding@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 5 Feb 2022, at 14:50, Benjamin Coddington wrote:
>>>>>>
>>>>>>> On 5 Feb 2022, at 13:24, Trond Myklebust wrote:
>>>>>>>
>>>>>>>> On Sat, 2022-02-05 at 10:03 -0500, Benjamin Coddington wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Is anyone using a udev(-like) implementation with
>>>>>>>>> NETLINK_LISTEN_ALL_NSID? It looks like that is at least
>>>>>>>>> necessary to allow the init namespaced udev to receive
>>>>>>>>> notifications on /sys/fs/nfs/net/nfs_client/identifier,
>>>>>>>>> which would be a pre-req to automatically uniquify in
>>>>>>>>> containers.
>>>>>>>>>
>>>>>>>>> I'm interested since it will inform whether I need to send
>>>>>>>>> patches to systemd's udev, and potentially open the can of
>>>>>>>>> worms over there. Yet it's not yet clear to me how an init
>>>>>>>>> namespaced udev process can write to a netns sysfs path.
>>>>>>>>>
>>>>>>>>> Another option might be to create yet another daemon/tool
>>>>>>>>> that would listen specifically for these notifications. Ugh.
>>>>>>>>>
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>> I don't understand. Why do you need a new daemon/tool?
>>>>>>
>>>>>> Because what we've got only works for the init namespace.
>>>>>>
>>>>>> Udev won't get kobject notifications because it's not using
>>>>>> NETLINK_LISTEN_ALL_NSID.
>>>>>>
>>>>>> We need to figure out if we want:
>>>>>>
>>>>>> 1) the init namespace udevd to handle all client_id uniquifiers
>>>>>> 2) we expect network namespaces to run their own udevd
>>>>>> 3) or both.
>>>>>>
>>>>>> I think 2 violates "least surprise", and 3 might not be
>>>>>> something anyone ever wants. If they do, we can fix it at that
>>>>>> point.
>>>>>>
>>>>>> So to make 1 work, we can try to change udevd, or maybe just
>>>>>> hacking about with nfs_netns_object_child_ns_type will be
>>>>>> sufficient.
>>>>>
>>>>> I agree that 1 seems like the preferred approach, though I don't
>>>>> have a technical suggestion at this point.
>>>>
>>>> I strongly disagree. (1) requires the init namespace to have
>>>> intimate knowledge of container internals.
>>
>> Not really, we're just distinguishing NFS clients in containers
>> from NFS clients on the host. That doesn't require intimate
>> knowledge, only a mechanism to create a unique value per-container.
>>
>>>> Why do we need to make that a requirement? That violates the
>>>> expectation that containers are stateless by default, and also
>>>> the expectation that they operate independently of the
>>>> environment.
>>
>> I'm not familiar with the expectation that containers are stateless
>> by default, or that they operate independently of the environment.
>
> Put differently: do you expect QEMU/KVM and VMware ESX to have to
> know a priori that a VM is going to use NFSv4, and force them to
> have to modify the VM state accordingly? No, of course not. So why
> do you think this is a good idea for containers?
Well, I don't think /that's/ a good idea, no, but I don't think the
comparison is valid. I wouldn't equate containers with VMs when it
comes to configuration or state, because VMs attempt to create a
nearly isolated processing environment, while containers or
namespaces are a complete mish-mash of objects, state, and paradigms.
A lot of what happens in a particular set of namespaces can affect
objects in init, too. The immediate example is the very problem we're
trying to fix: NFS clients in a netns can disrupt/reclaim state from
the init namespace client.

> This is exactly the problem with the keyring upcall mechanism, and
> why it is completely useless on a modern system. It relies on the
> top level knowing what the containers are doing and how they are
> configured.

We're actually talking over this problem while working on TLS, and I
agree that keyrings need changes to allow userspace callouts to be
"routed", and that configuration must come from within the
containers. And lacking a container taking responsibility for it,
it's up to the host to do something sane.

> Imagine if you want to nest containers (yes, people do that - just
> Google "nested docker containers"). Your top level process would
> have to know not just how the first level of containers is
> configured (network details, user mappings, ...), but also details
> about how the child containers, which it is not directly managing,
> are configured. It's just not practical.

Oh yeah, I know all about it. It's quite a mess, and every subsystem
that has to account for all of this does it a little differently.

>> Can't we just uniquify the namespaced NFS client ourselves, while
>> still exposing /sys/fs/nfs/net/nfs_client/identifier within the
>> namespace? That way, if someone wants to run udev or use their own
>> method of persistent id, it's available to them within the
>> container. Then we can move forward, because the problem of
>> distinguishing clients between the host and netns is automagically
>> solved.
>
> That could be done.

Ok, I'm eyeballing a sha1 of the init namespace uniquifier and
peernet2id_alloc(new_net, init_net)... but that means the NFS client
would grow a dependency on CRYPTO and CRYPTO_SHA1. hm.

Ben
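P.S. To make that last bit concrete, here's roughly what I'm
picturing. Both fragments are untested sketches;
open_uevent_socket() and nfs_netns_derive_identifier() are names I
just made up, not anything in a tree today.

The udevd change for option 1 would amount to one extra socket option
on its uevent socket (which requires CAP_NET_BROADCAST):

#include <sys/socket.h>
#include <linux/netlink.h>

/* Userspace sketch: a uevent socket that also sees events from
 * other network namespaces.  Error handling omitted. */
static int open_uevent_socket(void)
{
        struct sockaddr_nl snl = {
                .nl_family = AF_NETLINK,
                .nl_groups = 1,    /* kernel uevent multicast group */
        };
        int one = 1;
        int fd;

        fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC,
                    NETLINK_KOBJECT_UEVENT);
        bind(fd, (struct sockaddr *)&snl, sizeof(snl));

        /* The piece udevd is missing today: also receive
         * notifications from namespaces that have an nsid assigned. */
        setsockopt(fd, SOL_NETLINK, NETLINK_LISTEN_ALL_NSID,
                   &one, sizeof(one));
        return fd;
}

And the in-kernel uniquifier would hash the init namespace identifier
together with the netns id (though I think I have the arguments
backwards above: we want the id that init_net assigns to the
container's netns, since that's the per-container unique value).
This is where the CRYPTO / CRYPTO_SHA1 dependency comes in:

#include <crypto/hash.h>
#include <crypto/sha1.h>
#include <net/net_namespace.h>

/* Kernel-side sketch: derive a stable per-netns identifier by
 * hashing the init namespace uniquifier with the nsid that init_net
 * assigns to this netns. */
static int nfs_netns_derive_identifier(struct net *net,
                                const u8 *init_id, unsigned int len,
                                u8 digest[SHA1_DIGEST_SIZE])
{
        struct crypto_shash *tfm;
        int nsid, err;

        tfm = crypto_alloc_shash("sha1", 0, 0);
        if (IS_ERR(tfm))
                return PTR_ERR(tfm);

        nsid = peernet2id_alloc(&init_net, net, GFP_KERNEL);

        {
                SHASH_DESC_ON_STACK(desc, tfm);

                desc->tfm = tfm;
                err = crypto_shash_init(desc) ?:
                      crypto_shash_update(desc, init_id, len) ?:
                      crypto_shash_update(desc, (u8 *)&nsid,
                                          sizeof(nsid)) ?:
                      crypto_shash_final(desc, digest);
        }
        crypto_free_shash(tfm);
        return err;
}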