Re: v4 clientid uniquifiers in containers/namespaces

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 8 Feb 2022 13:45:53 +0000

On Tue, 2022-02-08 at 06:32 -0500, Benjamin Coddington wrote:
> On 7 Feb 2022, at 18:59, Chuck Lever III wrote:
> 
> > > On Feb 7, 2022, at 2:38 PM, Trond Myklebust
> > > <trondmy@xxxxxxxxxxxxxxx> 
> > > wrote:
> > > 
> > > On Mon, 2022-02-07 at 15:49 +0000, Chuck Lever III wrote:
> > > > 
> > > > 
> > > > > On Feb 7, 2022, at 9:05 AM, Benjamin Coddington
> > > > > <bcodding@xxxxxxxxxx> wrote:
> > > > > 
> > > > > On 5 Feb 2022, at 14:50, Benjamin Coddington wrote:
> > > > > 
> > > > > > On 5 Feb 2022, at 13:24, Trond Myklebust wrote:
> > > > > > 
> > > > > > > On Sat, 2022-02-05 at 10:03 -0500, Benjamin Coddington
> > > > > > > wrote:
> > > > > > > > Hi all,
> > > > > > > > 
> > > > > > > > > Is anyone using a udev(-like) implementation with
> > > > > > > > > NETLINK_LISTEN_ALL_NSID?  It looks like that is at
> > > > > > > > > least necessary to allow the init namespaced udev to
> > > > > > > > > receive notifications on
> > > > > > > > > /sys/fs/nfs/net/nfs_client/identifier, which would be
> > > > > > > > > a pre-req to automatically uniquify in containers.
> > > > > > > > > 
> > > > > > > > > I'm interested since it will inform whether I need to
> > > > > > > > > send patches to systemd's udev, and potentially open
> > > > > > > > > the can of worms over there.  Yet it's not yet clear
> > > > > > > > > to me how an init namespaced udev process can write
> > > > > > > > > to a netns sysfs path.
> > > > > > > > > 
> > > > > > > > > Another option might be to create yet another
> > > > > > > > > daemon/tool that would listen specifically for these
> > > > > > > > > notifications.  Ugh.
> > > > > > > > 
> > > > > > > > Ben
> > > > > > > > 
> > > > > > > 
> > > > > > > I don't understand. Why do you need a new daemon/tool?
> > > > > 
> > > > > Because what we've got only works for the init namespace.
> > > > > 
> > > > > > Udev won't get kobject notifications because it's not using
> > > > > > NETLINK_LISTEN_ALL_NSID.
> > > > > 
> > > > > We need to figure out if we want:
> > > > > 
> > > > > 1) the init namespace udevd to handle all client_id
> > > > > uniquifiers
> > > > > 2) we expect network namespaces to run their own udevd
> > > > > 3) or both.
> > > > > 
> > > > > > I think 2 violates "least surprise", and 3 might not be
> > > > > > something anyone ever wants.  If they do, we can fix it at
> > > > > > that point.
> > > > > 
> > > > > > So to make 1 work, we can try to change udevd, or maybe just
> > > > > > hacking about with nfs_netns_object_child_ns_type will be
> > > > > > sufficient.
> > > > 
> > > > I agree that 1 seems like the preferred approach, though
> > > > I don't have a technical suggestion at this point.
> > > > 
> > > 
> > > I strongly disagree. (1) requires the init namespace to have
> > > intimate knowledge of container internals.
> 
> Not really, we're just distinguishing NFS clients in containers from
> NFS clients on the host.  That doesn't require intimate knowledge,
> only a mechanism to create a unique value per-container.
> 
> > > Why do we need to make that a requirement? That violates the
> > > expectation that containers are stateless by default, and also
> > > the expectation that they operate independently of the
> > > environment.
> 
> I'm not familiar with the expectation that containers are stateless
> by default, or that they operate independently of the environment.
> 

Put differently: do you expect QEMU/KVM and VMware ESX to have to know
a priori that a VM is going to use NFSv4, and force them to have to
modify the VM state accordingly? No, of course not. So why do you think
this is a good idea for containers?

This is exactly the problem with the keyring upcall mechanism, and why
it is completely useless on a modern system. It relies on the top level
knowing what the containers are doing and how they are configured.
Imagine if you want to nest containers (yes, people do that - just
Google "nested docker containers"). Your top level process would have
to know not just how the first level of containers is configured
(network details, user mappings, ...), but also details about how the
child containers, that it is not directly managing, are configured.
It's just not practical.

> > > If you really do want external control over the uuid that is set,
> > > then it should be pretty trivial to do so by using the standard
> > > container tools for manipulating the namespace (e.g. to mount a
> > > file that is under control of the parent as /etc/nfs4-uuid.conf
> > > or whatever).
> 
> We're not looking for external control, just automation.  The NFS
> community has decided that udev is the way to go here, so as long as
> we can get the notifications to /some/ udev process, I feel confident
> we can make all of this transparent.
> 
> The less we have to teach all the container tooling folks, the better
> for us.
> 

Agreed. I'm saying that the udev case also allows for top level control
if you think you need it.
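
For illustration only, here is a minimal sketch of what a helper run
inside the container (for example from a udev rule) might do. It is not
taken from nfs-utils or from any existing tool. The sysfs path is the one
named earlier in this thread; the config file name follows the
"/etc/nfs4-uuid.conf or whatever" example, and the function names are
hypothetical. If the parent has mounted a file at that location its
contents are used, otherwise a random UUID is generated and persisted.

```python
import uuid
from pathlib import Path

# sysfs attribute named in this thread; writing to it needs root and a
# loaded nfs module.
SYSFS_IDENTIFIER = Path("/sys/fs/nfs/net/nfs_client/identifier")
# Hypothetical file name, following the example given in the thread.
UUID_CONF = Path("/etc/nfs4-uuid.conf")


def load_or_create_uniquifier(conf: Path = UUID_CONF) -> str:
    """Return the stored uniquifier, creating and persisting one if absent."""
    try:
        value = conf.read_text().strip()
        if value:
            # A parent-provided (e.g. bind-mounted) file takes precedence.
            return value
    except FileNotFoundError:
        pass
    # No usable file: generate a random uniquifier and keep it for next time.
    value = str(uuid.uuid4())
    conf.write_text(value + "\n")
    return value


def apply_uniquifier(conf: Path = UUID_CONF,
                     sysfs: Path = SYSFS_IDENTIFIER) -> str:
    """Write the uniquifier to the per-netns identifier attribute."""
    value = load_or_create_uniquifier(conf)
    sysfs.write_text(value)
    return value
```

Both paths are parameters so the logic can be exercised against ordinary
files; calling apply_uniquifier() with the defaults only makes sense
inside the network namespace whose client is being uniquified.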

> > > However in most cases that I can think of, if the container is
> > > doing its own NFS mounting, then it is going to have to be set up
> > > with its own nfs-utils, etc, so there is no reason why we can't
> > > also require udev.
> 
> I'm not as confident about this as you are.  Network namespaces are
> pretty useful on their own to create independent network
> configurations or to isolate hardware interfaces.  We've had a few
> surprising cases of customers using them in creative ways.
> 
> There's a bit of a chicken and egg problem with 2, though.  If the
> nfs module is loaded, the kernel notification gets sent as soon as
> you create the namespace.  It's not going to wait for you to move or
> exec udev into that network namespace, and the notification is lost.
> 
> Can't we just uniquify the namespaced NFS client ourselves, while
> still exposing /sys/fs/nfs/net/nfs_client/identifier within the
> namespace?  That way if someone wants to run udev or use their own
> method of persistent id, it's available to them within the container
> so they can.  Then we can move forward because the problem of
> distinguishing clients between the host and netns is automagically
> solved.

That could be done.

> 
> Where we are today is the host NFS client is uniquified, and all the
> netns clients are distinguished from the host, but not each other.
> 
> Ben
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx