On 8 Feb 2022, at 8:45, Trond Myklebust wrote:

> On Tue, 2022-02-08 at 06:32 -0500, Benjamin Coddington wrote:
>> On 7 Feb 2022, at 18:59, Chuck Lever III wrote:
>>
>>>> On Feb 7, 2022, at 2:38 PM, Trond Myklebust
>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, 2022-02-07 at 15:49 +0000, Chuck Lever III wrote:
>>>>>
>>>>>> On Feb 7, 2022, at 9:05 AM, Benjamin Coddington
>>>>>> <bcodding@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 5 Feb 2022, at 14:50, Benjamin Coddington wrote:
>>>>>>
>>>>>>> On 5 Feb 2022, at 13:24, Trond Myklebust wrote:
>>>>>>>
>>>>>>>> On Sat, 2022-02-05 at 10:03 -0500, Benjamin Coddington wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Is anyone using a udev(-like) implementation with
>>>>>>>>> NETLINK_LISTEN_ALL_NSID? It looks like that is at least
>>>>>>>>> necessary to allow the init namespaced udev to receive
>>>>>>>>> notifications on /sys/fs/nfs/net/nfs_client/identifier,
>>>>>>>>> which would be a pre-req to automatically uniquify in
>>>>>>>>> containers.
>>>>>>>>>
>>>>>>>>> I'm interested since it will inform whether I need to send
>>>>>>>>> patches to systemd's udev, and potentially open the can of
>>>>>>>>> worms over there. Yet it's not yet clear to me how an init
>>>>>>>>> namespaced udev process can write to a netns sysfs path.
>>>>>>>>>
>>>>>>>>> Another option might be to create yet another daemon/tool
>>>>>>>>> that would listen specifically for these notifications. Ugh.
>>>>>>>>>
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>> I don't understand. Why do you need a new daemon/tool?
>>>>>>
>>>>>> Because what we've got only works for the init namespace.
>>>>>>
>>>>>> Udev won't get kobject notifications because it's not using
>>>>>> NETLINK_LISTEN_ALL_NSID.
>>>>>>
>>>>>> We need to figure out if we want:
>>>>>>
>>>>>> 1) the init namespace udevd to handle all client_id uniquifiers
>>>>>> 2) we expect network namespaces to run their own udevd
>>>>>> 3) or both.
>>>>>>
>>>>>> I think 2 violates "least surprise", and 3 might not be
>>>>>> something anyone ever wants. If they do, we can fix it at that
>>>>>> point.
>>>>>>
>>>>>> So to make 1 work, we can try to change udevd, or maybe just
>>>>>> hacking about with nfs_netns_object_child_ns_type will be
>>>>>> sufficient.
>>>>>
>>>>> I agree that 1 seems like the preferred approach, though I don't
>>>>> have a technical suggestion at this point.
>>>>
>>>> I strongly disagree. (1) requires the init namespace to have
>>>> intimate knowledge of container internals.
>>
>> Not really, we're just distinguishing NFS clients in containers
>> from NFS clients on the host. That doesn't require intimate
>> knowledge, only a mechanism to create a unique value per-container.
>>
>>>> Why do we need to make that a requirement? That violates the
>>>> expectation that containers are stateless by default, and also
>>>> the expectation that they operate independently of the
>>>> environment.
>>
>> I'm not familiar with the expectation that containers are stateless
>> by default, or that they operate independently of the environment.
>
> Put differently: do you expect QEMU/KVM and VMware ESX to have to
> know a priori that a VM is going to use NFSv4, and force them to
> have to modify the VM state accordingly? No, of course not. So why
> do you think this is a good idea for containers?
Well, I don't think /that's/ a good idea, no, but I don't think the
comparison is valid. I wouldn't equate containers with VMs when it
comes to configuration or state, because VMs attempt to create a
nearly isolated processing environment, while containers or
namespaces are a complete mish-mash of objects, state, and paradigms.
A lot of what happens in a particular set of namespaces can affect
objects in init, too. The immediate example is the very problem we're
trying to fix: NFS clients in a netns can disrupt/reclaim state from
the init namespace client.

> This is exactly the problem with the keyring upcall mechanism, and
> why it is completely useless on a modern system. It relies on the
> top level knowing what the containers are doing and how they are
> configured.

We're actually talking over this problem while working on TLS, and I
agree that keyrings need changes to allow userspace callouts to be
"routed", and that configuration must come from within the
containers. And lacking a container taking responsibility for it,
it's up to the host to do something sane.

> Imagine if you want to nest containers (yes, people do that - just
> Google "nested docker containers"). Your top level process would
> have to know not just how the first level of containers is
> configured (network details, user mappings, ...), but also details
> about how the child containers, which it is not directly managing,
> are configured. It's just not practical.

Oh yeah, I know all about it. It's quite a mess, and every subsystem
that has to account for all of this does it a little differently.

>> Can't we just uniquify the namespaced NFS client ourselves, while
>> still exposing /sys/fs/nfs/net/nfs_client/identifier within the
>> namespace? That way, if someone wants to run udev or use their own
>> method of persistent id, it's available to them within the
>> container. Then we can move forward, because the problem of
>> distinguishing clients between the host and netns is automagically
>> solved.
>
> That could be done.

Ok, I'm eyeballing a sha1 of the init namespace uniquifier and
peernet2id_alloc(new_net, init_net)... but that means the NFS client
would grow a dependency on CRYPTO and CRYPTO_SHA1. hm.

Ben
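P.S. To make that last bit concrete, here's roughly what I'm
picturing. Both fragments are untested sketches;
open_uevent_socket() and nfs_netns_derive_identifier() are names I
just made up, not anything in a tree today.

The udevd change for option 1 would amount to one extra socket option
on its uevent socket (which requires CAP_NET_BROADCAST):

#include <sys/socket.h>
#include <linux/netlink.h>

/* Userspace sketch: a uevent socket that also sees events from
 * other network namespaces.  Error handling omitted. */
static int open_uevent_socket(void)
{
        struct sockaddr_nl snl = {
                .nl_family = AF_NETLINK,
                .nl_groups = 1,    /* kernel uevent multicast group */
        };
        int one = 1;
        int fd;

        fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC,
                    NETLINK_KOBJECT_UEVENT);
        bind(fd, (struct sockaddr *)&snl, sizeof(snl));

        /* The piece udevd is missing today: also receive
         * notifications from namespaces that have an nsid assigned. */
        setsockopt(fd, SOL_NETLINK, NETLINK_LISTEN_ALL_NSID,
                   &one, sizeof(one));
        return fd;
}

And the in-kernel uniquifier would hash the init namespace identifier
together with the netns id (though I think I have the arguments
backwards above: we want the id that init_net assigns to the
container's netns, since that's the per-container unique value).
This is where the CRYPTO / CRYPTO_SHA1 dependency comes in:

#include <crypto/hash.h>
#include <crypto/sha1.h>
#include <net/net_namespace.h>

/* Kernel-side sketch: derive a stable per-netns identifier by
 * hashing the init namespace uniquifier with the nsid that init_net
 * assigns to this netns. */
static int nfs_netns_derive_identifier(struct net *net,
                                const u8 *init_id, unsigned int len,
                                u8 digest[SHA1_DIGEST_SIZE])
{
        struct crypto_shash *tfm;
        int nsid, err;

        tfm = crypto_alloc_shash("sha1", 0, 0);
        if (IS_ERR(tfm))
                return PTR_ERR(tfm);

        nsid = peernet2id_alloc(&init_net, net, GFP_KERNEL);

        {
                SHASH_DESC_ON_STACK(desc, tfm);

                desc->tfm = tfm;
                err = crypto_shash_init(desc) ?:
                      crypto_shash_update(desc, init_id, len) ?:
                      crypto_shash_update(desc, (u8 *)&nsid,
                                          sizeof(nsid)) ?:
                      crypto_shash_final(desc, digest);
        }
        crypto_free_shash(tfm);
        return err;
}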