Re: [LSF/MM/BPF TOPIC] block namespaces

Hannes Reinecke <hare@xxxxxxx> · Thu, 10 Jun 2021 17:05:07 +0200

On 6/10/21 4:29 PM, James Bottomley wrote:
> On Thu, 2021-06-10 at 07:49 +0200, Hannes Reinecke wrote:
>> On 6/9/21 8:36 PM, James Bottomley wrote:
>>> On Thu, 2021-05-27 at 10:01 +0200, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> I guess it's time to tick off yet another item on my long-term
>>>> to-do list:
>>>>
>>>> Block namespaces
>>>> ----------------
>>>>
>>>> Idea is similar to what network already does: allowing each user
>>>> namespace to have a different 'view' on the existing block
>>>> devices.  EG if the admin creates a ramdisk in one namespace this
>>>> device should not be visible to other namespaces.  But for me the
>>>> most important use-case would be qemu; currently the devices need
>>>> to be set up in the host, even though the host has no business
>>>> touching it as they really belong to the qemu instance.  This
>>>> is causing quite some irritation eg when this device has LVM or
>>>> MD metadata and udev is trying to activate it on the host.
>>>
>>> I suppose the first question is "why block only?"  There are
>>> several existing device namespace proposals which would be more
>>> generic.
>>>
>>
>> Well; I'm more of a storage person, and do know the needs and 
>> shortcomings in that area. Less well so in other areas...
> 
> OK, but this should work for all devices, just like the device cgroup
> if it's going to be an adjunct to it.
> 
>>>> Overall plan is to restrict views of '/dev', '/sys/dev/block' and
>>>> '/sys/block' to only present the devices 'visible' for this
>>>> namespace.
>>>
>>> We actually already have a devices cgroup that does some of this:
>>>
>>> https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt
>>>
>> I know. But this essentially is a filter on '/dev' only, and needs to
>> be configured. Which makes it very unwieldy to use.
>> And the contents of sysfs are not modified, so there's a mismatch 
>> between contents in /dev and /sys.
>> Which might cause issues with monitoring tools.
> 
> Firstly, since it does part of what you want, we at least need to
> understand why you think it can't be enhanced to do everything.
> 
> The /sys problem has been discussed many times.  GregKH really doesn't
> like the idea of filtering /sys (and most container people agree), so
> the options available seem to be don't mount /sys in a container or
> emulate it via fuse.  If you pick either does the device cgroup now
> work for you?
> 

... begging the question why he relented for network namespaces.
They _do_ modify sysfs, and that via generic hooks (ie making struct
class namespace-aware).
And as this is a public interface I don't see why I can't be using it...
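
To make the /dev versus /sys mismatch mentioned above concrete, here is
a minimal Python sketch. It is only an illustration: the helper name
`sysfs_without_devnode` is made up, and it assumes that an entry
`/sys/block/<name>` normally corresponds to a node `/dev/<name>` (which
ignores sysfs names containing '!').

```python
import os


def sysfs_without_devnode(sys_block_dir, dev_dir):
    """Return names listed in sys_block_dir that have no node in dev_dir.

    Hypothetical helper: assumes <sys_block_dir>/<name> maps to
    <dev_dir>/<name>, which is a simplification.
    """
    names = sorted(os.listdir(sys_block_dir))
    return [n for n in names
            if not os.path.exists(os.path.join(dev_dir, n))]
```

Called as `sysfs_without_devnode("/sys/block", "/dev")` inside a
container whose /dev is filtered but whose sysfs is not, this would list
the devices a monitoring tool sees in sysfs but cannot open.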

>>> However, visibility isn't the only problem, for direct passthrough
>>> there's also uevent handling and people have even asked about
>>> module loading.
>>>
>> I am aware, and that's another reason why device cgroup doesn't cut
>> it.
> 
> Christian Brauner is looking at this ... apparently Ubuntu has some
> thumb drive inside lxc container use case that needs it.
> 

Guess with whom I'm discussing how to implement this.

>>>>   Initially the drivers would keep their global enumeration, but
>>>> plan is to make the drivers namespace-aware, too, such that each
>>>> namespace could have its own driver-specific device enumeration.
>>>
>>> I really wouldn't do this.  Namespace/Cgroup separation should be
>>> kept as high as possible.  If it leaks into the drivers it will
>>> become unmaintainable.  Why do you think you need the drivers to be
>>> aware?  If it's just enumeration, that should all be doable with
>>> the visibility driver unless you want to do things like compact
>>> numbering?
>>>
>> Which is precisely why I mentioned device modifications.
>> On a generic level we can influence the visibility of devices in 
>> relation to namespaces, we cannot influence the devices themselves.
>> This will lead to namespaces seeing disjunct device numbers (ie 8:0
>> and 8:8 on ns 1, 8:4 on ns 2). Not that I think that will be an
>> issue, but certainly a change in behaviour.
> 
> Well, not necessarily, the pid namespace is an example of a remapping
> namespace.  The same thing could be done for device numbering, but
> there really needs to be a compelling case.  Given our use of hotplug,
> why would any tool assume compact numbering?  And if you don't need
> compact numbering there's no need to bother with remapping.
> 

Which is my assumption, too. I have just mentioned it here for
completeness' sake. I'm not planning to implement it for now.
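
As a rough illustration of the two options, here is a toy Python model
of visibility-only filtering versus pid-namespace-style remapping. The
class names `VisibilityNs` and `RemappingNs` are hypothetical and
nothing like them exists in the kernel; the sketch only shows why plain
filtering yields disjunct numbers while remapping would yield compact
ones.

```python
# Toy model only: none of these names exist in the kernel.

class VisibilityNs:
    """Namespace that merely filters the global device list."""

    def __init__(self):
        self._visible = set()

    def attach(self, dev):
        # dev is a global (major, minor) tuple.
        self._visible.add(dev)

    def devices(self):
        # Devices keep their global numbers, so gaps are possible.
        return sorted(self._visible)


class RemappingNs:
    """Namespace that hands out its own minors, as pid namespaces do for pids."""

    def __init__(self):
        self._local_to_global = {}
        self._next_minor = {}

    def attach(self, dev):
        # Allocate the next free local minor for this major.
        major, _ = dev
        minor = self._next_minor.get(major, 0)
        self._next_minor[major] = minor + 1
        local = (major, minor)
        self._local_to_global[local] = dev
        return local

    def devices(self):
        return sorted(self._local_to_global)

    def to_global(self, local):
        return self._local_to_global[local]
```

With the numbers from above, attaching 8:0 and 8:8 to a `VisibilityNs`
leaves them as 8:0 and 8:8, whereas a `RemappingNs` would present them
as 8:0 and 8:1 and keep a table to translate back.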

>>>> Goal of this topic is to get a consensus on whether block
>>>> namespaces are a feature which would find interest, and also to
>>>> discuss some design details here:
>>>> - Only in certain cases can a namespace be assigned (eg by calling
>>>>    'modprobe', starting iscsiadm, or calling nvme-cli); how do we
>>>>    handle devices for which no namespace can be identified?
>>>> - Shall we allow for different device enumeration per namespace?
>>>> - Into which level should we go with hiding sysfs structures?
>>>>    Is blanking out the higher-level interfaces in /dev and
>>>>    /sys/block enough?
>>>
>>> First question is does the device cgroup do enough for you and if
>>> not what's missing?
>>>
>> See above. sysfs modifications and uevent filtering are missing.
>> The infrastructure for that is already in place thanks to network
>> namespaces, we 'just' need to make use of it.
>> Additional drawback is the manual configuration of device-cgroup.
> 
> OK, so still, why not fix or enhance the device cgroup?  The current
> proposal seems to want to duplicate it as a namespace.
> 
Primarily because device-cgroup is really limited in its functionality,
and the interface into the device core is, quite frankly, horrible.
(Implementing it under security/device_cgroup.c, and having direct
function calls from fs/block_dev.c and fs/namei.c?
hch will yell at me if I attempt that.)

Re-implementing that as a namespace is the cleaner solution.
Actually, it should be possible to re-implement device-cgroup on top
of the block namespaces; for that we'll have to expand it to cover all
devices, but you're arguing for that anyway.
So, let's see.
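
A hedged sketch of what such layering could look like, modelled in
Python. The entry format ('type major:minor access', with 'a', 'b' or
'c' as type, '*' as wildcard, and 'r', 'w', 'm' as access) follows the
cgroup-v1 devices documentation; the function names and the idea of
checking a namespace's visible set first are assumptions made purely
for illustration.

```python
def parse_rule(rule):
    """Parse a devices.allow style entry such as 'b 8:* rm'."""
    dev_type, numbers, access = rule.split()
    major, minor = numbers.split(":")
    return dev_type, major, minor, set(access)


def rule_matches(rule, dev_type, major, minor, access):
    """Check one entry against a device and the requested access."""
    r_type, r_major, r_minor, r_access = parse_rule(rule)
    if r_type not in ("a", dev_type):
        return False
    if r_major != "*" and int(r_major) != major:
        return False
    if r_minor != "*" and int(r_minor) != minor:
        return False
    # Every requested access letter must be granted by the entry.
    return set(access) <= r_access


def allowed(visible, rules, dev_type, major, minor, access):
    """Hypothetical layering: namespace visibility first, then the rules."""
    if (dev_type, major, minor) not in visible:
        return False
    return any(rule_matches(r, dev_type, major, minor, access)
               for r in rules)
```

In this model a device that is not visible in the namespace is rejected
before any rule is consulted, so the rules only ever narrow what the
namespace already exposes.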

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@xxxxxxx			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)