On 6/10/21 4:29 PM, James Bottomley wrote:
> On Thu, 2021-06-10 at 07:49 +0200, Hannes Reinecke wrote:
>> On 6/9/21 8:36 PM, James Bottomley wrote:
>>> On Thu, 2021-05-27 at 10:01 +0200, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> I guess it's time to tick off yet another item on my long-term
>>>> to-do list:
>>>>
>>>> Block namespaces
>>>> ----------------
>>>>
>>>> Idea is similar to what network already does: allowing each user
>>>> namespace to have a different 'view' on the existing block
>>>> devices. EG if the admin creates a ramdisk in one namespace this
>>>> device should not be visible to other namespaces. But for me the
>>>> most important use-case would be qemu; currently the devices need
>>>> to be set up in the host, even though the host has no business
>>>> touching it as they really belong to the qemu instance. This
>>>> is causing quite some irritation eg when this device has LVM or
>>>> MD metadata and udev is trying to activate it on the host.
>>>
>>> I suppose the first question is "why block only?" There are
>>> several existing device namespace proposals which would be more
>>> generic.
>>>
>>
>> Well; I'm more of a storage person, and do know the needs and
>> shortcomings in that area. Less well so in other areas...
>
> OK, but this should work for all devices, just like the device cgroup
> if it's going to be an adjunct to it.
>
>>>> Overall plan is to restrict views of '/dev', '/sys/dev/block' and
>>>> '/sys/block' to only present the devices 'visible' for this
>>>> namespace.
>>>
>>> We actually already have a devices cgroup that does some of this:
>>>
>>> https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt
>>>
>> I know. But this essentially is a filter on '/dev' only, and needs to
>> be configured. Which makes it very unwieldy to use.
>> And the contents of sysfs are not modified, so there's a mismatch
>> between contents in /dev and /sys.
>> Which might cause issues with monitoring tools.
>
> Firstly, since it does part of what you want, we at least need to
> understand why you think it can't be enhanced to do everything.
>
> The /sys problem has been discussed many times. GregKH really doesn't
> like the idea of filtering /sys (and most container people agree), so
> the options available seem to be don't mount /sys in a container or
> emulate it via fuse. If you pick either does the device cgroup now
> work for you?
>
... begging the question why he relented for network namespaces.
They _do_ modify sysfs, and that via generic hooks (ie making struct
class namespace-aware).
And as this is a public interface I don't see why I can't be using it...

>>> However, visibility isn't the only problem, for direct passthrough
>>> there's also uevent handling and people have even asked about
>>> module loading.
>>>
>> I am aware, and that's another reason why device cgroup doesn't cut
>> it.
>
> Christian Brauner is looking at this ... apparently Ubuntu has some
> thumb drive inside lxc container use case that needs it.
>
Guess with whom I'm discussing how to implement this.

>>>> Initially the drivers would keep their global enumeration, but
>>>> plan is to make the drivers namespace-aware, too, such that each
>>>> namespace could have its own driver-specific device enumeration.
>>>
>>> I really wouldn't do this. Namespace/Cgroup separation should be
>>> kept as high as possible. If it leaks into the drivers it will
>>> become unmaintainable. Why do you think you need the drivers to be
>>> aware? If it's just enumeration, that should all be doable with
>>> the visibility driver unless you want to do things like compact
>>> numbering?
>>>
>> Which is precisely why I mentioned device modifications.
>> On a generic level we can influence the visibility of devices in
>> relation to namespaces, we cannot influence the devices themselves.
>> This will lead to namespaces seeing disjunct device numbers (ie 8:0
>> and 8:8 on ns 1, 8:4 on ns 2).
>> Not that I think that will be an
>> issue, but certainly a change in behaviour.
>
> Well, not necessarily, the pid namespace is an example of a remapping
> namespace. The same thing could be done for device numbering, but
> there really needs to be a compelling case. Given our use of hotplug,
> why would any tool assume compact numbering? And if you don't need
> compact numbering there's no need to bother with remapping.
>
Which is my assumption, too. I have just mentioned it here for
completeness' sake. It's not that I'll be implementing it for now.

>>>> Goal of this topic is to get a consensus on whether block
>>>> namespaces are a feature which would find interest, and also to
>>>> discuss some design details here:
>>>> - Only in certain cases can a namespace be assigned (eg by
>>>> calling
>>>> 'modprobe', starting iscsiadm, or calling nvme-cli); how do we
>>>> handle
>>>> devices for which no namespace can be identified?
>>>> - Shall we allow for different device enumeration per namespace?
>>>> - Into which level should we go with hiding sysfs structures?
>>>> Is blanking out the higher-level interfaces in /dev and
>>>> /sys/block enough?
>>>
>>> First question is does the device cgroup do enough for you and if
>>> not what's missing?
>>>
>> See above. sysfs modifications and uevent filtering are missing.
>> This infrastructure for that is already in place thanks to network
>> namespaces, we 'just' need to make use of it.
>> Additional drawback is the manual configuration of device-cgroup.
>
> OK, so still, why not fix or enhance the device cgroup? The current
> proposal seems to want to duplicate it as a namespace.
>
Primarily because device-cgroup is really limited in its functionality,
and the interface into the device core is, quite frankly, horrible.
(Implementing it under security/device_cgroup.c, and having direct
function calls from fs/block_dev.c and fs/namei.c? hch will yell at me
if I attempted that.)
Re-implementing that as a namespace is the cleaner solution.

Actually, it should be possible to re-implement device-cgroup on top of
the block namespaces; for that we'll have to expand it to cover all
devices, but you're arguing for that anyway.

So, let's see.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@xxxxxxx                              +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)