On Mon, 2021-12-06 at 13:08 +0100, Christian Brauner wrote:
> On Fri, Dec 03, 2021 at 11:37:14AM -0800, Casey Schaufler wrote:
> > On 12/3/2021 10:50 AM, James Bottomley wrote:
> > > On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
> > > > On 12/3/21 12:03, James Bottomley wrote:
> > > > > On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
> > > > > [...]
> > > > > >  static int securityfs_init_fs_context(struct fs_context *fc)
> > > > > >  {
> > > > > > +        int rc;
> > > > > > +
> > > > > > +        if (fc->user_ns->ima_ns->late_fs_init) {
> > > > > > +                rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
> > > > > > +                if (rc)
> > > > > > +                        return rc;
> > > > > > +        }
> > > > > >          fc->ops = &securityfs_context_ops;
> > > > > >          return 0;
> > > > > >  }
> > > > > I know I suggested this, but to get this to work in general,
> > > > > it's going to have to not be specific to IMA, so it's going
> > > > > to have to become something generic like a notifier
> > > > > chain. The other problem is it's only working still by
> > > > > accident:
> > > >
> > > > I had thought about this also but the rationale was:
> > > >
> > > > securityfs is compiled due to CONFIG_IMA_NS and the user
> > > > namespace exists there and that has a pointer now to
> > > > ima_namespace, which can have that callback. I assumed that
> > > > other namespaced subsystems could also be reached then via
> > > > such a callback, but I don't know.
> > >
> > > Well securityfs is supposed to exist for LSMs. At some point
> > > each of those is going to need to be namespaced, which may
> > > eventually be quite a pile of callbacks, which is why I thought
> > > of a notifier.
> >
> > While AppArmor, lockdown and the integrity family use securityfs,
> > SELinux and Smack do not. They have their own independent
> > filesystems. Implementations of namespacing for each of SELinux and
> > Smack have been proposed, but nothing has been adopted.
> > It would be really handy to namespace the infrastructure rather
> > than each individual LSM, but I fear that's a bigger project than
> > anyone will be taking on any time soon. It's likely to encounter
> > many of the same issues that I've been dealing with for module
> > stacking.
>
> The main thing that bothers me is that it uses simple_pin_fs() and
> simple_unpin_fs() which I would try hard to get rid of if possible.
> The existence of this global pinning logic makes namespacing it
> properly more difficult than it needs to be and it creates imho wonky
> semantics where the last unmount doesn't really destroy the
> superblock.

So in the notifier sketch I posted, I got rid of the pinning but only
for the non-root user namespace use case ... which basically means only
for converted consumers of securityfs. The last unmount of securityfs
inside the namespace now does destroy the superblock ... I checked. The
same isn't true for the last unmount of the root namespace, but that
has to be so to keep the current semantics.

> Instead subsequent mounts resurface the same superblock. There
> might be an inherent design reason why this needs to be this way but
> I would advise against these semantics for anything that wants to be
> namespaced. Probably the first securityfs mount in init_user_ns can
> follow these semantics but ones tied to a non-initial user namespace
> should not as the userns can go away. In that case the pinning logic
> seems strange as conceptually the userns pins the securityfs mount as
> evidenced by the fact that we key by it in get_tree_keyed().

Yes, that's basically what I did: pin if ns == &init_user_ns but don't
pin if not. However, I'm still not sure I got the triggers right. We
have to trigger the notifier call (which adds the namespaced file
entries) from context free, because that's the first place the
superblock mount is fully set up ... I can't do it in fill_super
because the mount isn't fully initialized (and the locking prevents
it).
I did manage to get the notifier for teardown triggered from
kill_super, though.

James
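
For illustration only, the behaviour discussed above (pin the superblock
when ns == &init_user_ns, don't pin otherwise, and let consumers add and
remove their entries through a notifier chain) can be modelled in plain
userspace C. This is a rough sketch under stated assumptions, not the
posted patch: every name in it (sfs_super, sfs_mount, sfs_umount,
sfs_register_notifier, example_consumer, init_ns) is hypothetical, and
none of them is a kernel API. It also ignores locking and the question
of which hook the notifier should fire from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum sfs_event { SFS_EVENT_SETUP, SFS_EVENT_TEARDOWN };

struct user_ns { const char *name; };

/* Stands in for init_user_ns in this model. */
static struct user_ns init_ns = { "init" };

struct sfs_super {
    struct user_ns *ns;
    int mounts;      /* number of active mounts */
    bool pinned;     /* true only for the initial namespace */
    bool alive;      /* superblock currently exists */
    int nr_entries;  /* namespaced file entries added by consumers */
};

typedef int (*sfs_notifier_fn)(enum sfs_event ev, struct sfs_super *sb);

#define SFS_MAX_NOTIFIERS 8
static sfs_notifier_fn sfs_chain[SFS_MAX_NOTIFIERS];
static size_t sfs_chain_len;

static int sfs_register_notifier(sfs_notifier_fn fn)
{
    if (sfs_chain_len == SFS_MAX_NOTIFIERS)
        return -1;
    sfs_chain[sfs_chain_len++] = fn;
    return 0;
}

/* Call every registered consumer; stop at the first error. */
static int sfs_call_chain(enum sfs_event ev, struct sfs_super *sb)
{
    for (size_t i = 0; i < sfs_chain_len; i++) {
        int rc = sfs_chain[i](ev, sb);
        if (rc)
            return rc;
    }
    return 0;
}

static int sfs_mount(struct sfs_super *sb, struct user_ns *ns)
{
    int rc;

    if (!sb->alive) {
        /* First mount for this namespace: create the superblock. */
        sb->ns = ns;
        sb->alive = true;
        sb->mounts = 0;
        sb->nr_entries = 0;
        sb->pinned = (ns == &init_ns);  /* pin only for the initial ns */
        rc = sfs_call_chain(SFS_EVENT_SETUP, sb);
        if (rc) {
            sb->alive = false;
            sb->pinned = false;
            return rc;
        }
    }
    sb->mounts++;
    return 0;
}

static void sfs_umount(struct sfs_super *sb)
{
    assert(sb->alive && sb->mounts > 0);
    if (--sb->mounts == 0 && !sb->pinned) {
        /* Last unmount in a non-initial ns destroys the superblock. */
        sfs_call_chain(SFS_EVENT_TEARDOWN, sb);
        sb->alive = false;
    }
}

/* Example consumer standing in for a converted securityfs user. */
static int example_consumer(enum sfs_event ev, struct sfs_super *sb)
{
    sb->nr_entries += (ev == SFS_EVENT_SETUP) ? 1 : -1;
    return 0;
}
```

In this model, mounting and then unmounting with a non-initial
namespace leaves the superblock destroyed and the consumer's entry
removed, while doing the same with init_ns leaves the superblock alive
because it stays pinned.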