On Wed, Jan 22, 2025 at 10:18:17PM -0600, Mike Baynton wrote:
> Hi,
> I've been eagerly awaiting the arrival of lowerdir+ by file handle, as
> it looks likely to be well-suited to simplifying the task a container
> runtime must take on in order to provide a set of properly idmapped
> lower layers for a user namespaced container. Currently in containerd,
> this is done by creating bindmounts for each required lower layer in
> order to apply idmapping to them. Each of these bindmounts must be
> briefly attached to some path-resolvable mountpoint before the overlay
> is created, which seems less than ideal and is contributing to some
> cleanup headaches e.g. when other software that may be present jumps on
> the new mount and starts security scanning it or whatnot.
>
> In order to better isolate the idmap bindmounts I was hoping to do
> something like:
>
> ovl_ctx = fsopen("overlay", FSOPEN_CLOEXEC);
>
> opfd = open_tree(-1, "/path/to/unmapped/layer",
> OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
> mount_setattr(opfd, "", AT_EMPTY_PATH, /* attrs to set a userns_fd */);
> dfd = openat(opfd, ".", O_DIRECTORY, mode);

Unless I've forgotten a detail, the openat() shouldn't be needed:
specifying layers via O_PATH file descriptors should just work.

>
> fsconfig(ovl_ctx, FSCONFIG_SET_FD, "lowerdir+", dfd);
> // ...other ovl_ctx fsconfigs...
> fsconfig(ovl_ctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
>
> ...and this *almost* works in 6.13. The result of something like this is
> that the FSCONFIG_CMD_CREATE fails, with "overlayfs: failed to clone
> lowerpath" in dmesg. Investigating a bit, the cause is that the mount
> represented by opfd is placed in a newly allocated mount namespace
> containing only itself. When overlayfs then tries to make its own
> private copy of that mount, it uses clone_private_mount() which subjects
> any source mount to a test that its mount namespace is the task's mount
> namespace. If I just remove this one check, then userspace code like the
> above seems to happily work.
>
> I've tried various things in userspace to move opfd to the task's mount
> namespace _without_ also attaching it to a directory tree somewhere as
> we do today, but have come up short on a way to do that.
>
> Assuming what I'm trying to do is in line with the intended use case for
> these new(er) APIs, I'm wondering if some relatively small kernel change
> might be the best way to enable this? Perhaps clone_private_mount(),
> which seems to only be used in-tree by overlayfs, could also tolerate
> mounts in "anonymous" (when created by alloc_mnt_ns) mount namespaces or
> something?

This should be doable, but it requires some changes to
clone_private_mount(). I just sent an RFC patchset. Note that the
patchset is entirely untested as of now.
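
For reference, here is a minimal sketch of the userspace side with the
openat() dropped and the O_PATH fd from open_tree() handed to fsconfig()
directly. It assumes kernel headers recent enough to provide struct
mount_attr and the fsconfig constants in <linux/mount.h>, and it uses raw
syscall() rather than libc wrappers. add_idmapped_lower() is a made-up
helper name, not an existing API. This has not been run against the RFC
series, and on 6.13 the later FSCONFIG_CMD_CREATE is expected to fail as
described above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and shape are illustrative only).
 *
 * Clones the tree at layer_path into a detached mount, idmaps it with
 * userns_fd, and passes the resulting O_PATH fd to the overlay fs
 * context as one "lowerdir+" entry.
 *
 * Returns the detached mount fd on success, -1 on failure (errno set).
 * The fd is deliberately left open: closing it dissolves the detached
 * mount, so as a precaution the caller keeps it until after
 * FSCONFIG_CMD_CREATE. Whether it may be closed earlier is not
 * established here.
 */
int add_idmapped_lower(int ovl_ctx, const char *layer_path, int userns_fd)
{
    struct mount_attr attr = {
        .attr_set  = MOUNT_ATTR_IDMAP,
        .userns_fd = (__u64)userns_fd,
    };
    int opfd;

    /* Detached copy of the layer; never attached to a visible path. */
    opfd = syscall(SYS_open_tree, AT_FDCWD, layer_path,
                   OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
    if (opfd < 0)
        return -1;

    /* Apply the idmapping to the detached mount itself. */
    if (syscall(SYS_mount_setattr, opfd, "", AT_EMPTY_PATH,
                &attr, sizeof(attr)) < 0)
        goto err;

    /* FSCONFIG_SET_FD: value must be NULL, the fd goes in aux. */
    if (syscall(SYS_fsconfig, ovl_ctx, FSCONFIG_SET_FD, "lowerdir+",
                NULL, opfd) < 0)
        goto err;

    return opfd;

err:
    close(opfd);
    return -1;
}
```

A caller would obtain ovl_ctx from fsopen("overlay", FSOPEN_CLOEXEC),
call the helper once per lower layer, set the remaining options, issue
FSCONFIG_CMD_CREATE, and only then close the returned fds.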