Re: ovl: Allow layers from anonymous mount namespaces?

Christian Brauner <brauner@xxxxxxxxxx> · Thu, 23 Jan 2025 20:21:04 +0100

On Wed, Jan 22, 2025 at 10:18:17PM -0600, Mike Baynton wrote:
> Hi,
> I've been eagerly awaiting the arrival of lowerdir+ by file handle, as
> it looks likely to be well-suited to simplifying the task a container
> runtime must take on in order to provide a set of properly idmapped
> lower layers for a user namespaced container. Currently in containerd,
> this is done by creating bindmounts for each required lower layer in
> order to apply idmapping to them. Each of these bindmounts must be
> briefly attached to some path-resolvable mountpoint before the overlay
> is created, which seems less than ideal and is contributing to some
> cleanup headaches e.g. when other software that may be present jumps on
> the new mount and starts security scanning it or whatnot.
> 
> In order to better isolate the idmap bindmounts I was hoping to do
> something like:
> 
> ovl_ctx = fsopen("overlay", FSOPEN_CLOEXEC);
> 
> opfd = open_tree(-1, "/path/to/unmapped/layer",
> OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
> mount_setattr(opfd, "", AT_EMPTY_PATH, /* attrs to set a userns_fd */);
> dfd = openat(opfd, ".", O_DIRECTORY, mode);

Unless I've forgotten a detail, the openat() shouldn't be needed, as
specifying layers via O_PATH file descriptors should just work.
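
For illustration, a minimal sketch of the per-layer steps without the
openat(): the O_PATH fd returned by open_tree() is handed to fsconfig()
directly. The helper names (idmap_attr, open_idmapped_layer,
add_lower_layer) are hypothetical, and raw syscall(2) is used on the
assumption that the installed libc may not provide wrappers. Whether the
final FSCONFIG_CMD_CREATE then succeeds for such a detached mount still
depends on the clone_private_mount() check discussed in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>        /* AT_FDCWD, O_CLOEXEC */
#include <linux/mount.h>  /* struct mount_attr, OPEN_TREE_*, FSCONFIG_* */
#include <string.h>
#include <sys/syscall.h>  /* SYS_open_tree, SYS_mount_setattr, SYS_fsconfig */
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000  /* fallback if <fcntl.h> did not expose it */
#endif

/* Hypothetical helper: attributes asking mount_setattr() to idmap a mount. */
struct mount_attr idmap_attr(int userns_fd)
{
	struct mount_attr attr;

	assert(userns_fd >= 0);  /* caller must pass an open userns fd */
	memset(&attr, 0, sizeof(attr));
	attr.attr_set = MOUNT_ATTR_IDMAP;
	attr.userns_fd = (__u64)userns_fd;
	return attr;
}

/*
 * Hypothetical helper: clone layer_path as a detached mount and idmap it.
 * Returns the O_PATH fd from open_tree(), or -1 on failure.
 */
int open_idmapped_layer(const char *layer_path, int userns_fd)
{
	struct mount_attr attr = idmap_attr(userns_fd);
	int opfd;

	opfd = (int)syscall(SYS_open_tree, AT_FDCWD, layer_path,
			    OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (opfd < 0)
		return -1;

	if (syscall(SYS_mount_setattr, opfd, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) < 0) {
		close(opfd);
		return -1;
	}
	return opfd;
}

/*
 * Hypothetical helper: append the layer to an overlay fs context obtained
 * from fsopen("overlay", ...). For FSCONFIG_SET_FD the value argument is
 * NULL and the fd travels in the aux argument.
 */
int add_lower_layer(int ovl_ctx, int layer_fd)
{
	return (int)syscall(SYS_fsconfig, ovl_ctx, FSCONFIG_SET_FD,
			    "lowerdir+", NULL, layer_fd);
}
```

A caller would invoke open_idmapped_layer() once per lower layer, pass
each returned fd to add_lower_layer(), and then issue the remaining
fsconfig() calls as in the quoted sequence.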

> 
> fsconfig(ovl_ctx, FSCONFIG_SET_FD, "lowerdir+", NULL, dfd);
> // ...other ovl_ctx fsconfigs...
> fsconfig(ovl_ctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 
> ...and this *almost* works in 6.13. The result of something like this is
> that the FSCONFIG_CMD_CREATE fails, with "overlayfs: failed to clone
> lowerpath" in dmesg. Investigating a bit, the cause is that the mount
> represented by opfd is placed in a newly allocated mount namespace
> containing only itself. When overlayfs then tries to make its own
> private copy of that mount, it uses clone_private_mount() which subjects
> any source mount to a test that its mount namespace is the task's mount
> namespace. If I just remove this one check, then userspace code like the
> above seems to happily work.
> 
> I've tried various things in userspace to move opfd to the task's mount
> namespace _without_ also attaching it to a directory tree somewhere as
> we do today, but have come up short on a way to do that.
> 
> Assuming what I'm trying to do is in line with the intended use case for
> these new(er) APIs, I'm wondering if some relatively small kernel change
> might be the best way to enable this? Perhaps clone_private_mount(),
> which seems to only be used in-tree by overlayfs, could also tolerate
> mounts in "anonymous" (when created by alloc_mnt_ns) mount namespaces or
> something?

This should be doable but requires some changes to
clone_private_mount(). I just sent an RFC patchset; note that it is
entirely untested as of now.