ovl: Allow layers from anonymous mount namespaces?

Mike Baynton <mike@xxxxxxxxxxxx> · Wed, 22 Jan 2025 22:18:17 -0600

Hi,
I've been eagerly awaiting the arrival of lowerdir+ by file handle, as
it looks likely to be well-suited to simplifying the task a container
runtime must take on in order to provide a set of properly idmapped
lower layers for a user namespaced container. Currently in containerd,
this is done by creating bindmounts for each required lower layer in
order to apply idmapping to them. Each of these bindmounts must be
briefly attached to some path-resolvable mountpoint before the overlay
is created, which seems less than ideal and contributes to cleanup
headaches, e.g. when other software on the host notices the new mount
and starts security scanning it.

In order to better isolate the idmap bindmounts I was hoping to do
something like:

ovl_ctx = fsopen("overlay", FSOPEN_CLOEXEC);

opfd = open_tree(AT_FDCWD, "/path/to/unmapped/layer",
                 OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
/* attr sets MOUNT_ATTR_IDMAP and a userns_fd */
mount_setattr(opfd, "", AT_EMPTY_PATH, &attr, sizeof(attr));
dfd = openat(opfd, ".", O_RDONLY|O_DIRECTORY);

/* FSCONFIG_SET_FD takes a NULL value and passes the fd as aux */
fsconfig(ovl_ctx, FSCONFIG_SET_FD, "lowerdir+", NULL, dfd);
// ...other ovl_ctx fsconfigs...
fsconfig(ovl_ctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

...and this *almost* works in 6.13. The result of something like this is
that the FSCONFIG_CMD_CREATE fails, with "overlayfs: failed to clone
lowerpath" in dmesg. Investigating a bit, the cause is that the mount
represented by opfd is placed in a newly allocated mount namespace
containing only itself. When overlayfs then tries to make its own
private copy of that mount, it uses clone_private_mount(), which
requires that the source mount's namespace be the calling task's mount
namespace. If I just remove this one check, then userspace code like the
above seems to work happily.
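
For anyone wanting to reproduce this, here is a rough sketch of the
per-layer part of the sequence above using raw syscall(2) wrappers. It
assumes kernel headers new enough to provide struct mount_attr and the
SYS_* numbers (5.12 or later); the helper names idmap_attr() and
add_idmapped_lower() are hypothetical, made up for illustration only.
The helper hands opfd back to the caller because, as far as I can tell,
closing the open_tree() fd dissolves the detached mount, so it should
stay open until FSCONFIG_CMD_CREATE has run.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mount.h>

/* Thin wrappers, since older glibc versions do not provide these calls. */
static int sys_open_tree(int dfd, const char *path, unsigned int flags)
{
	return (int)syscall(SYS_open_tree, dfd, path, flags);
}

static int sys_mount_setattr(int dfd, const char *path, unsigned int flags,
			     struct mount_attr *attr, size_t size)
{
	return (int)syscall(SYS_mount_setattr, dfd, path, flags, attr, size);
}

static int sys_fsconfig(int fd, unsigned int cmd, const char *key,
			const void *value, int aux)
{
	return (int)syscall(SYS_fsconfig, fd, cmd, key, value, aux);
}

/* Hypothetical helper: attribute block requesting an idmapping from userns_fd. */
static struct mount_attr idmap_attr(int userns_fd)
{
	struct mount_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.attr_set = MOUNT_ATTR_IDMAP;
	attr.userns_fd = (__u64)userns_fd;
	return attr;
}

/*
 * Hypothetical helper: clone `layer` as a detached mount, idmap it, and add
 * it to the overlay context as a lower layer. Returns the open_tree() fd on
 * success (keep it open until after FSCONFIG_CMD_CREATE) or -1 on failure.
 */
static int add_idmapped_lower(int ovl_ctx, const char *layer, int userns_fd)
{
	struct mount_attr attr = idmap_attr(userns_fd);
	int opfd, dfd, ret;

	assert(layer != NULL);

	opfd = sys_open_tree(AT_FDCWD, layer,
			     OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (opfd < 0)
		return -1;

	if (sys_mount_setattr(opfd, "", AT_EMPTY_PATH, &attr, sizeof(attr)) < 0)
		goto fail;

	/* A regular directory fd, as in the sketch above, rather than opfd itself. */
	dfd = openat(opfd, ".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (dfd < 0)
		goto fail;

	/* FSCONFIG_SET_FD: value is NULL, the fd travels in aux. */
	ret = sys_fsconfig(ovl_ctx, FSCONFIG_SET_FD, "lowerdir+", NULL, dfd);
	close(dfd);
	if (ret < 0)
		goto fail;

	return opfd;
fail:
	close(opfd);
	return -1;
}
```

A caller would create ovl_ctx with fsopen("overlay", ...), call
add_idmapped_lower() once per layer, issue FSCONFIG_CMD_CREATE, and only
then close the returned fds. On 6.13 the create step is where I'd expect
the failure described above to show up.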

I've tried various things in userspace to move opfd to the task's mount
namespace _without_ also attaching it to a directory tree somewhere as
we do today, but have come up short on a way to do that.

Assuming what I'm trying to do is in line with the intended use case for
these new(er) APIs, I'm wondering if some relatively small kernel change
might be the best way to enable this? Perhaps clone_private_mount(),
which seems to only be used in-tree by overlayfs, could also tolerate
mounts in "anonymous" mount namespaces (i.e. ones flagged anonymous when
created by alloc_mnt_ns()), or something along those lines?
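
To make the idea concrete, here is a toy userspace model of the two
policies. It is not kernel code: the toy_* types, the `anon` field, and
both function names are made up for illustration, and the real check has
other conditions that this model ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; these are not the real kernel types. */
struct toy_mnt_ns {
	bool anon;	/* true for a namespace holding only a detached tree */
};

struct toy_mount {
	const struct toy_mnt_ns *ns;	/* namespace the mount belongs to */
};

/* Models today's behaviour: only mounts in the task's own namespace pass. */
static bool may_clone_strict(const struct toy_mount *m,
			     const struct toy_mnt_ns *task_ns)
{
	assert(m != NULL && task_ns != NULL);
	return m->ns == task_ns;
}

/* Models the suggested relaxation: anonymous namespaces are also accepted. */
static bool may_clone_relaxed(const struct toy_mount *m,
			      const struct toy_mnt_ns *task_ns)
{
	assert(m != NULL && task_ns != NULL);
	if (m->ns == task_ns)
		return true;
	return m->ns != NULL && m->ns->anon;
}
```

In this model a mount produced by open_tree(OPEN_TREE_CLONE) fails the
strict check and passes the relaxed one, while a mount belonging to some
other, non-anonymous namespace is still rejected by both.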

Thanks
Mike