Re: ovl: Allow layers from anonymous mount namespaces?

Mike Baynton <mike@xxxxxxxxxxxx> · Thu, 23 Jan 2025 23:40:41 -0600

On 1/23/25 13:21, Christian Brauner wrote:
> On Wed, Jan 22, 2025 at 10:18:17PM -0600, Mike Baynton wrote:
>> Hi,
>> I've been eagerly awaiting the arrival of lowerdir+ by file handle, as
>> it looks likely to be well-suited to simplifying the task a container
>> runtime must take on in order to provide a set of properly idmapped
>> lower layers for a user namespaced container. Currently in containerd,
>> this is done by creating bindmounts for each required lower layer in
>> order to apply idmapping to them. Each of these bindmounts must be
>> briefly attached to some path-resolvable mountpoint before the overlay
>> is created, which seems less than ideal and is contributing to some
>> cleanup headaches e.g. when other software that may be present jumps on
>> the new mount and starts security scanning it or whatnot.
>>
>> In order to better isolate the idmap bindmounts I was hoping to do
>> something like:
>>
>> ovl_ctx = fsopen("overlay", FSOPEN_CLOEXEC);
>>
>> opfd = open_tree(-1, "/path/to/unmapped/layer",
>> OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
>> mount_setattr(opfd, "", AT_EMPTY_PATH, &attr /* sets a userns_fd */, sizeof(attr));
>> dfd = openat(opfd, ".", O_RDONLY | O_DIRECTORY);
> 
> Unless I forgot details, openat() shouldn't be needed as specifying
> layers via O_PATH file descriptors should just work.

O_PATH descriptors currently result in EBADF, iirc just because fsconfig
with FSCONFIG_SET_FD looks up the file descriptor in a way that masks
O_PATH. That took some time to work out, but it doesn't strike me as a
huge deal. I suppose, though, that it's one of those things where, if it
were improved far down the road, next to nobody would go back and remove
the openat().
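
For reference, a minimal sketch of the workaround in isolation: reopen
an O_PATH directory descriptor through "." to get a regular descriptor.
The helper names reopen_dir_fd() and fd_is_o_path() are hypothetical and
only for illustration; the sketch shows the reopen step and nothing
about overlayfs itself.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: given an O_PATH descriptor for a directory,
 * return a regular O_RDONLY directory descriptor for the same
 * directory, or -1 with errno set.
 */
static int reopen_dir_fd(int path_fd)
{
	return openat(path_fd, ".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/*
 * Hypothetical helper: 1 if fd was opened with O_PATH, 0 if not,
 * -1 on error (errno set by fcntl).
 */
static int fd_is_o_path(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return (flags & O_PATH) != 0;
}
```

The descriptor returned by reopen_dir_fd() is the one that would be
handed to FSCONFIG_SET_FD; the O_PATH one can stay open alongside it.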

> 
>>
>> fsconfig(ovl_ctx, FSCONFIG_SET_FD, "lowerdir+", NULL, dfd);
>> // ...other ovl_ctx fsconfigs...
>> fsconfig(ovl_ctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
>>
>> ...and this *almost* works in 6.13. The result of something like this is
>> that the FSCONFIG_CMD_CREATE fails, with "overlayfs: failed to clone
>> lowerpath" in dmesg. Investigating a bit, the cause is that the mount
>> represented by opfd is placed in a newly allocated mount namespace
>> containing only itself. When overlayfs then tries to make its own
>> private copy of that mount, it uses clone_private_mount() which subjects
>> any source mount to a test that its mount namespace is the task's mount
>> namespace. If I just remove this one check, then userspace code like the
>> above seems to happily work.
>>
>> I've tried various things in userspace to move opfd to the task's mount
>> namespace _without_ also attaching it to a directory tree somewhere as
>> we do today, but have come up short on a way to do that.
>>
>> Assuming what I'm trying to do is in line with the intended use case for
>> these new(er) APIs, I'm wondering if some relatively small kernel change
>> might be the best way to enable this? Perhaps clone_private_mount(),
>> which seems to only be used in-tree by overlayfs, could also tolerate
>> mounts in "anonymous" (when created by alloc_mnt_ns) mount namespaces or
>> something?
> 
> This should be doable but requires some changes to
> clone_private_mount(). I just sent an RFC patchset.
> The patchset is entirely untested as of now.

That's awesome, I really appreciate your prompt attention to this!
I applied your patch and confirmed that it works for my use case.
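
For anyone following along, here is a rough sketch of the sequence from
my original mail written against raw syscalls, so it does not depend on
libc wrappers. It is a sketch under assumptions, not the containerd
code: the helper name ovl_ctx_with_idmapped_lower() is hypothetical,
userns_fd is assumed to be an already-open user namespace descriptor,
and it needs kernel headers that define mount_setattr (5.12 or later).
It stops after adding one lowerdir+; the remaining fsconfig calls and
FSCONFIG_CMD_CREATE are left to the caller.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/mount.h>	/* OPEN_TREE_*, MOUNT_ATTR_IDMAP, FSCONFIG_* */
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper. Returns an overlay fs context fd with one
 * idmapped lower layer added, or -errno. On success *keep_fd holds the
 * detached mount's fd; the caller should keep it open until after
 * FSCONFIG_CMD_CREATE, since closing it dissolves the detached mount.
 */
static int ovl_ctx_with_idmapped_lower(const char *layer, int userns_fd,
				       int *keep_fd)
{
	struct mount_attr attr = {
		.attr_set = MOUNT_ATTR_IDMAP,
		.userns_fd = (__u64)userns_fd,
	};
	int opfd, dfd = -1, ctx = -1, err;

	/* Detached private copy of the layer; never attached to a path. */
	opfd = syscall(SYS_open_tree, -1, layer,
		       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (opfd < 0)
		return -errno;

	/* Apply the idmapping to the detached mount. */
	if (syscall(SYS_mount_setattr, opfd, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) < 0)
		goto fail;

	/* open_tree() returns an O_PATH fd; reopen it as a regular one. */
	dfd = openat(opfd, ".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (dfd < 0)
		goto fail;

	ctx = syscall(SYS_fsopen, "overlay", FSOPEN_CLOEXEC);
	if (ctx < 0)
		goto fail;

	if (syscall(SYS_fsconfig, ctx, FSCONFIG_SET_FD, "lowerdir+",
		    NULL, dfd) < 0)
		goto fail;

	/* Assumption: overlayfs holds its own reference to the path now. */
	close(dfd);
	*keep_fd = opfd;
	return ctx;

fail:
	err = errno;
	if (ctx >= 0)
		close(ctx);
	if (dfd >= 0)
		close(dfd);
	close(opfd);
	return -err;
}
```

On an unpatched 6.13 I would expect this helper itself to succeed and
the later FSCONFIG_CMD_CREATE to be the call that fails, as described
in my original mail.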

Thanks
Mike