On 1/23/25 13:21, Christian Brauner wrote:
> On Wed, Jan 22, 2025 at 10:18:17PM -0600, Mike Baynton wrote:
>> Hi,
>> I've been eagerly awaiting the arrival of lowerdir+ by file handle, as
>> it looks likely to be well-suited to simplifying the task a container
>> runtime must take on in order to provide a set of properly idmapped
>> lower layers for a user namespaced container. Currently in containerd,
>> this is done by creating bindmounts for each required lower layer in
>> order to apply idmapping to them. Each of these bindmounts must be
>> briefly attached to some path-resolvable mountpoint before the overlay
>> is created, which seems less than ideal and is contributing to some
>> cleanup headaches e.g. when other software that may be present jumps on
>> the new mount and starts security scanning it or whatnot.
>>
>> In order to better isolate the idmap bindmounts I was hoping to do
>> something like:
>>
>> ovl_ctx = fsopen("overlay", FSOPEN_CLOEXEC);
>>
>> opfd = open_tree(-1, "/path/to/unmapped/layer",
>>                  OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
>> mount_setattr(opfd, "", AT_EMPTY_PATH, /* attrs to set a userns_fd */);
>> dfd = openat(opfd, ".", O_DIRECTORY, mode);
>
> Unless I forgot details, openat() shouldn't be needed as specifying
> layers via O_PATH file descriptors should just work.

O_PATH ones currently result in EBADF, iirc just because fsconfig with
FSCONFIG_SET_FD looks up the file descriptor in a way that masks O_PATH.
This took some time to work out too, but doesn't strike me as a huge
deal. Although I suppose it's one of those things that, if it were
improved far down the road, would probably lead to next to nobody
removing the openat().

>
>>
>> fsconfig(ovl_ctx, FSCONFIG_SET_FD, "lowerdir+", dfd);
>> // ...other ovl_ctx fsconfigs...
>> fsconfig(ovl_ctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
>>
>> ...and this *almost* works in 6.13.
>> The result of something like this is
>> that the FSCONFIG_CMD_CREATE fails, with "overlayfs: failed to clone
>> lowerpath" in dmesg. Investigating a bit, the cause is that the mount
>> represented by opfd is placed in a newly allocated mount namespace
>> containing only itself. When overlayfs then tries to make its own
>> private copy of that mount, it uses clone_private_mount() which subjects
>> any source mount to a test that its mount namespace is the task's mount
>> namespace. If I just remove this one check, then userspace code like the
>> above seems to happily work.
>>
>> I've tried various things in userspace to move opfd to the task's mount
>> namespace _without_ also attaching it to a directory tree somewhere as
>> we do today, but have come up short on a way to do that.
>>
>> Assuming what I'm trying to do is in line with the intended use case for
>> these new(er) APIs, I'm wondering if some relatively small kernel change
>> might be the best way to enable this? Perhaps clone_private_mount(),
>> which seems to only be used in-tree by overlayfs, could also tolerate
>> mounts in "anonymous" (when created by alloc_mnt_ns) mount namespaces or
>> something?
>
> This should be doable but requires some changes to
> clone_private_mount(). I just sent an RFC patchset.
> The patchset is entirely untested as of now.

That's awesome, I really appreciate your prompt attention to this!
Applied and confirmed your patch works for my use case.

Thanks
Mike