Hi Christian, Thanks for the review! On Fri, Mar 07 2025, Christian Brauner wrote: > On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote: >> The File Descriptor Box (FDBox) is a mechanism for userspace to name >> file descriptors and give them over to the kernel to hold. They can >> later be retrieved by passing in the same name. >> >> The primary purpose of FDBox is to be used with Kexec Handover (KHO). >> There are many kinds anonymous file descriptors in the kernel like >> memfd, guest_memfd, iommufd, etc. that would be useful to be preserved >> using KHO. To be able to do that, there needs to be a mechanism to label >> FDs that allows userspace to set the label before doing KHO and to use >> the label to map them back after KHO. FDBox achieves that purpose by >> exposing a miscdevice which exposes ioctls to label and transfer FDs >> between the kernel and userspace. FDBox is not intended to work with any >> generic file descriptor. Support for each kind of FDs must be explicitly >> enabled. > > This makes no sense as a generic concept. If you want to restore shmem > and possibly anonymous inodes files via KHO then tailor the solution to > shmem and anon inodes but don't make this generic infrastructure. This > has zero chances to cover generic files. > > As soon as you're dealing with non-kernel internal mounts that are not > guaranteed to always be there or something that depends on superblock or > mount specific information that can change you're already screwed. This > will end up a giant mess. This is not supportable or maintainable. As Jason mentioned, this it _not_ intended to be a generic concept. My documentation also says that, but perhaps that was not clear enough. It is supposed to work with only specific type of file descriptors that explicitly enable support for it in the context of kexec handover. I think it might be a good idea to have an explicit dependency on KHO so this distinction is a bit clearer. It is also not intended to be completely transparent to userspace, where they magically get their FD back exactly as they put in. I think we should limit the amount of state we want to guarantee, since it directly contributes to our ABI exposure to later kernels. The more state we track, the more complex and inflexible our ABI becomes. So use of this very much needs an enlightened userspace. As an example, with memfd, the main purpose of persistence across kexec is its memory contents. The application needs a way to carry its memory across, and we provide it a mechanism to do so via FDBox. So we only guarantee that the memory contents are preserved, along with some small metadata, instead of the whole inode which is a lot more complex. The application would then need to be aware of it and expect such changes. The idea is also _not_ to have all the FDs of userspace into the box. They should only put in the ones that they specifically need, and re-open the rest normally. For example, in a live update scenario, the VMM can put in the guest_memfds, the iommufds, and so on, but then re-open configuration files or VM metadata via the normal path. > > And struct file should have zero to do with this KHO stuff. It doesn't > need to carry new operations and it doesn't need to waste precious space > for any of this. That is a fair point. The main reason I did it this way is because memfd does not have file_operations of its own. To enable the memfd abstraction, I wanted the FDBox callback to go into memfd, and it can pass it down to shmem or hugetlbfs. Having the pointer in struct file makes that easy. I think I can find ways to make it work via file_operations, so I will do that in the next version. > >> >> While the primary purpose of FDBox is to be used with KHO, it does not >> explicitly require CONFIG_KEXEC_HANDOVER, since it can be used without >> KHO, simply as a way to preserve or transfer FDs when userspace exits. > > This use-case is covered with systemd's fdstore and it's available to > unprivileged userspace. Stashing arbitrary file descriptors in the > kernel in this way isn't a good idea. For one, it can't be arbitrary FDs, but only explicitly enabled ones. Beyond that, while not intended, there is no way to stop userspace from using it as a stash. Stashing FDs is a needed operation for this to work, and there is no way to guarantee in advance that userspace will actually use it for KHO, and not just stash it to grab back later. I think at least having an explicit dependency on CONFIG_KEXEC_HANDOVER can help a bit at least. [...] >> + >> + ret = close_fd(put_fd.fd); >> + if (ret) { >> + struct fdbox_fd *del; >> + >> + del = fdbox_remove_fd(box, put_fd.name); >> + /* >> + * If we fail to remove from list, it means someone else took >> + * the FD out. In that case, they own the refcount of the file >> + * now. >> + */ >> + if (del == box_fd) >> + fput(file); > > This is a racy mess. Why would adding a file to an fdbox be coupled with > closing it concpetually? The caller should close the file descriptor > itself and not do this close_fd() here in the kernel. Makes sense. We can make it a requirement to have all open FDs of the file be closed by userspace before the box can be sealed. > >> + >> + return ret; >> + } >> + >> + return 0; >> +} >> + [...] >> +static long box_fops_unl_ioctl(struct file *filep, >> + unsigned int cmd, unsigned long arg) >> +{ >> + struct fdbox *box = filep->private_data; >> + long ret = -EINVAL; >> + >> + if (!capable(CAP_SYS_ADMIN)) >> + return -EPERM; >> + >> + switch (cmd) { >> + case FDBOX_PUT_FD: >> + ret = fdbox_put_fd(box, arg); >> + break; >> + case FDBOX_UNSEAL: >> + ret = fdbox_unseal(box); >> + break; >> + case FDBOX_SEAL: >> + ret = fdbox_seal(box); >> + break; >> + case FDBOX_GET_FD: >> + ret = fdbox_get_fd(box, arg); >> + break; > > How does userspace know what file descriptors are in this fdbox if only > put and get are present? Userspace just remembers the names and > otherwise it simply leaks files that no one remembered? For now, it is supposed to remember that, but having a FDBOX_LIST operation should be simple enough. Will add that in the next revision. Also, if userspace suspects it forgot some, it can always delete the box to clean up the leftover ones. > >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + >> + return ret; >> +} >> + [...] -- Regards, Pratyush Yadav