On Fri, Mar 07, 2025 at 11:14:17AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 07, 2025 at 10:31:39AM +0100, Christian Brauner wrote:
> > On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote:
> > > The File Descriptor Box (FDBox) is a mechanism for userspace to name
> > > file descriptors and give them over to the kernel to hold. They can
> > > later be retrieved by passing in the same name.
> > >
> > > The primary purpose of FDBox is to be used with Kexec Handover (KHO).
> > > There are many kinds of anonymous file descriptors in the kernel like
> > > memfd, guest_memfd, iommufd, etc. that would be useful to be preserved
> > > using KHO. To be able to do that, there needs to be a mechanism to label
> > > FDs that allows userspace to set the label before doing KHO and to use
> > > the label to map them back after KHO. FDBox achieves that purpose by
> > > providing a miscdevice that exposes ioctls to label and transfer FDs
> > > between the kernel and userspace. FDBox is not intended to work with any
> > > generic file descriptor. Support for each kind of FD must be explicitly
> > > enabled.
> >
> > This makes no sense as a generic concept. If you want to restore shmem
> > and possibly anonymous inode files via KHO then tailor the solution to
> > shmem and anon inodes but don't make this generic infrastructure. This
> > has zero chance of covering generic files.
>
> We need it to cover a range of FD types in the kernel like iommufd and

anonymous inode

> vfio.

anonymous inode

>
> It is not "generic" in the sense that every FD in the kernel magically
> works with fdbox, but that any driver/subsystem providing an FD could be
> enlightened to support it.
>
> Very much do not want the infrastructure tied to just shmem and memfd.

Anything you can reasonably want will be backed by an internal shmem
mount, devtmpfs, or anonymous inodes. Anything else isn't going to work.
>
> > As soon as you're dealing with non-kernel internal mounts that are not
> > guaranteed to always be there or something that depends on superblock or
> > mount-specific information that can change, you're already screwed.
>
> This is really targeted at anonymous or character device file
> descriptors that don't have issues with mounts.
>
> Same remark about inode permissions and what not. The successor
> kernel would be responsible for securing the FDBOX, and when it takes
> anything out it has to relabel it if required.
>
> inode #s and things can change because this is not something like CRIU
> that would have state linked to inode numbers. The applications in the
> successor kernels are already very special; they will need to cope with
> inode number changes along with all the other special stuff they do.
>
> > And struct file should have zero to do with this KHO stuff. It doesn't
> > need to carry new operations and it doesn't need to waste precious space
> > for any of this.
>
> Yeah, it should go through file_operations in some way.

I'm fine with a new method. There are not going to be three new methods
just for the sake of this special-purpose thing. And I want this to be
part of fs/ and co-maintained by fs people.

I'm not yet sold on this needing to be a character device, because that's
fundamentally limiting in how useful it can be. It might be far more
useful as a separate tiny filesystem where preserved files simply show up
as named entries that you can open, instead of ioctl()ing your way through
character devices. But I need to think about that.