On Wed, 2024-10-02 at 15:55 -0300, Jason Gunthorpe wrote:
> On Mon, Sep 16, 2024 at 01:30:54PM +0200, James Gowans wrote:
> > Now actually implementing the serialise callback for iommufd.
> > On KHO activate, iterate through all persisted domains and write their
> > metadata to the device tree format. For now just a few fields are
> > serialised to demonstrate the concept. To actually make this useful a
> > lot more field and related objects will need to be serialised too.
>
> But isn't that a rather difficult problem? The "a lot more fields"
> include things like pointers to the mm struct, the user_struct and
> task_struct, then all the pinning accounting as well.
>
> Coming work extends this to memfds and more is coming. I would expect
> this KHO stuff to use the memfd-like path to access the physical VM
> memory too.
>
> I think expecting to serialize and restore everything like this is
> probably much too complicated.

On reflection I think you're right - this will be complex both from a
development and a maintenance perspective, trying to make sure we
serialise all the necessary state and reconstruct it correctly. It gets
even more complex when structs are refactored or changed across kernel
versions. An important requirement of this functionality is the ability
to kexec between different kernel versions, including going back to an
older kernel version in the case of a rollback.

So, let's look at other options:

> If you could just retain a small portion and then directly reconstruct
> the missing parts it seems like it would be more maintainable.

I think we have three possible approaches here:

1. What this RFC is sketching out: serialising fields from the structs
and setting those fields again on deserialise. As you point out this
will be complicated.

2. Get userspace to do the work: userspace needs to re-do the ioctls
after kexec to reconstruct the objects.
My main issue with this approach is that the kernel needs some sort of
trust-but-verify mechanism to ensure that userspace constructs
everything the same way after kexec as it did before kexec. We don't
want to end up in a state where the iommufd objects don't match the
persisted page tables.

3. Serialise and replay the ioctls. Ioctl APIs and payloads should
(must?) be stable across kernel versions. If IOMMUFD records the ioctls
executed by userspace then it could replay them as part of deserialise
and give userspace a handle to the resulting objects after kexec. This
way we are guaranteed consistent iommufd / IOAS objects. By "consistent"
I mean they are the same as before kexec and match the persisted page
tables. Having the kernel do this means it doesn't need to depend on
userspace doing the correct thing.

What do you think of this 3rd approach? I can try to sketch it out and
send another RFC if you think it sounds reasonable.

> Ie "recover" a HWPT from a KHO on a manually created a IOAS with the
> right "memfd" for the backing storage. Then the recovery can just
> validate that things are correct and adopt the iommu_domain as the
> hwpt.

This sounds more like option 2, where we expect userspace to re-drive
the ioctls but the kernel verifies that they carry payloads matching
those from before kexec, so that iommufd objects are consistent with the
persisted page tables. If the kernel is doing verification anyway,
wouldn't it be better for the kernel to do the ioctl work itself and
give the resulting objects to userspace?

> Eventually you'll want this to work for the viommus as well, and that
> seems like a lot more tricky complexity..
>
> Jason
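
To make option 3 (record and replay) a bit more concrete, here is a toy
userspace model of the idea. It is only a sketch under loud assumptions:
the command numbers, the structs and every toy_* helper are made up for
illustration and are not iommufd's real ioctls or data structures. It
only shows that a log of successfully executed calls can be enough to
rebuild the same state, without serialising the state itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command numbers for this toy model; NOT real iommufd ioctls. */
enum toy_cmd { TOY_IOAS_ALLOC = 1, TOY_IOAS_MAP = 2 };

/* One recorded call: the command plus a fixed-layout payload. */
struct toy_record {
	uint32_t cmd;
	uint64_t args[3];	/* MAP: ioas index, iova, length; ALLOC: unused */
};

#define TOY_MAX_RECORDS 16
#define TOY_MAX_IOAS 4

/* The log is the only thing that would need to survive "kexec". */
struct toy_log {
	size_t count;
	struct toy_record rec[TOY_MAX_RECORDS];
};

/* Stand-in for the object state that is hard to serialise field by field. */
struct toy_state {
	uint32_t nr_ioas;
	uint64_t mapped_bytes[TOY_MAX_IOAS];
};

/* Apply one command to the state. Returns 0 on success, -1 on bad input. */
static int toy_apply(struct toy_state *st, const struct toy_record *r)
{
	assert(st && r);
	switch (r->cmd) {
	case TOY_IOAS_ALLOC:
		if (st->nr_ioas >= TOY_MAX_IOAS)
			return -1;
		st->mapped_bytes[st->nr_ioas++] = 0;
		return 0;
	case TOY_IOAS_MAP:
		if (r->args[0] >= st->nr_ioas)
			return -1;
		st->mapped_bytes[r->args[0]] += r->args[2];
		return 0;
	default:
		return -1;
	}
}

/* "ioctl" entry point: apply the command and record it only if it succeeded. */
static int toy_ioctl(struct toy_state *st, struct toy_log *log,
		     const struct toy_record *r)
{
	if (log->count >= TOY_MAX_RECORDS)
		return -1;
	if (toy_apply(st, r))
		return -1;
	log->rec[log->count++] = *r;
	return 0;
}

/* After "kexec": rebuild a fresh state purely from the persisted log. */
static int toy_replay(const struct toy_log *log, struct toy_state *out)
{
	memset(out, 0, sizeof(*out));
	for (size_t i = 0; i < log->count; i++)
		if (toy_apply(out, &log->rec[i]))
			return -1;
	return 0;
}

/* Field-wise comparison; returns 1 if both states match, 0 otherwise. */
static int toy_state_equal(const struct toy_state *a, const struct toy_state *b)
{
	if (a->nr_ioas != b->nr_ioas)
		return 0;
	for (uint32_t i = 0; i < a->nr_ioas; i++)
		if (a->mapped_bytes[i] != b->mapped_bytes[i])
			return 0;
	return 1;
}
```

A caller would drive toy_ioctl() before the handover, carry only the
struct toy_log across, then call toy_replay() and check the result with
toy_state_equal(). Failed calls are deliberately not logged, so replay
only sees commands that were accepted the first time. A real
implementation would also have to deal with destroyed objects, ID
allocation order and payloads that reference process state, none of
which this toy attempts.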