Hi Jason, Thanks for this great feedback on the approach - it's exactly the sort of thing we were looking for. On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote: > > On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote: > > > The main aspect we’re looking for feedback/opinions on here is the concept of > > putting all persistent state in a single filesystem: combining guest RAM and > > IOMMU pgtables in one store. Also, the question of a hard separation between > > persistent memory and ephemeral memory, compared to allowing arbitrary pages to > > be persisted. Pkernfs does it via a hard separation defined at boot time, other > > approaches could make the carving out of persistent pages dynamic. > > I think if you are going to attempt something like this then the end > result must bring things back to having the same data structures fully > restored. > > It is fine that the pkernfs holds some persistant memory that > guarentees the IOMMU can remain programmed and the VM pages can become > fixed across the kexec > > But once the VMM starts to restore it self we need to get back to the > original configuration: > - A mmap that points to the VM's physical pages > - An iommufd IOAS that points to the above mmap > - An iommufd HWPT that represents that same mapping > - An iommu_domain programmed into HW that the HWPT (A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for now because that's what I know, but will explore iommufd more and reply in more detail about iommufd in the other email thread.) How much of this do you think should be done automatically, vs how much should userspace need to drive? With this RFC userspace basically re- drives everything, including re-injecting the file containing the persistent page tables into the IOMMU domain via VFIO. Part of the reason is simplicity, to avoid having auto-deserialise code paths in the drivers and modules. Another part of the reason so that userspace can get FD handles on the resources. Typically FDs are returned by doing actions like creating VFIO containers. If we make all that automatic then there needs to be some other mechanism for auto- restored resources to present themselves to userspace so that userspace can discover and pick them up again. One possible way to do this would be to populate a bunch of files in procfs for each persisted IOMMU domain that allows userspace to discover and pick it up. Can you expand on what you mean by "A mmap that points to the VM's physical pages?" Are you suggesting that the QEMU process automatically gets something appearing in it's address space? Part of the live update process involves potentially changing the userspace binaries: doing kexec and booting a new system is an opportunity to boot new versions of the userspace binary. So we shouldn't try to preserve too much of userspace state; it's better to let it re-create internal data structures do fresh mmaps. What I'm really asking is: do you have a specific suggestion about how these persistent resources should present themselves to userspace and how userspace can discover them and pick them up? > > Ie you can't just reboot and leave the IOMMU hanging out in some > undefined land - especially in latest kernels! Not too sure what you mean by "undefined land" - the idea is that the IOMMU keeps doing what it was going until userspace comes along re- creates the handles to the IOMMU at which point it can do modifications like change mappings or tear the domain down. This is what deferred attached gives us, I believe, and why I had to change it to be enabled. Just leave the IOMMU domain alone until userspace re-creates it with the original tables. Perhaps I'm missing your point. :-) > > For vt-d you need to retain the entire root table and all the required > context entries too, The restarting iommu needs to understand that it > has to "restore" a temporary iommu_domain from the pkernfs. > You can later reconstitute a proper iommu_domain from the VMM and > atomic switch. Why does it need to go via a temporary domain? The current approach is just to leave the IOMMU domain running as-is via deferred attached, and later when userspace starts up it will create the iommu_domain backed by the same persistent page tables. > > So, I'm surprised to see this approach where things just live forever > in the kernfs, I don't see how "restore" is going to work very well > like this. Can you expand on why the suggested restore path will be problematic? In summary the idea is to re-create all of the "ephemeral" data structures by re-doing ioctls like MAP_DMA, but keeping the persistent IOMMU root/context tables pointed at the original persistent page tables. The ephemeral data structures are re-created in userspace but the persistent page tables left alone. This is of course dependent on userspace re- creating things *correctly* - it can easily do the wrong thing. Perhaps this is the issue? Or is there a problem even if userspace is sane. > I would think that a save/restore mentalitity would make more > sense. For instance you could make a special iommu_domain that is fixed > and lives in the pkernfs. The operation would be to copy from the live > iommu_domain to the fixed one and then replace the iommu HW to the > fixed one. > > In the post-kexec world the iommu would recreate that special domain > and point the iommu at it. (copying the root and context descriptions > out of the pkernfs). Then somehow that would get into iommufd and VFIO > so that it could take over that special mapping during its startup. The save and restore model is super interesting - I'm keen to discuss this as an alternative. You're suggesting that IOMMU driver have a serialise phase just before kexec where it dumps everything into persistent memory and then after kexec pulls it back into ephemeral memory. That's probably do-able, but it may increase the critical section latency of live update (every millisecond counts!) and I'm also not too sure what that buys compared to always working with persistent memory and just always being in a state where persistent data is always being used and can be picked up as-is. However, the idea of a serialise and deserialise operation is relevant to a possible alternative to this RFC. My colleague Alex Graf is working on a framework called Kexec Hand Over (KHO): https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx/#r That allows drivers/modules to mark arbitrary memory pages as persistent (ie: not allocatable by next kernel) and to pass over some serialised state across kexec. An alternative to IOMMU domain persistence could be to use KHO to mark the IOMMU root, context and domain page table pages as persistent via KHO. > > Then you'd build the normal operating ioas and hwpt (with all the > right page refcounts/etc) then switch to it and free the pkernfs > memory. > > It seems alot less invasive to me. The special case is clearly a > special case and doesn't mess up the normal operation of the drivers. Yes, a serialise/deserialise approach does have this distinct advantage of not needing to change the alloc/free code paths. Pkernfs requires a shim in the allocator to use persistent memory. > > > * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file. > > Really we should move the abstraction one level up and make the whole VFIO > > container persistent via a pkernfs file. That way you’d "just" re-open the VFIO > > container file and all of the DMA mappings inside VFIO would already be set up. > > I doubt this.. It probably needs to be much finer grained actually, > otherwise you are going to be serializing everything. Somehow I think > you are better to serialize a minimum and try to reconstruct > everything else in userspace. Like conserving iommufd IDs would be a > huge PITA. > > There are also going to be lots of security questions here, like we > can't just let userspace feed in any garbage and violate vfio and > iommu invariants. Right! This is definitely one of the big gaps at the moment: this approach requires that VFIO has the same state re-driven into it from userspace so that the persistent and ephemeral data match. If userspace does something dodgy, well, it may cause problems. :-) That's exactly why I thought we should move the abstraction up to a level that doesn't depend on userspace re-driving data. It sounds like you were suggesting similar in the first part of your comment, but I didn't fully understand how you'd like to see it presented to userspace. JG _______________________________________________ kexec mailing list kexec@xxxxxxxxxxxxxxxxxxx http://lists.infradead.org/mailman/listinfo/kexec