Re: [RFC PATCH 05/13] iommufd: Serialise persisted iommufds and ioas

"Gowans, James" <jgowans@xxxxxxxxxx> · Thu, 10 Oct 2024 15:12:09 +0000

On Wed, 2024-10-09 at 09:28 -0300, Jason Gunthorpe wrote:
> On Wed, Oct 09, 2024 at 11:44:30AM +0000, Gowans, James wrote:
> 
> > Okay, but in general this still means that the page tables must have
> > exactly the same translations if we try to switch from one set to
> > another. If it is possible to change translations then translation table
> > entries could be created at different granularity (PTE, PMD, PUD) level
> > which would violate this requirement.
> 
> Yes, but we strive to make page tables consistently and it isn't that
> often that we get new features that would chang the layout (contig bit
> for instance). I'd suggest in these cases you'd add some creation flag
> to the HWPT that can inhibit the new feature and your VMM will deal
> with it.
> 
> Or you sweep it and manually split/join to deal with BBM < level
> 2. Generic pt will have code to do all of this so it is not that bad.
> 
> If this little issue already scares you then I don't think I want to
> see you serialize anything more complex, there are endless scenarios
> for compatibility problems :\

The things that scare me is some subtle page table difference which
causes silent data corruption... This is one of the reasons I liked re-
using the existing tables, there is no way for this sort of subtle bug
to happen.
In general I agree with your point about reducing the complexity of what
we serialise and rather exercising the machinery again after kexec to
create fresh objects. The only (?) counter point to that is about
performance of anything in the hot path (discussed more below).

> 
> > If we say that to be safe/correct in the general case then it is
> > necessary for the translations to be *exactly* the same before and after
> > kexec, is there any benefit to building new translation tables and
> > switching to them? We may as well continue to use the exact same page
> > tables and construct iommufd objects (IOAS, etc) to match.
> 
> The benifit is principally that you did all the machinery to get up to
> that point, including re-pinning and so forth all the memory, instead
> of trying to magically recover that additional state.
> 
> This is the philosophy that you replay instead of de-serialize, so you
> have to replay into a page table at some level to make that work.

We could have some "skip_pgtable_update" flag which the replay machinery
sets, allowing IOMMUFD to create fresh objects internally and leave the
page tables alone?

Anyway, that sort of thing is a potential future optimisation we could
do. In the interests of making progress I'll re-work this RFC to re-
create everything (iommufd objects, IOMMU driver domains and page
tables) and do the atomic handover of page tables after re-
initialisation.

> 
> 
> > then it would be useful to avoid rebuilding identical tables. Maybe it
> > ends up being in the "warm" path - the VM can start running but will
> > sleep if taking a page fault before IOMMUFD is re-initalised...
> 
> I didn't think you'd support page faults? There are bigger issues here
> if you expect to have a vIOMMU in the guest.

vIOMMU is one case, but another is memory oversubscription. With PRI/ATS
we can oversubscribe memory which is DMA mapped. In that case a page
fault would be a blocking operation until IOMMUFD is all set up and
ready to go. I suspect there will be benefit in getting this fast, but
as long as we have a path to optimise it in future I'm totally fine to
start with re-creating everything.

JG