On Wed, Sep 25, 2024 at 10:44:12AM +0200, Vishal Annapurve wrote:
> On Tue, Sep 24, 2024 at 2:07 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >
> > On Mon, Sep 23, 2024 at 11:52:19PM +0000, Tian, Kevin wrote:
> > > > IMHO we should try to do as best we can here, and the ideal interface
> > > > would be a notifier to switch the shared/private pages in some portion
> > > > of the guestmemfd. With the idea that iommufd could perhaps do it
> > > > atomically.
> > >
> > > yes atomic replacement is necessary here, as there might be in-fly
> > > DMAs to pages adjacent to the one being converted in the same
> > > 1G hunk. Unmap/remap could potentially break it.
> >
> > Yeah.. This integration is going to be much more complicated than I
> > originally thought about. It will need the generic pt stuff as the
> > hitless page table manipulations we are contemplating here are pretty
> > complex.
> >
> > Jason
>
> To ensure that I understand your concern properly, the complexity of
> handling hitless page manipulations is because guests can convert
> memory at smaller granularity than the physical page size used by the
> host software.

Yes

You want to, say, break up a 1G private page into 2M chunks and then
hitlessly replace a 2M chunk with a shared one. Unlike the MM side you
don't really want to just non-present the whole thing and fault it
back in. So it is more complex.

We already plan to build the 1G -> 2M transformation for dirty
tracking, the atomic replace will be a further operation.

In the short term you could experiment on this using unmap/remap, but
that isn't really going to work well as a solution. You really can't
unmap an entire 1G page just to poke a 2M hole into it without
disrupting the guest DMA.

Fortunately the work needed to resolve this is well in progress, I had
not realized there was a guest memfd connection, but this is good to
know.
It means more people will be interested in helping :) :)

> Complexity remains the same irrespective of whether kvm/guest_memfd
> is notifying iommu driver to unmap converted ranges or if its
> userspace notifying iommu driver.

You don't want to use the verb 'unmap'. What you want is a verb more
like 'refresh' which can only make sense in the kernel. 'refresh'
would cause the iommu copy of the physical addresses to update to
match the current data in the guestmemfd.

So the private/shared sequence would be like:

 1) Guest asks for private -> shared
 2) Guestmemfd figures out what the new physicals should be for the
    shared
 3) Guestmemfd does 'refresh' on all of its notifiers. This will pick
    up the new shared physical and remove the old private physical
    from the iommus
 4) Guestmemfd can be sure nothing in iommu is touching the old
    memory.

There are some other small considerations that increase complexity,
like AMD needs an IOPTE boundary at any transition between
shared/private. This is a current active bug in the AMD stuff, fixing
it automatically and preserving huge pages via special guestmemfd
support sounds very appealing to me.

Jason