On 02.05.23 19:46, Jason Gunthorpe wrote:
On Tue, May 02, 2023 at 06:32:23PM +0200, David Hildenbrand wrote:
On 02.05.23 18:19, Jason Gunthorpe wrote:
On Tue, May 02, 2023 at 06:12:39PM +0200, David Hildenbrand wrote:
It misses the general architectural point why we have all these
shootdown mechanisms in other places - places are not supposed to make
these kinds of assumptions. When the userspace unplugs the memory from
KVM or unmaps it from VFIO it is not still being accessed by the
kernel.
Yes. Like having memory in a vfio iommu v1 and doing the same (mremap,
munmap, MADV_DONTNEED, ...). Which is why we disable MADV_DONTNEED (e.g.,
virtio-balloon) in QEMU with vfio.
That is different, VFIO has its own contract how it consumes the
memory from the MM and VFIO breaks all this stuff.
But when you tell VFIO to unmap the memory it doesn't keep accessing
it in the background like this does.
To me, this is similar to when QEMU (user space) triggers
KVM_S390_ZPCIOP_DEREG_AEN, to tell KVM to disable AIF and stop using the
page: (1) when triggered by the guest explicitly, (2) when resetting the VM,
(3) when resetting the virtual PCI device / configuration.
The interrupt gets unregistered from HW (which stops using the page) and the
pages get unpinned. The pages are no longer used.
I guess I am still missing (a) how this is fundamentally different (b) how
it could be done differently.
It uses an address that is already scoped within the KVM memory map
and uses KVM's gpa_to_gfn() to translate it to some pinnable page.
It is not some independent thing like VFIO, it is explicitly scoped
within the existing KVM structure and it does not follow any mutations
that are done to the gpa map through the usual KVM APIs.
Right, it consumes guest physical addresses that are translated via the KVM memslots.
Agreed that it does not (and possibly cannot easily) update the hardware if the KVM
mapping (memslots) ever changes.
I guess it's also not documented that this is not supported.
I'd really be happy to learn what a better approach would look like that does
not use longterm pinnings.
Sounds like the FW sadly needs pinnings. This is why I said it looks
like DMA. If possible it would be better to get the pinning through
VFIO, e.g. as an mdev.
Otherwise, it would have been cleaner if this was divorced from KVM
and took in a direct user pointer, then maybe you could make the
argument that it is its own thing with its own lifetime rules. (then you are
kind of making your own mdev)
It would be cleaner if user space would translate the GPA to a HVA and provide
that, agreed ...
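Roughly, the user-space side of that could look like the sketch below.
It only illustrates the idea, it is not what QEMU or arch/s390/kvm/pci.c
does today: "struct memslot" and gpa_to_hva() are hypothetical names for a
user-space copy of the memslot layout, and the interface that would then
consume the resulting HVA does not exist.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of one KVM memslot (illustrative names). */
struct memslot {
	uint64_t gpa;	/* guest physical base address of the slot */
	uint64_t size;	/* slot size in bytes */
	void *hva;	/* host virtual address backing the slot */
};

/*
 * Translate a guest physical address to a host virtual address by
 * walking the slot table. Returns NULL if no slot covers the GPA.
 */
static void *gpa_to_hva(const struct memslot *slots, size_t n, uint64_t gpa)
{
	for (size_t i = 0; i < n; i++) {
		/* Compare the offset, so gpa + size cannot overflow. */
		if (gpa >= slots[i].gpa && gpa - slots[i].gpa < slots[i].size)
			return (char *)slots[i].hva + (gpa - slots[i].gpa);
	}
	return NULL;
}
```

User space would do the lookup once and hand the plain pointer to the
kernel, so the pinned page would no longer be tied to later memslot changes.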
Or, perhaps, this is really part of some radical "irqfd" that we've
been on and off talking about specifically to get this area of
interrupt bypass uAPI'd properly..
Most probably. It's one of these very special cases ... thankfully:
$ git grep -i longterm | grep kvm
arch/s390/kvm/pci.c: npages = pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM, pages);
arch/s390/kvm/pci.c: npages = pin_user_pages_fast(hva, 1, FOLL_WRITE | FOLL_LONGTERM,
--
Thanks,
David / dhildenb