> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Tuesday, April 6, 2021 7:35 AM
>
> On Fri, Apr 02, 2021 at 07:30:23AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Friday, April 2, 2021 12:04 AM
> > >
> > > On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote:
> > >
> > > > DMA page faults are delivered to the root complex via page request
> > > > messages and are per-device according to the PCIe spec. The page
> > > > request handling flow is:
> > > >
> > > > 1) iommu driver receives a page request from the device
> > > > 2) iommu driver parses the page request message and gets the RID,
> > > >    PASID, faulted page, requested permissions, etc.
> > > > 3) iommu driver triggers the fault handler registered by the device
> > > >    driver with iommu_report_device_fault()
> > >
> > > This seems confused.
> > >
> > > The PASID should define how to handle the page fault, not the driver.
> > >
> > > I don't remember any device specific actions in ATS, so what is the
> > > driver supposed to do?
> > >
> > > > 4) device driver's fault handler signals an event FD to notify
> > > >    userspace to fetch the information about the page fault. In the
> > > >    VM case, inject the page fault into the VM and let the guest
> > > >    solve it.
> > >
> > > If the PASID is set to 'report page fault to userspace' then some
> > > event should come out of /dev/ioasid, or be reported to a linked
> > > eventfd, or whatever.
> > >
> > > If the PASID is set to 'SVM' then the fault should be passed to
> > > handle_mm_fault
> > >
> > > And so on.
> > >
> > > Userspace chooses what happens based on how they configure the PASID
> > > through /dev/ioasid.
> > >
> > > Why would a device driver get involved here?
> > >
> > > > Eric has sent the below series for page fault reporting for a VM
> > > > with a passthru device:
> > > > https://lore.kernel.org/kvm/20210223210625.604517-5-eric.auger@xxxxxxxxxx/
> > >
> > > It certainly should not be in vfio pci. Everything using a PASID needs
> > > this infrastructure, VDPA, mdev, PCI, CXL, etc.
> >
> > This touches an interesting fact:
> >
> > The fault may be triggered in either the 1st-level or the 2nd-level
> > page table when nested translation is enabled (the vSVA case). The
> > 1st-level is bound by userspace, which therefore needs to receive the
> > fault event. The 2nd-level is managed by VFIO (or vDPA), which needs
> > to fix the fault in the kernel (e.g. find the HVA for the faulting
> > GPA, call handle_mm_fault, and map GPA->HPA in the IOMMU). Yi's
> > current proposal lets VFIO register the device fault handler, which
> > then forwards the event through /dev/ioasid to userspace only if it
> > is a 1st-level fault. Are you suggesting a pgtable-centric fault
> > reporting mechanism with separate handlers per level, i.e. letting
> > VFIO register a handler only for 2nd-level faults and /dev/ioasid
> > register a handler for 1st-level faults?
>
> This I'm struggling to understand. /dev/ioasid should handle all the
> fault cases, why would VFIO ever get involved in a fault? What would
> it even do?
>
> If the fault needs to be fixed in the hypervisor then it is a kernel
> fault and it does handle_mm_fault. This absolutely should not be in
> VFIO or VDPA

With nested translation it is GVA->GPA->HPA. The kernel needs to fix
faults related to GPA->HPA (managed by VFIO/VDPA), while handle_mm_fault
only handles HVA->HPA.

In this case, the 2nd-level page fault is expected to be delivered to
VFIO/VDPA first, which then finds the HVA for the faulting GPA, calls
handle_mm_fault to fix HVA->HPA, and then calls iommu_map to fix
GPA->HPA in the IOMMU page table. This is exactly like how a CPU EPT
violation is handled.
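To make that concrete, a rough kernel-side sketch of the 2nd-level fixup
is below. This is only an illustration, not code from Yi's series:
ioasid_find_hva() is a made-up placeholder for wherever the
user-registered GPA->HVA mapping is recorded (what VFIO_IOMMU_MAP_DMA
captures today), and the GUP call is just one way of driving
handle_mm_fault for that HVA.

/*
 * Sketch only (not from the posted series).  ioasid_find_hva() is a
 * hypothetical helper standing in for the user-registered GPA->HVA
 * bookkeeping; unpinning on later unmap is omitted.
 */
#include <linux/iommu.h>
#include <linux/mm.h>

static int fixup_2nd_level_fault(struct iommu_domain *domain,
				 struct mm_struct *mm,
				 unsigned long gpa, bool write)
{
	int prot = IOMMU_READ | (write ? IOMMU_WRITE : 0);
	unsigned long hva;
	struct page *page;
	long npinned;
	int ret;

	/* 1) faulting GPA -> HVA, from the mapping userspace registered */
	hva = ioasid_find_hva(domain, gpa);		/* hypothetical */
	if (!hva)
		return -EFAULT;

	/* 2) fault in the backing page; GUP drives handle_mm_fault() */
	mmap_read_lock(mm);
	npinned = pin_user_pages_remote(mm, hva, 1, write ? FOLL_WRITE : 0,
					&page, NULL, NULL);
	mmap_read_unlock(mm);
	if (npinned != 1)
		return npinned < 0 ? npinned : -EFAULT;

	/* 3) GPA -> HPA into the 2nd-level IOMMU page table */
	ret = iommu_map(domain, gpa, page_to_phys(page), PAGE_SIZE, prot);
	if (ret)
		unpin_user_page(page);
	return ret;
}

The only VFIO/VDPA-specific piece there is the GPA->HVA lookup; the
pinning and the iommu_map are generic.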
> If the fault needs to be fixed in the guest, then it needs to be
> delivered over /dev/ioasid in some way and injected into the
> vIOMMU. VFIO and VDPA have nothing to do with vIOMMU driver in qemu.
>
> You need to have an interface under /dev/ioasid to create both page
> table levels and part of that will be to tell the kernel what VA is
> mapped and how to handle faults.

VFIO/VDPA already have their own interface to manage GPA->HPA mappings.
Why do we want to duplicate it in /dev/ioasid?

Thanks
Kevin