> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Wednesday, December 15, 2021 12:27 AM
>
> > > > + * complete any such outstanding operations prior to completing
> > > > + * the transition to the NDMA state. The NDMA device_state
> > >
> > > Reading this as you wrote it and I suddenly have a doubt about the PRI
> > > use case. Is it reasonable that the kernel driver will block on NDMA
> > > waiting for another userspace thread to resolve any outstanding PRIs?
> > >
> > > Can that allow userspace to deadlock the kernel or device? Is there an
> > > alternative?
> >
> > I'd hope we could avoid deadlock in the kernel, but it seems trickier
> > for userspace to be waiting on a write(2) operation to the device while
> > also handling page request events for that same device. Is this
> > something more like a pending transaction bit where userspace asks the
> > device to go quiescent and polls for that to occur?
>
> Hum. I'm still looking into this question, but some further thoughts.
>
> PRI doesn't do DMA, it just transfers a physical address into the PCI
> device's cache that can be later used with DMA.
>
> PRI also doesn't imply the vPRI Intel is talking about.

This is correct. PRI can happen on either a kernel-managed page table or
a user-managed page table. Only in the latter case does the PRI need to
be forwarded to userspace for fixup.

>
> For PRI controlled by the hypervisor, it is completely reasonable that
> NDMA returns synchronously after the PRI and the DMA that triggered it
> completes. The VMM would have to understand this and ensure it doesn't
> block the kernel's fault path while going to NDMA eg with userfaultfd
> or something else crazy.

I don't think there would be any problem with this usage.

>
> The other reasonable option is that NDMA cancels the DMA that
> triggered the PRI and simply doesn't care how the PRI is completed
> after NDMA returns.
>
> The latter is interesting because it is a possible better path to solve
> the vPRI problem Intel brought up. Waiting for the VCPU is just asking
> for a DOS, if NDMA can cancel the DMAs we can then just directly fail

Or cancel and save the context, so the aborted transaction can be
resumed on the target node.

> the open PRI in the hypervisor and we don't need to care about the
> VCPU. Some mess to fixup in the vIOMMU protocol on resume, but the
> resume'd device simply issues a new DMA with an empty ATS cache and
> does a new PRI.
>
> It is uncertain enough that qemu should not support vPRI with
> migration until we define protocol(s) and a cap flag to say the device
> supports it.
>

However, this is too restrictive. It's an ideal option, but in reality it
implies that the device can preempt and recover an in-flight request at
any granularity (given that PRI can occur at any time). I was clearly
told by hardware guys how challenging it is to achieve this goal on
various IPs, which is also the reason why the draining operation on most
devices today is more or less of the waiting flavor.

btw, can you elaborate on the DOS concern? The device is assigned to a
user application, which has one thread (the migration thread) blocked on
another thread (the vCPU thread) when transitioning the device to the
NDMA state. What service outside of this application is denied here?

Thanks
Kevin
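
PS: to make the "pending transaction bit" flavor Alex mentions above a
little more concrete, below is a rough userspace-side sketch. This is
only an illustration under assumptions: the VFIO_DEVICE_STATE_NDMA
spelling follows what this series proposes (it is not in the uAPI
today), and the readback semantic (the bit only reads back as set once
the device has actually quiesced) is purely hypothetical, not something
any current protocol defines.

/*
 * Rough sketch only.  Assumes the NDMA bit proposed in this series and
 * a hypothetical readback semantic where the driver reflects the bit in
 * device_state only once all outstanding DMA (and any PRI it triggered)
 * has drained.
 */
#include <stddef.h>
#include <unistd.h>
#include <linux/types.h>
#include <linux/vfio.h>

#ifndef VFIO_DEVICE_STATE_NDMA
#define VFIO_DEVICE_STATE_NDMA  (1 << 3)  /* as proposed, not yet in uAPI */
#endif

/* mig_off: offset of the v1 migration region within the device fd,
 * as reported by VFIO_DEVICE_GET_REGION_INFO */
static int vmm_enter_ndma(int device_fd, off_t mig_off)
{
        off_t ds = mig_off +
                   offsetof(struct vfio_device_migration_info, device_state);
        __u32 state;

        if (pread(device_fd, &state, sizeof(state), ds) != sizeof(state))
                return -1;

        /* ask the driver to stop the device from issuing new DMA */
        state |= VFIO_DEVICE_STATE_NDMA;
        if (pwrite(device_fd, &state, sizeof(state), ds) != sizeof(state))
                return -1;

        /* poll instead of blocking inside a single write(2), so other
         * threads stay free to service page request events meanwhile */
        do {
                if (pread(device_fd, &state, sizeof(state), ds) != sizeof(state))
                        return -1;
        } while (!(state & VFIO_DEVICE_STATE_NDMA));

        return 0;
}

Whether the driver blocks in the write() or userspace polls like this is
exactly the open question above; the sketch just shows that the polling
flavor keeps the VMM free to handle PRI from another thread in the
meantime.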