RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

"Song Bao Hua (Barry Song)" <song.bao.hua@xxxxxxxxxxxxx> · Mon, 8 Feb 2021 20:35:31 +0000

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@xxxxxxxx]
> Sent: Tuesday, February 9, 2021 7:34 AM
> To: David Hildenbrand <david@xxxxxxxxxx>
> Cc: Wangzhou (B) <wangzhou1@xxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-api@xxxxxxxxxxxxxxx; Andrew
> Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>;
> gregkh@xxxxxxxxxxxxxxxxxxx; Song Bao Hua (Barry Song)
> <song.bao.hua@xxxxxxxxxxxxx>; kevin.tian@xxxxxxxxx;
> jean-philippe@xxxxxxxxxx; eric.auger@xxxxxxxxxx; Liguozhu (Kenneth)
> <liguozhu@xxxxxxxxxxxxx>; zhangfei.gao@xxxxxxxxxx; chensihang (A)
> <chensihang1@xxxxxxxxxxxxx>
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Mon, Feb 08, 2021 at 09:14:28AM +0100, David Hildenbrand wrote:
> 
> > People are constantly struggling with the effects of long term pinnings
> > under user space control, like we already have with vfio and RDMA.
> >
> > And here we are, adding yet another, easier way to mess with core MM in the
> > same way. This feels like a step backwards to me.
> 
> Yes, this seems like a very poor candidate to be a system call in this
> format. Much too narrow, poorly specified, and possibly security
> implications to allow any process whatsoever to pin memory.
> 
> I keep encouraging people to explore a standard shared SVA interface
> that can cover all these topics (and no, uaccel is not that
> interface), that seems much more natural.
> 
> I still haven't seen an explanation why DMA is so special here,
> migration and so forth jitter the CPU too, environments that care
> about jitter have to turn this stuff off.

This paper has a good explanation:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091

The main reason is that a CPU page fault is delivered directly to the
CPU, and we have many CPUs. An I/O page fault takes a different path,
which means much higher latency, 3-80x slower than a CPU page fault:
events in hardware queue -> interrupt -> CPU processes the page fault
-> return events to IOMMU/device -> continue I/O.

Copied from the paper:

If the IOMMU's page table walker fails to find the desired
translation in the page table, it sends an ATS response to
the GPU notifying it of this failure. This in turn corresponds
to a page fault. In response, the GPU sends another request to
the IOMMU called a Peripheral Page Request (PPR). The IOMMU
places this request in a memory-mapped queue and raises an
interrupt on the CPU. Multiple PPR requests can be queued
before the CPU is interrupted. The OS must have a suitable
IOMMU driver to process this interrupt and the queued PPR
requests. In Linux, while in an interrupt context, the driver
pulls PPR requests from the queue and places them in a work-queue
for later processing. Presumably this design decision was made
to minimize the time spent executing in an interrupt context,
where lower priority interrupts would be disabled. At a later
time, an OS worker-thread calls back into the driver to process
page fault requests in the work-queue. Once the requests are
serviced, the driver notifies the IOMMU. In turn, the IOMMU
notifies the GPU. The GPU then sends another ATS request to
retry the translation for the original faulting address.

Comparison with CPU: On the CPU, a hardware exception is
raised on a page fault, which immediately switches to the
OS. In most cases in Linux, this routine services the page
fault directly, instead of queuing it for later processing.
Contrast this with a page fault from an accelerator, where
the IOMMU has to interrupt the CPU to request service on
its behalf, and also note the several back-and-forth messages
between the accelerator, the IOMMU, and the CPU. Furthermore,
page faults on the CPU are generally handled one at a time
on the CPU, while for the GPU they are batched by the IOMMU
and OS work-queue mechanism.
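
To make the two paths described above easier to compare, here is a
rough user-space model in Python. It is only a sketch of the flow the
paper describes, not kernel code: the class and function names
(IommuModel, device_fault, interrupt_handler, worker_thread,
cpu_page_fault) are all made up for illustration, and "servicing" a
fault is reduced to adding the address to a set.

```python
from collections import deque


def cpu_page_fault(addr, page_table):
    """CPU path: the exception handler services the fault directly."""
    page_table.add(addr)
    return ["exception", "service fault", "retry access"]


class IommuModel:
    """Hypothetical model of the deferred I/O page fault path."""

    def __init__(self):
        self.ppr_queue = deque()   # memory-mapped PPR queue filled by the IOMMU
        self.work_queue = deque()  # OS work-queue, drained later by a worker
        self.page_table = set()    # addresses that have a valid translation
        self.trace = []            # ordered list of steps, for inspection

    def device_fault(self, addr):
        # The ATS translation failed, so the device sends a PPR and the
        # IOMMU places it in the queue. Several PPRs may accumulate here.
        self.trace.append(f"ATS miss {addr:#x}")
        self.ppr_queue.append(addr)
        self.trace.append(f"PPR queued {addr:#x}")

    def interrupt_handler(self):
        # Interrupt context: only move requests to the work-queue,
        # nothing is serviced yet.
        self.trace.append("interrupt")
        while self.ppr_queue:
            self.work_queue.append(self.ppr_queue.popleft())

    def worker_thread(self):
        # Later, in process context: service each fault, then notify the
        # IOMMU, which notifies the device so it can retry the ATS request.
        retried = []
        while self.work_queue:
            addr = self.work_queue.popleft()
            self.page_table.add(addr)
            self.trace.append(f"serviced {addr:#x}")
            retried.append(addr)
        if retried:
            self.trace.append("notify IOMMU")
            self.trace.append("notify device")
            self.trace.extend(f"ATS retry {a:#x}" for a in retried)
        return retried
```

Calling device_fault() a few times, then interrupt_handler(), then
worker_thread() shows that no fault is resolved until the worker runs,
while cpu_page_fault() resolves its fault in the same call. The model
says nothing about real latencies; it only shows the extra hops.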

> 
> Jason

Thanks
Barry