Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

David Hildenbrand <david@xxxxxxxxxx> · Mon, 8 Feb 2021 11:37:24 +0100

On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote:

-----Original Message-----
From: owner-linux-mm@xxxxxxxxx [mailto:owner-linux-mm@xxxxxxxxx] On Behalf Of
David Hildenbrand
Sent: Monday, February 8, 2021 9:22 PM
To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>; Matthew Wilcox
<willy@xxxxxxxxxxxxx>
Cc: Wangzhou (B) <wangzhou1@xxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-api@xxxxxxxxxxxxxxx; Andrew
Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>;
gregkh@xxxxxxxxxxxxxxxxxxx; jgg@xxxxxxxx; kevin.tian@xxxxxxxxx;
jean-philippe@xxxxxxxxxx; eric.auger@xxxxxxxxxx; Liguozhu (Kenneth)
<liguozhu@xxxxxxxxxxxxx>; zhangfei.gao@xxxxxxxxxx; chensihang (A)
<chensihang1@xxxxxxxxxxxxx>
Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
pin

On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:

-----Original Message-----
From: owner-linux-mm@xxxxxxxxx [mailto:owner-linux-mm@xxxxxxxxx] On Behalf Of
Matthew Wilcox
Sent: Monday, February 8, 2021 2:31 PM
To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>

On Sun, Feb 07, 2021 at 10:24:28PM +0000, Song Bao Hua (Barry Song) wrote:
In high-performance I/O cases, accelerators might want to perform
I/O on memory without taking I/O page faults, which can dramatically
increase latency. The current memory-related APIs cannot meet this
requirement: e.g. mlock can only prevent memory from being swapped out
to a backing device, and page migration can still trigger I/O page faults.

Well ... we have two requirements.  The application wants to not take
page faults.  The system wants to move the application to a different
NUMA node in order to optimise overall performance.  Why should the
application's desires take precedence over the kernel's desires?  And why
should it be done this way rather than by the sysadmin using numactl to
lock the application to a particular node?

The NUMA balancer is just one of many reasons for page migration. Even a
single alloc_pages() call can cause memory migration within a single NUMA
node or on a UMA system.

The other reasons for page migration include but are not limited to:
* memory move due to CMA
* memory move due to huge pages creation

We can hardly ask users to disable compaction, CMA and huge pages
for the whole system.

You're dodging the question.  Should the CMA allocation fail because
another application is using SVA?

I would say no.

I would say no as well.

When the IOMMU is enabled, CMA has almost only one user, the IOMMU
driver, because other drivers rely on the IOMMU to use non-contiguous
memory even though they still call dma_alloc_coherent().

In the IOMMU driver, dma_alloc_coherent() is called during initialization
and there are no new allocations afterwards, so it has no runtime
impact on SVA performance. Even if there are new allocations,
CMA will fall back to the general alloc_pages(), and IOMMU drivers
mostly allocate small amounts of memory for command queues.

So I would say general compound pages and huge pages, especially
transparent huge pages, are bigger concerns than CMA for
internal page migration within one NUMA node.

Unlike CMA, the general alloc_pages() can get memory by moving
pages other than the pinned ones.

And there is no guarantee that we can always bind the memory of
SVA applications to a single NUMA node, so NUMA balancing is
still a concern.

But I agree we need a way to make CMA succeed while userspace
pages are pinned. Since pinning has gone viral in many drivers, I
assume there is a way to handle this. Otherwise, APIs like
V4L2_MEMORY_USERPTR[1] could make CMA fail, as there
is no guarantee that userspace will allocate unmovable memory
and no guarantee that the fallback path, alloc_pages(), can
succeed when allocating a large amount of memory.

Long term pinnings cannot go onto CMA-reserved memory, and there is
similar work to also fix ZONE_MOVABLE in that regard.

https://lkml.kernel.org/r/20210125194751.1275316-1-pasha.tatashin@soleen.com

One of the reasons I detest using long term pinning of pages where it
could be avoided. Take VFIO and RDMA as an example: these things
currently can't work without them.

What I read here: "DMA performance will be affected severely". That does
not sound like a compelling argument to me for long term pinnings.
Please find another way to achieve the same goal without long term
pinnings controlled by user space - e.g., controlling when migration
actually happens.

For example, CMA/alloc_contig_range()/memory unplug are corner cases
that happen rarely, you shouldn't have to worry about them messing with
your DMA performance.

I agree CMA/alloc_contig_range()/memory unplug would be corner cases.
The major cases would be THP and NUMA balancing. We could disable them
entirely, but it does not seem sensible to do that only because there is
a process using SVA in the system.

Can't you use huge pages in your application that uses SVA and prevent 
THP/NUMA balancing from kicking in?
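
As a rough userspace sketch of what that could look like (the helper name
sva_buf_alloc() is hypothetical and not part of any patch in this thread):
map the DMA buffer with explicit hugetlb pages, and if none are reserved,
fall back to normal pages with THP disabled for that range. This only
narrows the cases in which pages move. It is not a pin and gives no
guarantee against migration, and NUMA placement would still have to be
handled separately, e.g. with numactl.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, for illustration only.
 * Maps `len` bytes (assumed to be a multiple of the huge page size)
 * for use as an SVA/DMA buffer.
 * Sets *used_hugetlb to 1 if explicit hugetlb pages back the mapping,
 * 0 if it fell back to normal pages. Returns NULL on failure.
 */
static void *sva_buf_alloc(size_t len, int *used_hugetlb)
{
	void *p;

	/* First choice: explicit huge pages, which THP never splits or collapses. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p != MAP_FAILED) {
		*used_hugetlb = 1;
		return p;
	}

	/* Fallback: normal pages (e.g. no huge pages reserved on this system). */
	*used_hugetlb = 0;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/*
	 * Ask the kernel not to use THP for this range. The call may fail
	 * with EINVAL on kernels built without THP; that is harmless here.
	 */
	(void)madvise(p, len, MADV_NOHUGEPAGE);
	return p;
}
```

A caller would do something like sva_buf_alloc(2UL << 20, &used) and later
munmap() the buffer with the same length.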

--
Thanks,

David / dhildenb