Re: [RFC PATCH 0/3] support large folio for mlock

David Hildenbrand <david@xxxxxxxxxx> · Mon, 10 Jul 2023 11:57:50 +0200

On 10.07.23 11:43, Yin, Fengwei wrote:
Hi David,

On 7/10/2023 5:32 PM, David Hildenbrand wrote:
On 09.07.23 15:25, Yin, Fengwei wrote:

On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
I would be tempted to allocate memory & copy to the new mlocked VMA.
The old folio will go on the deferred_list and be split later, or its
valid parts will be written to swap and then it can be freed.
If the large folio split fails because of GUP pages, can we do the
copy here?

Let's say the GUP page is the target of an ongoing DMA operation. If we
allocate a new page and copy the GUP page content to the new page, the
data in the new page can be corrupted.

No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing the exclusive flag will fail if the page may be pinned.
Thanks a lot for the clarification.

So my understanding is that if large folio splitting fails, it's not always
true that we can allocate new folios, copy original large folio content to
new folios, remove original large folio from VMA and map the new folios to
VMA (like it's only true if original large folio is marked as maybe shared).

While it might work in many cases, there are some corner cases where it 
won't work.

So to summarize

(1) THP are transparent and should not result in arbitrary syscall
    failures.
(2) Splitting a THP might fail at random points in time either due to
    GUP pins or due to speculative page references (including
    speculative GUP pins).
(3) Replacing an exclusive anon page that may be pinned will result in
    memory corruption.
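
Point (3), together with the PageAnonExclusive rule quoted earlier, could be modeled roughly as below. This is a user-space sketch only, not kernel code: `struct model_page`, `model_try_share()` and `model_may_replace()` are made-up names, and the sketch assumes that "may be pinned" is the only reason clearing the exclusive flag can fail.

```c
#include <stdbool.h>

/* Hypothetical user-space model of an anon page; not a kernel structure. */
struct model_page {
	bool anon_exclusive;	/* models PageAnonExclusive */
	bool maybe_pinned;	/* models "the page may have GUP pins" */
};

/*
 * Try to mark the page maybe shared by clearing the exclusive flag.
 * Returns false and leaves the flag set if the page may be pinned.
 */
static bool model_try_share(struct model_page *page)
{
	if (page->maybe_pinned)
		return false;
	page->anon_exclusive = false;
	return true;
}

/* Returns true if the page may be copied to a new page and replaced. */
static bool model_may_replace(struct model_page *page)
{
	/* Maybe-shared pages may always be replaced. */
	if (!page->anon_exclusive)
		return true;
	/* Exclusive pages only after they were successfully marked shared. */
	return model_try_share(page);
}
```

In this model an exclusive page that may be pinned is never replaced, which is exactly the case that would otherwise corrupt memory.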

So we can try to split any THP that crosses VMA borders on VMA
modifications (split due to munmap, mremap, madvise, mprotect, mlock,
...), but it's not guaranteed to work due to (2). And we can try to
replace such pages, but it's not guaranteed to be allowed due to (3).

And as it's all transparent, we cannot fail the syscall, per (1).

For the other cases that Willy and I discussed (split on VMA 
modifications after fork()), we can at least always replace the anon page.

<details>

What always works is putting the THP on the deferred split queue to see
if we can split it later. The deferred split queue is a bit suboptimal 
right now, because it requires the (sub)page mapcounts to detect whether 
the folio is partially mapped vs. fully mapped. If we want to get rid of 
that, we have to come up with something reasonable.
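
Putting the pieces of this mail together, the fallback order on a VMA modification might look like the sketch below. It is a simplified user-space model, not a proposal for actual kernel code: `struct model_thp`, `enum model_outcome` and `model_handle_vma_change()` are hypothetical, and the model assumes that allocating replacement pages never fails.

```c
#include <stdbool.h>

/* Hypothetical outcomes; note that there is no "fail the syscall" case. */
enum model_outcome {
	MODEL_SPLIT,	/* the THP was split right away */
	MODEL_REPLACED,	/* content was copied to newly allocated pages */
	MODEL_DEFERRED,	/* queued on the deferred split queue for later */
};

/* Hypothetical user-space model of a THP that crosses a VMA border. */
struct model_thp {
	bool has_extra_refs;	/* GUP pin or speculative reference present */
	bool anon_exclusive;	/* models PageAnonExclusive */
	bool maybe_pinned;	/* models "the page may have GUP pins" */
};

static enum model_outcome model_handle_vma_change(const struct model_thp *thp)
{
	/* Splitting only works without extra references, see (2). */
	if (!thp->has_extra_refs)
		return MODEL_SPLIT;

	/*
	 * Replacing is allowed for maybe-shared pages, or for exclusive
	 * pages whose exclusive flag can be cleared, see (3).
	 */
	if (!thp->anon_exclusive || !thp->maybe_pinned)
		return MODEL_REPLACED;

	/* Everything is transparent, so defer instead of failing, see (1). */
	return MODEL_DEFERRED;
}
```

The design point is that the function has no error return: the deferred split queue is the catch-all that keeps the operation transparent to user space.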

I was wondering if we could have an optimized deferred split queue
that only conditionally splits: do an rmap walk and detect if (a) each
page of the folio is still mapped and (b) the folio does not cross a VMA.
If both are met, one could skip the deferred split. That needs a bit of
thought -- but we're already doing an rmap walk when splitting, so
scanning which parts are actually mapped does not sound too weird.
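
A rough user-space sketch of that check, assuming the rmap walk has already collected its findings into a small summary structure. `struct model_walk` and `model_can_skip_split()` are hypothetical names for illustration; nothing like them exists in the kernel as far as this mail is concerned.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one rmap walk over a folio; not a kernel type. */
struct model_walk {
	const bool *page_mapped;	/* page_mapped[i]: page i was seen mapped */
	size_t nr_pages;		/* number of pages in the folio */
	bool crosses_vma;		/* some mapping straddles a VMA border */
};

/* Returns true if the deferred split of this folio could be skipped. */
static bool model_can_skip_split(const struct model_walk *walk)
{
	/* (b) the folio must not cross a VMA */
	if (walk->crosses_vma)
		return false;

	/* (a) each page of the folio must still be mapped */
	for (size_t i = 0; i < walk->nr_pages; i++) {
		if (!walk->page_mapped[i])
			return false;
	}
	return true;
}
```

Only when both conditions hold does the model skip the split; any unmapped page or any VMA crossing keeps the folio a split candidate.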

</details>

--
Cheers,

David / dhildenb