On Wed, 9 Mar 2022, Yang Shi wrote:

> > Introduce the main madvise collapse batched logic, including the overall
> > locking strategy.  Stubs for individual batched actions, such as
> > scanning pmds in batch, will be added later in the series.
> >
> > Note the main benefit from doing all this work in a batched manner is
> > that __madvise_collapse_pmd_batch() (stubbed out) can be called inside
> > a single mmap_lock write.
>
> I don't get why this is preferred.  Isn't it preferable to minimize the
> scope of the mmap_lock write?  Assuming you batch a large number of
> PMDs, MADV_COLLAPSE may hold the mmap_lock write for a long time; it
> doesn't seem like that could scale.
>

One concern might be the queueing of the read locks needed for page
faults behind a collapser of a long range of memory that is otherwise
looping and repeatedly taking the write lock.  Concurrent page faults
are what I think we should be optimizing for, and without data I don't
know which approach has the smaller impact on them.

Do you have any ideas, as a general rule of thumb, for what would be
optimal here: collapsing one page at a time vs. handling multiple
collapses per mmap_lock write, so that readers aren't constantly
getting queued?

The easiest answer would be to not do batching at all and leave the
impact on readers up to the userspace doing the MADV_COLLAPSE :)  I was
wondering, however, if there is a better default behavior we could
implement in the kernel.
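
To make the trade-off concrete, here is a minimal sketch of the two
strategies being weighed.  collapse_one_pmd() is a made-up placeholder
for the per-PMD collapse work, not a function from this series; the
locking calls themselves (mmap_write_lock/unlock, cond_resched) are the
real kernel APIs.

#include <linux/mm.h>
#include <linux/mmap_lock.h>
#include <linux/huge_mm.h>
#include <linux/sched.h>

/* (a) Batched: every PMD in the range collapsed under one write lock. */
static void collapse_range_batched(struct mm_struct *mm,
				   unsigned long start, unsigned long end)
{
	unsigned long addr;

	mmap_write_lock(mm);
	for (addr = start; addr < end; addr += HPAGE_PMD_SIZE)
		collapse_one_pmd(mm, addr);	/* hypothetical helper */
	mmap_write_unlock(mm);	/* fault readers queue for the whole range */
}

/* (b) One at a time: drop and retake the lock per PMD so readers can
 * interleave with the collapser. */
static void collapse_range_one_by_one(struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += HPAGE_PMD_SIZE) {
		mmap_write_lock(mm);
		collapse_one_pmd(mm, addr);	/* hypothetical helper */
		mmap_write_unlock(mm);
		cond_resched();	/* window for queued page-fault readers */
	}
}

With (a), readers block on the rwsem for the duration of the whole
range; with (b), each unlock gives queued readers a chance to run, at
the cost of re-acquiring the write lock per PMD, which is the scaling
question above.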