On Wed, 9 Mar 2022, Yang Shi wrote:

> > Introduce the main madvise collapse batched logic, including the overall
> > locking strategy.  Stubs for individual batched actions, such as
> > scanning pmds in batch, will be added later in the series.
> >
> > Note the main benefit from doing all this work in a batched manner is
> > that __madvise_collapse_pmd_batch() (stubbed out) can be called inside
> > a single mmap_lock write.
>
> I don't get why this is preferred.  Isn't it preferable to minimize the
> scope of the mmap_lock write?  Assuming you batch a large number of
> PMDs, MADV_COLLAPSE may hold the mmap_lock write for a long time; it
> doesn't seem like that could scale.
>

One concern might be the queueing of the read locks needed for page
faults behind a collapser of a long range of memory that is otherwise
looping and repeatedly taking the write lock.  Concurrent page faults
are what I think we should be optimizing for, and without data I don't
know which approach has the smaller impact on them.

Do you have any ideas, as a general rule of thumb, for what would be
optimal here: collapsing one page at a time vs. handling multiple
collapses per mmap_lock write, so that readers aren't constantly
getting queued?

The easiest answer would be to not do batching at all and leave the
impact on readers up to the userspace doing the MADV_COLLAPSE :)  I was
wondering, however, if there is a better default behavior we could
implement in the kernel.
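
To make the trade-off concrete, here is a minimal sketch of the two
strategies being weighed.  collapse_one_pmd() is a made-up placeholder
for the per-PMD collapse work, not a function from this series; the
locking calls themselves (mmap_write_lock/unlock, cond_resched) are the
real kernel APIs.

#include <linux/mm.h>
#include <linux/mmap_lock.h>
#include <linux/huge_mm.h>
#include <linux/sched.h>

/* (a) Batched: every PMD in the range collapsed under one write lock. */
static void collapse_range_batched(struct mm_struct *mm,
				   unsigned long start, unsigned long end)
{
	unsigned long addr;

	mmap_write_lock(mm);
	for (addr = start; addr < end; addr += HPAGE_PMD_SIZE)
		collapse_one_pmd(mm, addr);	/* hypothetical helper */
	mmap_write_unlock(mm);	/* fault readers queue for the whole range */
}

/* (b) One at a time: drop and retake the lock per PMD so readers can
 * interleave with the collapser. */
static void collapse_range_one_by_one(struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += HPAGE_PMD_SIZE) {
		mmap_write_lock(mm);
		collapse_one_pmd(mm, addr);	/* hypothetical helper */
		mmap_write_unlock(mm);
		cond_resched();	/* window for queued page-fault readers */
	}
}

With (a), readers block on the rwsem for the duration of the whole
range; with (b), each unlock gives queued readers a chance to run, at
the cost of re-acquiring the write lock per PMD, which is the scaling
question above.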