Hey All,

Sorry for the delay.

So, I ran some synthetic tests on a dual-socket Skylake with configured
batch sizes of 1, 8, 32, and 64. Basic setup was: 1 thread continuously
madvise(MADV_COLLAPSE)'ing memory, 20 threads continuously faulting-in
pages, and some basic synchronization so that all threads follow an
"only do work when all other threads have work to do" model (i.e. so we
don't measure faults in the absence of simultaneous collapses, or vice
versa). I used bpftrace attached to tracepoint:mmap_lock to measure r/w
mmap_lock contention over 20 minutes.

Assuming we want to optimize for fault-path readers, the results are
pretty clear: BATCH-1 outperforms BATCH-8, BATCH-32, and BATCH-64 by
254%, 381%, and 425% respectively, in terms of mean time for
fault-threads to acquire mmap_lock in read (i.e. the mean acquire time
with those batch sizes is that much higher than with BATCH-1), while
also having lower tail latency (didn't calculate, just looked at the
bpftrace histograms).

If we cared at all about madvise(MADV_COLLAPSE) performance, then
BATCH-1 is 83-86% as fast as the others and holds mmap_lock in write for
about the same amount of time in aggregate (~0 +/- 2%).

I've included the bpftrace histograms for fault-threads acquiring
mmap_lock in read at the end for posterity, and can provide more data /
info if folks are interested.

In light of these results, I'll rework the code to iteratively operate
on single hugepages, which should have the added benefit of considerably
simplifying the code for an imminent V1 series.
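The fault-path percentages can be re-derived from the "average" fields
of the @mmap_lock_r_acquire_stat lines in the bpftrace data. A minimal
Python sketch, assuming those averages are the mean read-acquire times
being compared; the helper name percent_slower is only illustrative, and
the unit (presumably nanoseconds) is an assumption that does not affect
the ratios:

```python
# Mean time for fault threads to acquire mmap_lock in read, keyed by
# batch size. Values are copied from the "average" fields of the
# @mmap_lock_r_acquire_stat lines; the unit is assumed to be ns.
mean_acquire = {1: 326480, 8: 1156055, 32: 1571585, 64: 1715797}


def percent_slower(batch, baseline=1):
    """Percent by which `batch`'s mean acquire time exceeds the baseline's."""
    return (mean_acquire[batch] / mean_acquire[baseline] - 1.0) * 100.0


for batch in (8, 32, 64):
    print(f"BATCH-{batch}: {percent_slower(batch):.1f}% slower than BATCH-1")
```

Truncated to whole percents, the three results match the figures quoted
for BATCH-8, BATCH-32, and BATCH-64.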
Thanks,
Zach

bpftrace data:

/*****************************************************************************/
batch size: 1

@mmap_lock_r_acquire[fault-thread]:
[128, 256)          1254 |                                                    |
[256, 512)       2691261 |@@@@@@@@@@@@@@@@@                                   |
[512, 1K)        2969500 |@@@@@@@@@@@@@@@@@@@                                 |
[1K, 2K)         1794738 |@@@@@@@@@@@                                         |
[2K, 4K)         1590984 |@@@@@@@@@@                                          |
[4K, 8K)         3273349 |@@@@@@@@@@@@@@@@@@@@@                               |
[8K, 16K)         851467 |@@@@@                                               |
[16K, 32K)        460653 |@@                                                  |
[32K, 64K)          7274 |                                                    |
[64K, 128K)           25 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)       8085437 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1M, 2M)          381735 |@@                                                  |
[2M, 4M)              28 |                                                    |

@mmap_lock_r_acquire_stat[fault-thread]: count 22107705, average 326480, total 7217743234867

/*****************************************************************************/
batch size: 8

@mmap_lock_r_acquire[fault-thread]:
[128, 256)            55 |                                                    |
[256, 512)        247028 |@@@@@@                                              |
[512, 1K)         239083 |@@@@@@                                              |
[1K, 2K)          142296 |@@@                                                 |
[2K, 4K)          153149 |@@@@                                                |
[4K, 8K)         1899396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)        1780734 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[16K, 32K)         95645 |@@                                                  |
[32K, 64K)          1933 |                                                    |
[64K, 128K)            3 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)         1132899 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[8M, 16M)           3953 |                                                    |

@mmap_lock_r_acquire_stat[fault-thread]: count 5696174, average 1156055, total 6585091744973

/*****************************************************************************/
batch size: 32

@mmap_lock_r_acquire[fault-thread]:
[128, 256)            35 |                                                    |
[256, 512)         63413 |@                                                   |
[512, 1K)          78130 |@                                                   |
[1K, 2K)           39548 |                                                    |
[2K, 4K)           44331 |                                                    |
[4K, 8K)         2398751 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)        1316932 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[16K, 32K)         54798 |@                                                   |
[32K, 64K)           771 |                                                    |
[64K, 128K)            2 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)        280791 |@@@@@@                                              |
[32M, 64M)           809 |                                                    |

@mmap_lock_r_acquire_stat[fault-thread]: count 4278311, average 1571585, total 6723733081824

/*****************************************************************************/
batch size: 64

@mmap_lock_r_acquire[fault-thread]:
[256, 512)         30303 |                                                    |
[512, 1K)          42366 |@                                                   |
[1K, 2K)           23679 |                                                    |
[2K, 4K)           22781 |                                                    |
[4K, 8K)         1637566 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[8K, 16K)        1955773 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)         41832 |@                                                   |
[32K, 64K)           563 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)        140723 |@@@                                                 |
[64M, 128M)           77 |                                                    |

@mmap_lock_r_acquire_stat[fault-thread]: count 3895663, average 1715797, total 6684170171691

On Thu, Mar 10, 2022 at 4:06 PM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
> On Thu, Mar 10, 2022 at 12:17 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Thu, Mar 10, 2022 at 11:26:15AM -0800, David Rientjes wrote:
> > > One concern might be the queueing of read locks needed for page faults
> > > behind a collapser of a long range of memory that is otherwise looping
> > > and repeatedly taking the write lock.
> >
> > I would have thought that _not_ batching would improve this situation.
> > Unless our implementation of rwsems has changed since the last time I
> > looked, dropping-and-reacquiring a rwsem while there are pending readers
> > means you go to the end of the line and they all get to handle their
> > page faults.
> >
>
> Hey Matthew, thanks for the review / feedback.
>
> I don't have great intuition here, so I'll try to put together a
> simple synthetic test to get some data. Though the code would be
> different, I can functionally approximate a non-batched approach with
> a batch size of 1, and compare that against N.
>
> My file-backed patches likewise weren't able to take advantage of
> batching outside mmap lock contention, so the data should equally
> apply there.