On Fri, Mar 25, 2022 at 9:51 AM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
> Hey All,
>
> Sorry for the delay. So, I ran some synthetic tests on a dual socket
> Skylake with configured batch sizes of 1, 8, 32, and 64. Basic setup
> was: 1 thread continuously madvise(MADV_COLLAPSE)'ing memory, 20
> threads continuously faulting-in pages, and some basic synchronization
> so that all threads follow an "only do work when all other threads have
> work to do" model (i.e. so we don't measure faults in the absence of
> simultaneous collapses, or vice versa). I used bpftrace attached to
> tracepoint:mmap_lock to measure r/w mmap_lock contention over 20
> minutes.
>
> Assuming we want to optimize for fault-path readers, the results are
> pretty clear: BATCH-1 outperforms BATCH-8, BATCH-32, and BATCH-64 by
> 254%, 381%, and 425% respectively, in terms of mean time for
> fault-threads to acquire mmap_lock in read, while also having less
> tail latency (didn't calculate, just looked at bpftrace histograms).
> If we cared at all about madvise(MADV_COLLAPSE) performance, then
> BATCH-1 is 83-86% as fast as the others and holds mmap_lock in write
> for about the same amount of time in aggregate (~0 +/- 2%).
>
> I've included the bpftrace histograms for fault-threads acquiring
> mmap_lock in read at the end for posterity, and can provide more data
> / info if folks are interested.
>
> In light of these results, I'll rework the code to iteratively operate
> on single hugepages, which should have the added benefit of
> considerably simplifying the code for an imminent V1 series.

Thanks for the data. Yeah, I agree this is the best tradeoff.

>
> Thanks,
> Zach
>
> bpftrace data:
>
> /*****************************************************************************/
> batch size: 1
>
> @mmap_lock_r_acquire[fault-thread]:
> [128, 256) 1254 | |
> [256, 512) 2691261 |@@@@@@@@@@@@@@@@@ |
> [512, 1K) 2969500 |@@@@@@@@@@@@@@@@@@@ |
> [1K, 2K) 1794738 |@@@@@@@@@@@ |
> [2K, 4K) 1590984 |@@@@@@@@@@ |
> [4K, 8K) 3273349 |@@@@@@@@@@@@@@@@@@@@@ |
> [8K, 16K) 851467 |@@@@@ |
> [16K, 32K) 460653 |@@ |
> [32K, 64K) 7274 | |
> [64K, 128K) 25 | |
> [128K, 256K) 0 | |
> [256K, 512K) 0 | |
> [512K, 1M) 8085437 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1M, 2M) 381735 |@@ |
> [2M, 4M) 28 | |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 22107705, average 326480, total 7217743234867
>
> /*****************************************************************************/
> batch size: 8
>
> @mmap_lock_r_acquire[fault-thread]:
> [128, 256) 55 | |
> [256, 512) 247028 |@@@@@@ |
> [512, 1K) 239083 |@@@@@@ |
> [1K, 2K) 142296 |@@@ |
> [2K, 4K) 153149 |@@@@ |
> [4K, 8K) 1899396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K) 1780734 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [16K, 32K) 95645 |@@ |
> [32K, 64K) 1933 | |
> [64K, 128K) 3 | |
> [128K, 256K) 0 | |
> [256K, 512K) 0 | |
> [512K, 1M) 0 | |
> [1M, 2M) 0 | |
> [2M, 4M) 0 | |
> [4M, 8M) 1132899 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8M, 16M) 3953 | |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 5696174, average 1156055, total 6585091744973
>
> /*****************************************************************************/
> batch size: 32
>
> @mmap_lock_r_acquire[fault-thread]:
> [128, 256) 35 | |
> [256, 512) 63413 |@ |
> [512, 1K) 78130 |@ |
> [1K, 2K) 39548 | |
> [2K, 4K) 44331 | |
> [4K, 8K) 2398751 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K) 1316932 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [16K, 32K) 54798 |@ |
> [32K, 64K) 771 | |
> [64K, 128K) 2 | |
> [128K, 256K) 0 | |
> [256K, 512K) 0 | |
> [512K, 1M) 0 | |
> [1M, 2M) 0 | |
> [2M, 4M) 0 | |
> [4M, 8M) 0 | |
> [8M, 16M) 0 | |
> [16M, 32M) 280791 |@@@@@@ |
> [32M, 64M) 809 | |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 4278311, average 1571585, total 6723733081824
>
> /*****************************************************************************/
> batch size: 64
>
> @mmap_lock_r_acquire[fault-thread]:
> [256, 512) 30303 | |
> [512, 1K) 42366 |@ |
> [1K, 2K) 23679 | |
> [2K, 4K) 22781 | |
> [4K, 8K) 1637566 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8K, 16K) 1955773 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K) 41832 |@ |
> [32K, 64K) 563 | |
> [64K, 128K) 0 | |
> [128K, 256K) 0 | |
> [256K, 512K) 0 | |
> [512K, 1M) 0 | |
> [1M, 2M) 0 | |
> [2M, 4M) 0 | |
> [4M, 8M) 0 | |
> [8M, 16M) 0 | |
> [16M, 32M) 0 | |
> [32M, 64M) 140723 |@@@ |
> [64M, 128M) 77 | |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 3895663, average 1715797, total 6684170171691
>
> On Thu, Mar 10, 2022 at 4:06 PM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
> >
> > On Thu, Mar 10, 2022 at 12:17 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 10, 2022 at 11:26:15AM -0800, David Rientjes wrote:
> > > > One concern might be the queueing of read locks needed for page faults
> > > > behind a collapser of a long range of memory that is otherwise looping
> > > > and repeatedly taking the write lock.
> > >
> > > I would have thought that _not_ batching would improve this situation.
> > > Unless our implementation of rwsems has changed since the last time I
> > > looked, dropping-and-reacquiring a rwsem while there are pending readers
> > > means you go to the end of the line and they all get to handle their
> > > page faults.
> >
> > Hey Matthew, thanks for the review / feedback.
> >
> > I don't have great intuition here, so I'll try to put together a
> > simple synthetic test to get some data. Though the code would be
> > different, I can functionally approximate a non-batched approach with
> > a batch size of 1, and compare that against N.
> >
> > My file-backed patches likewise weren't able to take advantage of
> > batching outside mmap lock contention, so the data should equally
> > apply there.
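
For readers who want to reproduce something along the lines of the test
described above (one madvise(MADV_COLLAPSE) thread, twenty fault threads,
crude lockstep), a rough, untested userspace sketch follows. The MADV_COLLAPSE
fallback value, the region size, the 2MiB hugepage size, and the
DONTNEED-then-touch fault pattern are assumptions for illustration, not
details from the actual test; measurement itself would still happen
externally, e.g. with bpftrace attached to tracepoint:mmap_lock.

/*
 * Hypothetical load generator, loosely modeled on the test described above:
 * 1 collapse thread + 20 fault threads over the same region, kept roughly in
 * lockstep with a barrier. Assumes THP "enabled" is set to "madvise" so
 * faults install base pages for the collapser to work on.
 * Build: cc -O2 -pthread collapse_load.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE	25	/* placeholder: take the real value from the patched headers */
#endif

#define HPAGE_SIZE	(2UL << 20)	/* assume 2MiB PMD-sized hugepages */
#define NR_HPAGES	64UL
#define REGION_SIZE	(NR_HPAGES * HPAGE_SIZE)
#define NR_FAULTERS	20

static char *region;
static pthread_barrier_t barrier;

/* One thread collapsing the region, one hugepage's worth of advice at a time. */
static void *collapser(void *arg)
{
	for (;;) {
		pthread_barrier_wait(&barrier);	/* only work while faulters work */
		for (unsigned long i = 0; i < NR_HPAGES; i++)
			madvise(region + i * HPAGE_SIZE, HPAGE_SIZE, MADV_COLLAPSE);
	}
	return NULL;
}

/* Fault threads: drop a hugepage's worth of memory, then touch it back in
 * with base-page faults, so collapse and fault paths keep contending. */
static void *faulter(void *arg)
{
	for (;;) {
		pthread_barrier_wait(&barrier);
		for (unsigned long i = 0; i < NR_HPAGES; i++) {
			char *hpage = region + i * HPAGE_SIZE;

			madvise(hpage, HPAGE_SIZE, MADV_DONTNEED);
			for (unsigned long off = 0; off < HPAGE_SIZE; off += 4096)
				hpage[off] = 1;
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pthread_barrier_init(&barrier, NULL, NR_FAULTERS + 1);
	pthread_create(&tid, NULL, collapser, NULL);
	for (int i = 0; i < NR_FAULTERS; i++)
		pthread_create(&tid, NULL, faulter, NULL);

	pause();	/* run until killed; measure externally with bpftrace */
	return 0;
}

Built with cc -O2 -pthread, this only generates the load; histograms like the
ones above would come from tracing the mmap_lock tracepoints while it runs.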
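
On Matthew's rwsem point, and why the per-hugepage (BATCH-1) approach bounds
fault-path latency: the effect can be illustrated in userspace with a POSIX
rwlock, though its fairness policy is not the kernel rwsem's, so this is only
an analogy. In this hypothetical demo, a writer processes 64 items either one
per write-lock hold or in a single batched hold, and a reader reports its
worst-case wait; the item count and 200us per-item cost are made up.

/*
 * Hypothetical userspace analogy for the rwsem discussion above (NOT the
 * kernel rwsem): a writer either re-takes the lock per item ("batch" 1) or
 * holds it across a whole batch, and a reader reports its worst-case wait.
 * Build: cc -O2 -pthread rwdemo.c; run: ./a.out 1  vs  ./a.out 64
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NR_ITEMS 64

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static int batch = 1;		/* items processed per write-lock hold */
static volatile int writer_done;

static void do_one_item(void)
{
	/* stand-in for collapsing one hugepage under the write lock */
	struct timespec ts = { 0, 200 * 1000 };

	nanosleep(&ts, NULL);
}

static void *writer(void *arg)
{
	int done = 0;

	while (done < NR_ITEMS) {
		pthread_rwlock_wrlock(&lock);
		for (int i = 0; i < batch && done < NR_ITEMS; i++, done++)
			do_one_item();
		pthread_rwlock_unlock(&lock);
		/* with batch == 1, readers queued behind us can slip in here */
	}
	writer_done = 1;
	return NULL;
}

static void *reader(void *arg)
{
	long worst_ns = 0;

	while (!writer_done) {
		struct timespec a, b, gap = { 0, 50 * 1000 };

		clock_gettime(CLOCK_MONOTONIC, &a);
		pthread_rwlock_rdlock(&lock);	/* stand-in for a fault taking mmap_lock in read */
		clock_gettime(CLOCK_MONOTONIC, &b);
		pthread_rwlock_unlock(&lock);

		long ns = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
		if (ns > worst_ns)
			worst_ns = ns;
		nanosleep(&gap, NULL);
	}
	printf("batch=%d: worst read-lock wait ~%ld us\n", batch, worst_ns / 1000);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t w, r;

	if (argc > 1)
		batch = atoi(argv[1]);
	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}

Running it with batch 1 versus batch 64 shows the reader's worst-case wait
growing roughly with the number of items processed per write-lock hold, which
is qualitatively the same tradeoff the fault-thread histograms above show.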