On Fri, Mar 25, 2022 at 9:51 AM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
> Hey All,
>
> Sorry for the delay. So, I ran some synthetic tests on a dual socket
> Skylake with configured batch sizes of 1, 8, 32, and 64. Basic setup
> was: 1 thread continuously madvise(MADV_COLLAPSE)'ing memory, 20
> threads continuously faulting-in pages, and some basic synchronization
> so that all threads follow an "only do work when all other threads have
> work to do" model (i.e. so we don't measure faults in the absence of
> simultaneous collapses, or vice versa). I used bpftrace attached to
> tracepoint:mmap_lock to measure r/w mmap_lock contention over 20
> minutes.
>
> Assuming we want to optimize for fault-path readers, the results are
> pretty clear: BATCH-1 outperforms BATCH-8, BATCH-32, and BATCH-64 by
> 254%, 381%, and 425% respectively, in terms of mean time for
> fault-threads to acquire mmap_lock in read, while also having less
> tail latency (didn't calculate, just looked at bpftrace histograms).
> If we cared at all about madvise(MADV_COLLAPSE) performance, then
> BATCH-1 is 83-86% as fast as the others and holds mmap_lock in write
> for about the same amount of time in aggregate (~0 +/- 2%).
>
> I've included the bpftrace histograms for fault-threads acquiring
> mmap_lock in read at the end for posterity, and can provide more data
> / info if folks are interested.
>
> In light of these results, I'll rework the code to iteratively operate
> on single hugepages, which should have the added benefit of
> considerably simplifying the code for an imminent V1 series.

Thanks for the data. Yeah, I agree this is the best tradeoff.

>
> Thanks,
> Zach
>
> bpftrace data:
>
> /*****************************************************************************/
> batch size: 1
>
> @mmap_lock_r_acquire[fault-thread]:
> [128, 256) 1254 | |
> [256, 512) 2691261 |@@@@@@@@@@@@@@@@@ |
> [512, 1K) 2969500 |@@@@@@@@@@@@@@@@@@@ |
> [1K, 2K) 1794738 |@@@@@@@@@@@ |
> [2K, 4K) 1590984 |@@@@@@@@@@ |
> [4K, 8K) 3273349 |@@@@@@@@@@@@@@@@@@@@@ |
> [8K, 16K) 851467 |@@@@@ |
> [16K, 32K) 460653 |@@ |
> [32K, 64K) 7274 | |
> [64K, 128K) 25 | |
> [128K, 256K) 0 | |
> [256K, 512K) 0 | |
> [512K, 1M) 8085437 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1M, 2M) 381735 |@@ |
> [2M, 4M) 28 | |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 22107705, average 326480, total 7217743234867
>
> /*****************************************************************************/
> batch size: 8
>
> @mmap_lock_r_acquire[fault-thread]:
> [128, 256) 55 | |
> [256, 512) 247028 |@@@@@@ |
> [512, 1K) 239083 |@@@@@@ |
> [1K, 2K) 142296 |@@@ |
> [2K, 4K) 153149 |@@@@ |
> [4K, 8K) 1899396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K) 1780734 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [16K, 32K) 95645 |@@ |
> [32K, 64K) 1933 | |
> [64K, 128K) 3 | |
> [128K, 256K) 0 | |
> [256K, 512K) 0 | |
> [512K, 1M) 0 | |
> [1M, 2M) 0 | |
> [2M, 4M) 0 | |
> [4M, 8M) 1132899 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8M, 16M) 3953 | |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 5696174, average 1156055, total 6585091744973
>
> /*****************************************************************************/
> batch size: 32
>
> @mmap_lock_r_acquire[fault-thread]:
> [128, 256) 35 | |
> [256, 512) 63413 |@ |
> [512, 1K) 78130 |@ |
> [1K, 2K) 39548 | |
> [2K, 4K) 44331 | |
> [4K, 8K) 2398751 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K) 1316932 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [16K, 32K) 54798 |@ |
> [32K, 64K) 771 | |
> [64K, 128K) 2 | |
> [128K, 256K) 0 | |
> [256K, 512K) 0 | |
> [512K, 1M) 0 | |
> [1M, 2M) 0 | |
> [2M, 4M) 0 | |
> [4M, 8M) 0 | |
> [8M, 16M) 0 | |
> [16M, 32M) 280791 |@@@@@@ |
> [32M, 64M) 809 | |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 4278311, average 1571585, total 6723733081824
>
> /*****************************************************************************/
> batch size: 64
>
> @mmap_lock_r_acquire[fault-thread]:
> [256, 512) 30303 | |
> [512, 1K) 42366 |@ |
> [1K, 2K) 23679 | |
> [2K, 4K) 22781 | |
> [4K, 8K) 1637566 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8K, 16K) 1955773 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K) 41832 |@ |
> [32K, 64K) 563 | |
> [64K, 128K) 0 | |
> [128K, 256K) 0 | |
> [256K, 512K) 0 | |
> [512K, 1M) 0 | |
> [1M, 2M) 0 | |
> [2M, 4M) 0 | |
> [4M, 8M) 0 | |
> [8M, 16M) 0 | |
> [16M, 32M) 0 | |
> [32M, 64M) 140723 |@@@ |
> [64M, 128M) 77 | |
>
> @mmap_lock_r_acquire_stat[fault-thread]: count 3895663, average 1715797, total 6684170171691
>
> On Thu, Mar 10, 2022 at 4:06 PM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
> >
> > On Thu, Mar 10, 2022 at 12:17 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 10, 2022 at 11:26:15AM -0800, David Rientjes wrote:
> > > > One concern might be the queueing of read locks needed for page faults
> > > > behind a collapser of a long range of memory that is otherwise looping
> > > > and repeatedly taking the write lock.
> > >
> > > I would have thought that _not_ batching would improve this situation.
> > > Unless our implementation of rwsems has changed since the last time I
> > > looked, dropping-and-reacquiring a rwsem while there are pending readers
> > > means you go to the end of the line and they all get to handle their
> > > page faults.
> >
> > Hey Matthew, thanks for the review / feedback.
> >
> > I don't have great intuition here, so I'll try to put together a
> > simple synthetic test to get some data. Though the code would be
> > different, I can functionally approximate a non-batched approach with
> > a batch size of 1, and compare that against N.
> >
> > My file-backed patches likewise weren't able to take advantage of
> > batching outside mmap lock contention, so the data should equally
> > apply there.
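
For readers who want to reproduce something along the lines of the test
described above (one madvise(MADV_COLLAPSE) thread, twenty fault threads,
crude lockstep), a rough, untested userspace sketch follows. The MADV_COLLAPSE
fallback value, the region size, the 2MiB hugepage size, and the
DONTNEED-then-touch fault pattern are assumptions for illustration, not
details from the actual test; measurement itself would still happen
externally, e.g. with bpftrace attached to tracepoint:mmap_lock.

/*
 * Hypothetical load generator, loosely modeled on the test described above:
 * 1 collapse thread + 20 fault threads over the same region, kept roughly in
 * lockstep with a barrier. Assumes THP "enabled" is set to "madvise" so
 * faults install base pages for the collapser to work on.
 * Build: cc -O2 -pthread collapse_load.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE	25	/* placeholder: take the real value from the patched headers */
#endif

#define HPAGE_SIZE	(2UL << 20)	/* assume 2MiB PMD-sized hugepages */
#define NR_HPAGES	64UL
#define REGION_SIZE	(NR_HPAGES * HPAGE_SIZE)
#define NR_FAULTERS	20

static char *region;
static pthread_barrier_t barrier;

/* One thread collapsing the region, one hugepage's worth of advice at a time. */
static void *collapser(void *arg)
{
	for (;;) {
		pthread_barrier_wait(&barrier);	/* only work while faulters work */
		for (unsigned long i = 0; i < NR_HPAGES; i++)
			madvise(region + i * HPAGE_SIZE, HPAGE_SIZE, MADV_COLLAPSE);
	}
	return NULL;
}

/* Fault threads: drop a hugepage's worth of memory, then touch it back in
 * with base-page faults, so collapse and fault paths keep contending. */
static void *faulter(void *arg)
{
	for (;;) {
		pthread_barrier_wait(&barrier);
		for (unsigned long i = 0; i < NR_HPAGES; i++) {
			char *hpage = region + i * HPAGE_SIZE;

			madvise(hpage, HPAGE_SIZE, MADV_DONTNEED);
			for (unsigned long off = 0; off < HPAGE_SIZE; off += 4096)
				hpage[off] = 1;
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pthread_barrier_init(&barrier, NULL, NR_FAULTERS + 1);
	pthread_create(&tid, NULL, collapser, NULL);
	for (int i = 0; i < NR_FAULTERS; i++)
		pthread_create(&tid, NULL, faulter, NULL);

	pause();	/* run until killed; measure externally with bpftrace */
	return 0;
}

Built with cc -O2 -pthread, this only generates the load; histograms like the
ones above would come from tracing the mmap_lock tracepoints while it runs.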
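
On Matthew's rwsem point, and why the per-hugepage (BATCH-1) approach bounds
fault-path latency: the effect can be illustrated in userspace with a POSIX
rwlock, though its fairness policy is not the kernel rwsem's, so this is only
an analogy. In this hypothetical demo, a writer processes 64 items either one
per write-lock hold or in a single batched hold, and a reader reports its
worst-case wait; the item count and 200us per-item cost are made up.

/*
 * Hypothetical userspace analogy for the rwsem discussion above (NOT the
 * kernel rwsem): a writer either re-takes the lock per item ("batch" 1) or
 * holds it across a whole batch, and a reader reports its worst-case wait.
 * Build: cc -O2 -pthread rwdemo.c; run: ./a.out 1  vs  ./a.out 64
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NR_ITEMS 64

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static int batch = 1;		/* items processed per write-lock hold */
static volatile int writer_done;

static void do_one_item(void)
{
	/* stand-in for collapsing one hugepage under the write lock */
	struct timespec ts = { 0, 200 * 1000 };

	nanosleep(&ts, NULL);
}

static void *writer(void *arg)
{
	int done = 0;

	while (done < NR_ITEMS) {
		pthread_rwlock_wrlock(&lock);
		for (int i = 0; i < batch && done < NR_ITEMS; i++, done++)
			do_one_item();
		pthread_rwlock_unlock(&lock);
		/* with batch == 1, readers queued behind us can slip in here */
	}
	writer_done = 1;
	return NULL;
}

static void *reader(void *arg)
{
	long worst_ns = 0;

	while (!writer_done) {
		struct timespec a, b, gap = { 0, 50 * 1000 };

		clock_gettime(CLOCK_MONOTONIC, &a);
		pthread_rwlock_rdlock(&lock);	/* stand-in for a fault taking mmap_lock in read */
		clock_gettime(CLOCK_MONOTONIC, &b);
		pthread_rwlock_unlock(&lock);

		long ns = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
		if (ns > worst_ns)
			worst_ns = ns;
		nanosleep(&gap, NULL);
	}
	printf("batch=%d: worst read-lock wait ~%ld us\n", batch, worst_ns / 1000);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t w, r;

	if (argc > 1)
		batch = atoi(argv[1]);
	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}

Running it with batch 1 versus batch 64 shows the reader's worst-case wait
growing roughly with the number of items processed per write-lock hold, which
is qualitatively the same tradeoff the fault-thread histograms above show.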