Re: [PATCH v2 00/16] Multigenerational LRU Framework

Yu Zhao <yuzhao@xxxxxxxxxx> · Wed, 14 Apr 2021 04:00:05 -0600

On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > > > > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > > > > From: SeongJae Park <sjpark@xxxxxxxxx>
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > >
> > > > > > Very interesting work, thank you for sharing this :)
> > > > > >
> > > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > > > >
> > > > > >> What's new in v2
> > > > > >> ================
> > > > > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > > > > >> I/O and helping test the fix.
> > > > > >
> > > > > > Is the discussion open?  If so, could you please give me a link?
> > > > >
> > > > > I wasn't on the initial post (or any of the lists it was posted to), but
> > > > > it's on the google page reclaim list. Not sure if that is public or not.
> > > > >
> > > > > tldr is that I was pretty excited about this work, as buffered IO tends
> > > > > to suck (a lot) for high throughput applications. My test case was
> > > > > pretty simple:
> > > > >
> > > > > Randomly read a fast device, using 4k buffered IO, and watch what
> > > > > happens when the page cache gets filled up. For this particular test,
> > > > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> > > > > with kswapd using a lot of CPU trying to keep up. That's mainline
> > > > > behavior.
> > > >
> > > > I see this exact same behaviour here, too, but I RCA'd it to
> > > > contention between the inode and memory reclaim for the mapping
> > > > structure that indexes the page cache. Basically the mapping tree
> > > > lock is the contention point here - you can either be adding pages
> > > > to the mapping during IO, or memory reclaim can be removing pages
> > > > from the mapping, but we can't do both at once.
> > > >
> > > > So we end up with kswapd spinning on the mapping tree lock like so
> > > > when doing 1.6GB/s in 4kB buffered IO:
> > > >
> > > > -   20.06%     0.00%  [kernel]               [k] kswapd
> > > >    - 20.06% kswapd
> > > >       - 20.05% balance_pgdat
> > > >          - 20.03% shrink_node
> > > >             - 19.92% shrink_lruvec
> > > >                - 19.91% shrink_inactive_list
> > > >                   - 19.22% shrink_page_list
> > > >                      - 17.51% __remove_mapping
> > > >                         - 14.16% _raw_spin_lock_irqsave
> > > >                            - 14.14% do_raw_spin_lock
> > > >                                 __pv_queued_spin_lock_slowpath
> > > >                         - 1.56% __delete_from_page_cache
> > > >                              0.63% xas_store
> > > >                         - 0.78% _raw_spin_unlock_irqrestore
> > > >                            - 0.69% do_raw_spin_unlock
> > > >                                 __raw_callee_save___pv_queued_spin_unlock
> > > >                      - 0.82% free_unref_page_list
> > > >                         - 0.72% free_unref_page_commit
> > > >                              0.57% free_pcppages_bulk
> > > >
> > > > And these are the processes consuming CPU:
> > > >
> > > >    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
> > > >    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
> > > >    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
> > > >    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
> > > >    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2
> > > >
> > > > i.e. when memory reclaim kicks in, the read process has 20% less
> > > > time with exclusive access to the mapping tree to insert new pages.
> > > > Hence buffered read performance goes down quite substantially when
> > > > memory reclaim kicks in, and this really has nothing to do with the
> > > > memory reclaim LRU scanning algorithm.
> > > >
> > > > I can actually get this machine to pin those 5 processes to 100% CPU
> > > > under certain conditions. Each process is spinning all that extra
> > > > time on the mapping tree lock, and performance degrades further.
> > > > Changing the LRU reclaim algorithm won't fix this - the workload is
> > > > solidly bound by the exclusive nature of the mapping tree lock and
> > > > the number of tasks trying to obtain it exclusively...
> > > >
> > > > > The initial posting of this patchset did no better, in fact it did a bit
> > > > > worse. Performance dropped to the same levels and kswapd was using as
> > > > > much CPU as before, but on top of that we also got excessive swapping.
> > > > > Not at a high rate, but 5-10MB/sec continually.
> > > > >
> > > > > I had some back and forths with Yu Zhao and tested a few new revisions,
> > > > > and the current series does much better in this regard. Performance
> > > > > still dips a bit when page cache fills, but not nearly as much, and
> > > > > kswapd is using less CPU than before.
> > > >
> > > > Profiles would be interesting, because it sounds to me like reclaim
> > > > *might* be batching page cache removal better (e.g. fewer, larger
> > > > batches) and so spending less time contending on the mapping tree
> > > > lock...
> > > >
> > > > IOWs, I suspect this result might actually be a result of less lock
> > > > contention due to a change in batch processing characteristics of
> > > > the new algorithm rather than it being a "better" algorithm...
> > >
> > > I appreciate the profile. But there is no batching in
> > > __remove_mapping() -- it locks the mapping for each page, and
> > > therefore the lock contention penalizes the mainline and this patchset
> > > equally. It looks worse on your system because the four kswapd threads
> > > from different nodes were working on the same file.
> >
> > I think you misunderstand exactly what I mean by "batching" here.
> > I'm not talking about doing multiple pieces of work under a single
> > lock. What I mean is that the overall amount of work done in a
> > single reclaim scan (i.e. a "reclaim batch") is packaged differently.
> >
> > We already batch up page reclaim via building a page list and then
> > passing it to shrink_page_list() to process the batch of pages in a
> > single pass. Each page in this page list batch then calls
> > remove_mapping() to pull the page from the LRU, so we have a run of
> > contention between the foreground read() thread and the background
> > kswapd.
> >
> > If the size or nature of the pages in the batch passed to
> > shrink_page_list() changes, then the amount of time a reclaim batch
> > is going to put pressure on the mapping tree lock will also change.
> > That's the "change in batching behaviour" I'm referring to here. I
> > haven't read through the patchset to determine if you change the
> > shrink_page_list() algorithm, but it likely changes what is passed
> > to be reclaimed and that in turn changes the locking patterns that
> > fall out of shrink_page_list...
> 
> Ok, if we are talking about the size of the batch passed to
> shrink_page_list(), both the mainline and this patchset cap it at
> SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
> running fio/io_uring, it's safe to say both use 32.
> 
> > > And kswapd is only one of two paths that could affect the performance.
> > > The kernel context of the test process is where the improvement mainly
> > > comes from.
> > >
> > > I also suspect you were testing a file much larger than your memory
> > > size. If so, sorry to tell you that a file only a few times larger,
> > > e.g. twice, would be worse.
> > >
> > > Here is my take:
> > >
> > > Claim
> > > -----
> > > This patchset is a "better" algorithm. (Technically it's not an
> > > algorithm, it's a feedback loop.)
> > >
> > > Theoretical basis
> > > -----------------
> > > An open-loop control (the mainline) can only be better if the margin
> > > of error in its prediction of the future events is less than that from
> > > the trial-and-error of a closed-loop control (this patchset). For
> > > simple machines, it surely can. For page reclaim, AFAIK, it can't.
> > >
> > > A typical example: when randomly accessing a (not infinitely) large
> > > file via buffered io long enough, we're bound to hit the same blocks
> > > multiple times. Should we activate the pages containing those blocks,
> > > i.e., to move them to the active lru list?  No.
> > >
> > > RCA
> > > ---
> > > For the fio/io_uring benchmark, the "No" is the key.
> > >
> > > The mainline activates pages accessed multiple times. This is done in
> > > the buffered io access path by mark_page_accessed(), and it takes the
> > > lru lock, which is contended under memory pressure. This contention
> > > slows down both the access path and kswapd. But kswapd is not the
> > > problem here because we are measuring the io_uring process, not kswapd.
> > >
> > > For this patchset, there are no activations since the refault rates of
> > > pages accessed multiple times are similar to those accessed only once
> > > -- activations will only be done to pages from tiers with higher
> > > refault rates.
> > >
> > > If you wish to debunk
> > > ---------------------
> >
> > Nope, it's your job to convince us that it works, not the other way
> > around. It's up to you to prove that your assertions are correct,
> > not for us to prove they are false.
> 
> Just trying to keep people motivated, my homework is my own.
> 
> > > git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
> > >
> > > CONFIG_LRU_GEN=y
> > > CONFIG_LRU_GEN_ENABLED=y
> > >
> > > Run your benchmarks
> > >
> > > Profiles (200G mem + 400G file)
> > > -------------------------------
> > > A quick test from Jens' fio/io_uring:
> > >
> > > -rc7
> > >     13.30%  io_uring  xas_load
> > >     13.22%  io_uring  _copy_to_iter
> > >     12.30%  io_uring  __add_to_page_cache_locked
> > >      7.43%  io_uring  clear_page_erms
> > >      4.18%  io_uring  filemap_get_read_batch
> > >      3.54%  io_uring  get_page_from_freelist
> > >      2.98%  io_uring  ***native_queued_spin_lock_slowpath***
> > >      1.61%  io_uring  page_cache_ra_unbounded
> > >      1.16%  io_uring  xas_start
> > >      1.08%  io_uring  filemap_read
> > >      1.07%  io_uring  ***__activate_page***
> > >
> > > lru lock: 2.98% (lru addition + activation)
> > > activation: 1.07%
> > >
> > > -rc7 + this patchset
> > >     14.44%  io_uring  xas_load
> > >     14.14%  io_uring  _copy_to_iter
> > >     11.15%  io_uring  __add_to_page_cache_locked
> > >      6.56%  io_uring  clear_page_erms
> > >      4.44%  io_uring  filemap_get_read_batch
> > >      2.14%  io_uring  get_page_from_freelist
> > >      1.32%  io_uring  page_cache_ra_unbounded
> > >      1.20%  io_uring  psi_group_change
> > >      1.18%  io_uring  filemap_read
> > >      1.09%  io_uring  ***native_queued_spin_lock_slowpath***
> > >      1.08%  io_uring  do_mpage_readpage
> > >
> > > lru lock: 1.09% (lru addition only)
> >
> > All this tells us is that there was *less contention on the mapping
> > tree lock*. It does not tell us why there was less contention.
> >
> > You've handily omitted the kswapd profile, which is really the one
> > of interest to the discussion here - how did the memory reclaim CPU
> > usage profile also change at the same time?
> 
> Well, let me attach them. Suffix -1 is the mainline, -2 is the patchset.
> 
>   mainline
>      57.65%  kswapd0  __remove_mapping
>   this patchset
>      61.61%  kswapd0  __remove_mapping
> 
> As I said, the mapping lock contention penalizes both heavily. Its
> percentage is even higher with the patchset, because it has less
> overhead. I'm trying to explain "the less overhead" part: it's the
> activations that make the mainline worse.
> 
>   mainline
>     6.53%  kswapd0  shrink_active_list
>   this patchset
>     0
> 
> From the io_uring context:
>   mainline
>      2.53%  io_uring  mark_page_accessed
>   this patchset
>      0.52%  io_uring  mark_page_accessed
> 
> mark_page_accessed() moves pages accessed multiple times to the active
> lru list. Then shrink_active_list() moves them back to the inactive
> list. All for nothing.
> 
> I don't want to paste everything here -- it would clutter the thread.
> Please see all the detailed profiles in the attachment. Let me know if
> their formats are not to your liking. I still have the raw perf.data.
> 
> > > And I plan to reach out to other communities, e.g., PostgreSQL, to
> > > benchmark the patchset. I heard they have been complaining about the
> > > buffered io performance under memory pressure. Any other benchmarks
> > > you'd suggest?
> > >
> > > BTW, you might find another surprise in how much less frequently slab
> > > shrinkers are called under memory pressure, because this patchset is a
> > > lot better at finding pages to reclaim and therefore doesn't overkill
> > > slabs.
> >
> > That's actually very likely to be a Bad Thing and cause unexpected
> > performance and OOM-based regressions. When the machine finally runs
> > out of page cache it can easily reclaim, it's going to get stuck
> > with long tail latencies reclaiming huge slab caches as they've had
> > no substantial ongoing pressure put on them to keep them in balance
> > with the overall memory pressure the system is under...
> 
> Well, it does use the existing equation: if it scans X% of pages,
> then it scans X% of slab objects. But 1) it often finds pages to
> reclaim at a lower X%, and 2) the pages it reclaims are less likely to
> refault. So the side effect is that the total number of slab objects it
> scans also drops. I do see your point but don't see any options at the
> moment.

I apologize for the spam. Apparently the attachment in my previous email
didn't reach everybody. I hope this works:

git clone https://linux-mm.googlesource.com/benchmarks

The repo contains the profiles collected while running fio/io_uring:
  mainline:
    kswapd-1.txt
    kswapd-1.svg
    io_uring-1.txt
    io_uring-1.svg

  patched:
    kswapd-2.txt
    kswapd-2.svg
    io_uring-2.txt
    io_uring-2.svg
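
For anyone who wants to diff the two runs quickly, here is a rough
sketch rather than a polished tool. It assumes the .txt profiles contain
lines shaped like the excerpts quoted in this thread ("13.30%  io_uring
xas_load"); if the files use a different perf report layout, the regular
expression needs adjusting. The names parse_profile() and compare() are
made up for this example.

```python
import re
from typing import Dict, List, Tuple

# ASSUMPTION: each line looks like the excerpts quoted in this thread,
# e.g. "    13.30%  io_uring  xas_load". Other layouts are skipped.
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s*$")


def parse_profile(text: str) -> Dict[str, float]:
    """Map symbol -> overhead percentage; non-matching lines are ignored."""
    overhead: Dict[str, float] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        pct, _comm, symbol = match.groups()
        # Drop the "***" emphasis markers used in the quoted excerpts.
        symbol = symbol.strip("*")
        overhead[symbol] = overhead.get(symbol, 0.0) + float(pct)
    return overhead


def compare(mainline: str, patched: str) -> List[Tuple[str, float, float, float]]:
    """Return (symbol, mainline %, patched %, delta), largest change first."""
    base = parse_profile(mainline)
    new = parse_profile(patched)
    rows = [
        (sym, base.get(sym, 0.0), new.get(sym, 0.0),
         new.get(sym, 0.0) - base.get(sym, 0.0))
        for sym in set(base) | set(new)
    ]
    # Sort by absolute change, then by name so the order is stable.
    rows.sort(key=lambda row: (-abs(row[3]), row[0]))
    return rows
```

Usage would be along the lines of
compare(open("kswapd-1.txt").read(), open("kswapd-2.txt").read()),
then printing the first few rows. A symbol missing from one side is
reported as 0.0 there.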

Thanks.