On Wed, Oct 2, 2024 at 10:02 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >
> > Barry Song <21cnbao@xxxxxxxxx> writes:
> >
> > > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > >>
> > >> Barry Song <21cnbao@xxxxxxxxx> writes:
> > >>
> > >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > >> >>
> > >> >> Hi, Barry,
> > >> >>
> > >> >> Barry Song <21cnbao@xxxxxxxxx> writes:
> > >> >>
> > >> >> > From: Barry Song <v-songbaohua@xxxxxxxx>
> > >> >> >
> > >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> > >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
> > >> >> > Android devices. To address this, we can use a waitqueue to wake up
> > >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
> > >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
> > >> >> > unrelated `do_swap_page()`, this method is preferable to the two
> > >> >> > alternative scenarios: rapid re-entry into page faults, which can
> > >> >> > cause livelocks, and multiple millisecond sleeps, which visibly
> > >> >> > degrade user experience.
> > >> >>
> > >> >> In general, I think that this works. Why not extend the solution to
> > >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
> > >> >> too? We can call wake_up() when we clear SWAP_HAS_CACHE. To avoid
> > >> >
> > >> > Hi Ying,
> > >> > Thanks for your comments.
> > >> > I feel extending the solution to __read_swap_cache_async() should be done
> > >> > in a separate patch. On phones, I've never encountered any issues reported
> > >> > on that path, so it might be better suited for an optimization rather than a
> > >> > hotfix?

Hi Barry and Ying,

For the __read_swap_cache_async() case, I'm not really against adding a
similar waitqueue, but if no one is really suffering from it, and if the
waitqueue does cause extra overhead, maybe we can ignore that case for
now. I plan to resend the following patch:

https://lore.kernel.org/linux-mm/20240326185032.72159-9-ryncsn@xxxxxxxxx/#r

It removes all the schedule_timeout_uninterruptible() workarounds and other
similar things, and the performance will go even higher.

> > >>
> > >> Yes. It's fine to do that in another patch as optimization.
> > >
> > > Ok. I'll prepare a separate patch for optimizing that path.
> >
> > Thanks!
> >
> > >>
> > >> >> overhead to call wake_up() when there's no task waiting, we can use an
> > >> >> atomic to count waiting tasks.
> > >> >
> > >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
> > >> > waitqueue should have a very low cost on its own?
> > >>
> > >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
> > >> shared lock. On systems with many CPUs (such as servers), this may cause
> > >> severe lock contention. Even the cache ping-pong may hurt performance
> > >> much.
> > >
> > > I understand that cache synchronization was a significant issue before
> > > qspinlock, but it seems to be less of a concern after its implementation.
> >
> > Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
> > discussed in the following thread.
> >
> > https://lore.kernel.org/lkml/20220510192708.GQ76023@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> >
> > > However, using a global atomic variable would still trigger cache broadcasts,
> > > correct?
> >
> > We can only change the atomic variable to non-zero when
> > swapcache_prepare() returns non-zero, and call wake_up() when the atomic
> > variable is non-zero. Because swapcache_prepare() returns 0 most times,
> > the atomic variable is 0 most times. If we don't change the value of the
> > atomic variable, cache ping-pong will not be triggered.
>
> Yes, this can be implemented by adding another atomic variable.
>
> >
> > Hi, Kairui,
> >
> > Do you have some test cases to test parallel zram swap-in? If so, that
> > can be used to verify whether cache ping-pong is an issue and whether it
> > can be fixed via a global atomic variable.
> >
>
> Yes. Kairui, please run a test on your machine with lots of cores before
> and after adding a global atomic variable as suggested by Ying. I am
> sorry I don't have a server machine.

I just had a try with the kernel build test that I used for the allocator
patch series, with -j64 and a 1G memcg, on my local branch:

Without the patch:
2677.63user 9100.43system 3:33.15elapsed 5452%CPU (0avgtext+0avgdata 863284maxresident)k
2671.40user 8969.07system 3:33.67elapsed 5447%CPU (0avgtext+0avgdata 863316maxresident)k
2673.66user 8973.90system 3:33.18elapsed 5463%CPU (0avgtext+0avgdata 863284maxresident)k

With the patch:
2655.05user 9134.21system 3:35.63elapsed 5467%CPU (0avgtext+0avgdata 863288maxresident)k
2652.57user 9104.87system 3:35.07elapsed 5466%CPU (0avgtext+0avgdata 863272maxresident)k
2665.44user 9155.97system 3:35.92elapsed 5474%CPU (0avgtext+0avgdata 863316maxresident)k

These are only three test runs, and the main bottleneck in this test is
still some other locks (the list_lru lock, the swap cgroup lock, etc.), but
it does show that performance seems a bit lower. That could be considered a
trivial amount of overhead, so I think it's acceptable for the SYNC_IO path.
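
To make the discussion concrete, here is a rough, untested sketch of the
combination being suggested: keep a waitqueue for tasks that lose the
swapcache_prepare() race, and gate wake_up() behind a global atomic counter
so the common path never takes the waitqueue spinlock or writes to a shared
cacheline. The names (swapcache_wq, swapcache_waiters, and the two helpers)
are made up for illustration; this is not the actual patch:

#include <linux/wait.h>
#include <linux/atomic.h>
#include <linux/sched.h>

/* Hypothetical globals, for illustration only. */
static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
static atomic_t swapcache_waiters = ATOMIC_INIT(0);

/*
 * Raced path: swapcache_prepare() failed because another task holds
 * SWAP_HAS_CACHE.  Instead of a bare one-tick sleep, register on the
 * waitqueue so the owner can wake us as soon as it is done, keeping
 * the one-tick timeout as an upper bound in case no wakeup arrives.
 */
static void swapcache_wait_for_owner(void)
{
	DEFINE_WAIT(wait);

	atomic_inc(&swapcache_waiters);
	prepare_to_wait(&swapcache_wq, &wait, TASK_UNINTERRUPTIBLE);
	schedule_timeout(1);
	finish_wait(&swapcache_wq, &wait);
	atomic_dec(&swapcache_waiters);
}

/*
 * Owner path: called right after SWAP_HAS_CACHE is cleared.  The
 * atomic_read() keeps the common case (no one raced with us) free of
 * the waitqueue spinlock and of any write to the shared counter.
 */
static void swapcache_wake_waiters(void)
{
	if (atomic_read(&swapcache_waiters))
		wake_up(&swapcache_wq);
}

A lost wakeup (the owner reading the counter just before a waiter increments
it) only means that waiter falls back to the old behaviour of sleeping for a
full tick, so the one-tick timeout still acts as the safety net. And because
swapcache_prepare() succeeds most of the time, the counter stays at 0 and the
owner side never dirties it, which should avoid the cache ping-pong Ying is
worried about.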