Hi, Kairui,

Kairui Song <ryncsn@xxxxxxxxx> writes:

> On Wed, Oct 2, 2024 at 10:02 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>>
>> On Wed, Oct 2, 2024 at 8:43 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >
>> > Barry Song <21cnbao@xxxxxxxxx> writes:
>> >
>> > > On Tue, Oct 1, 2024 at 7:43 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >>
>> > >> Barry Song <21cnbao@xxxxxxxxx> writes:
>> > >>
>> > >> > On Sun, Sep 29, 2024 at 3:43 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >>
>> > >> >> Hi, Barry,
>> > >> >>
>> > >> >> Barry Song <21cnbao@xxxxxxxxx> writes:
>> > >> >>
>> > >> >> > From: Barry Song <v-songbaohua@xxxxxxxx>
>> > >> >> >
>> > >> >> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>> > >> >> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
>> > >> >> > fails, which has led to reports of UI stuttering on latency-sensitive
>> > >> >> > Android devices. To address this, we can use a waitqueue to wake up
>> > >> >> > tasks that fail `swapcache_prepare()` sooner, instead of always
>> > >> >> > sleeping for a full tick. While tasks may occasionally be woken by an
>> > >> >> > unrelated `do_swap_page()`, this method is preferable to the two
>> > >> >> > alternatives: rapid re-entry into page faults, which can cause
>> > >> >> > livelocks, and multiple millisecond sleeps, which visibly degrade
>> > >> >> > user experience.
>> > >> >>
>> > >> >> In general, I think that this works. Why not extend the solution to
>> > >> >> cover schedule_timeout_uninterruptible() in __read_swap_cache_async()
>> > >> >> too? We can call wake_up() when we clear SWAP_HAS_CACHE. To avoid
>> > >> >
>> > >> > Hi Ying,
>> > >> > Thanks for your comments.
>> > >> > I feel extending the solution to __read_swap_cache_async() should be done
>> > >> > in a separate patch. On phones, I've never encountered any issues reported
>> > >> > on that path, so it might be better suited for an optimization rather than
>> > >> > a hotfix?
>
> Hi Barry and Ying,
>
> For the __read_swap_cache_async case, I'm not really against adding a
> similar waitqueue, but if no one is really suffering from it, and if
> the waitqueue does cause extra overhead, maybe we can ignore it for the
> __read_swap_cache_async case for now, and I plan to resend the following
> patch:
> https://lore.kernel.org/linux-mm/20240326185032.72159-9-ryncsn@xxxxxxxxx/#r
>
> It removes all the schedule_timeout_uninterruptible workarounds and other
> similar things, and the performance will go even higher.

Sounds good to me. Please resend it. It's more complex than Barry's
fix, so I suggest merging Barry's version first.

>> > >>
>> > >> Yes. It's fine to do that in another patch as an optimization.
>> > >
>> > > Ok. I'll prepare a separate patch for optimizing that path.
>> >
>> > Thanks!
>> >
>> > >>
>> > >> >> overhead to call wake_up() when there's no task waiting, we can use an
>> > >> >> atomic to count waiting tasks.
>> > >> >
>> > >> > I'm not sure it's worth adding the complexity, as wake_up() on an empty
>> > >> > waitqueue should have a very low cost on its own?
>> > >>
>> > >> wake_up() needs to call spin_lock_irqsave() unconditionally on a global
>> > >> shared lock. On systems with many CPUs (such as servers), this may cause
>> > >> severe lock contention. Even the cache ping-pong alone may hurt
>> > >> performance significantly.
>> > >
>> > > I understand that cache synchronization was a significant issue before
>> > > qspinlock, but it seems to be less of a concern after its implementation.
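For readers following the thread, the waitqueue scheme under discussion
looks roughly like the minimal sketch below. It is illustrative and
untested, not the exact posted patch; swapcache_wq is a placeholder
name, and error handling is omitted.

#include <linux/wait.h>		/* waitqueue helpers */
#include <linux/sched.h>	/* schedule_timeout() */

static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);

	/* Loser of the SWAP_HAS_CACHE race in do_swap_page(): */
	if (swapcache_prepare(entry)) {
		DEFINE_WAIT(wait);

		/*
		 * Sleep until the owner clears SWAP_HAS_CACHE, keeping
		 * the one-tick timeout only as a safety net against a
		 * missed wakeup, instead of always sleeping a full tick.
		 */
		prepare_to_wait(&swapcache_wq, &wait, TASK_UNINTERRUPTIBLE);
		schedule_timeout(1);
		finish_wait(&swapcache_wq, &wait);
		goto out;
	}

	/* Owner, after clearing SWAP_HAS_CACHE via swapcache_clear(): */
	wake_up(&swapcache_wq);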
>> >
>> > Unfortunately, qspinlock cannot eliminate the cache ping-pong issue, as
>> > discussed in the following thread.
>> >
>> > https://lore.kernel.org/lkml/20220510192708.GQ76023@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>> >
>> > > However, using a global atomic variable would still trigger cache
>> > > broadcasts, correct?
>> >
>> > We can change the atomic variable to non-zero only when
>> > swapcache_prepare() returns non-zero, and call wake_up() only when the
>> > atomic variable is non-zero. Because swapcache_prepare() returns 0 most
>> > of the time, the atomic variable is 0 most of the time. If we don't
>> > change the value of the atomic variable, cache ping-pong will not be
>> > triggered.
>>
>> Yes, this can be implemented by adding another atomic variable.
>>
>> >
>> > Hi, Kairui,
>> >
>> > Do you have some test cases to test parallel zram swap-in? If so, they
>> > can be used to verify whether cache ping-pong is an issue and whether it
>> > can be fixed via a global atomic variable.
>> >
>>
>> Yes. Kairui, please run a test on your machine with lots of cores before
>> and after adding a global atomic variable as suggested by Ying. I am
>> sorry I don't have a server machine.
>
> I just had a try with the kernel build test which I used for the
> allocator patch series, with -j64 and a 1G memcg, on my local branch:
>
> Without the patch:
> 2677.63user 9100.43system 3:33.15elapsed 5452%CPU (0avgtext+0avgdata 863284maxresident)k
> 2671.40user 8969.07system 3:33.67elapsed 5447%CPU (0avgtext+0avgdata 863316maxresident)k
> 2673.66user 8973.90system 3:33.18elapsed 5463%CPU (0avgtext+0avgdata 863284maxresident)k
>
> With the patch:
> 2655.05user 9134.21system 3:35.63elapsed 5467%CPU (0avgtext+0avgdata 863288maxresident)k
> 2652.57user 9104.87system 3:35.07elapsed 5466%CPU (0avgtext+0avgdata 863272maxresident)k
> 2665.44user 9155.97system 3:35.92elapsed 5474%CPU (0avgtext+0avgdata 863316maxresident)k
>
> These are only three test runs, and the main bottleneck for the test is
> still some other locks (the list_lru lock, swap cgroup lock, etc.), but
> they do show that performance is a bit lower with the patch. It could be
> considered a trivial amount of overhead, so I think it's acceptable for
> the SYNC_IO path.

Thanks! The difference appears measurable, although small. And, in some
use cases, multiple memcgs may be used, so the list_lru and swap cgroup
locks will be less contended.

--
Best Regards,
Huang, Ying
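For reference, the atomic-guard refinement described above could look
roughly like the following sketch (again illustrative and untested;
swapcache_wq and swapcache_waiters are placeholder names). The point is
that the common path, where swapcache_prepare() succeeds, never writes
the shared cacheline, so no ping-pong is triggered.

static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
static atomic_t swapcache_waiters = ATOMIC_INIT(0);

	/* Loser of the SWAP_HAS_CACHE race in do_swap_page(): */
	if (swapcache_prepare(entry)) {
		DEFINE_WAIT(wait);

		/* Only racing faulters ever modify the atomic. */
		atomic_inc(&swapcache_waiters);
		prepare_to_wait(&swapcache_wq, &wait, TASK_UNINTERRUPTIBLE);
		schedule_timeout(1);
		finish_wait(&swapcache_wq, &wait);
		atomic_dec(&swapcache_waiters);
		goto out;
	}

	/*
	 * Owner, after swapcache_clear(): skip the wake_up(), and its
	 * spin_lock_irqsave() on the global waitqueue lock, when nobody
	 * is waiting. The waiter's one-tick timeout bounds any wakeup
	 * missed in the inc/read window.
	 */
	if (atomic_read(&swapcache_waiters))
		wake_up(&swapcache_wq);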