On Wed, Feb 23, 2022 at 11:28 AM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
> [...]
> > 2) Can you please use the similar bpf+kprobe tracing for the
> > memcg_rstat_updated() (or __mod_memcg_lruvec_state()) to find the
> > source of frequent stat updates.
>
> "memcg_rstat_updated" is "static inline".
>
> With the following:
>
> bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'
> [...]

Thanks, that is helpful. It seems like most of the stat updates are
happening on anon page faults, and based on the stack signatures, they
look like swap refaults.

> > 3) I am still pondering why disabling swap resolves the issue for you.
> > Is that only for a workload different from xfs read?
>
> My understanding is that any block IO (including swap) triggers new
> memcg accounting code. In our process we don't have any other IO than
> swap, so disabling swap removes the major (if not only) vector of
> triggering this issue.

Now I understand why disabling swap helps your case: the number of stat
updates is reduced drastically, so the rstat flush happens
asynchronously most of the time.

[...]
> I should mention that there are really two issues:
>
> 1. Expensive workingset_refault, which shows up on flamegraphs. We see
> it for our rocksdb based database, which persists data on xfs (local
> nvme).
> 2. Expensive workingset_refault that causes latency hiccups, but
> doesn't show up on flamegraphs. We see it in our nginx based proxy
> with swap enabled (either zram or regular file on xfs on local nvme).
>
> We solved the latter by disabling swap. I think the proper solution
> would be for workingset_refault to be fast enough to be invisible, in
> line with what was happening on Linux 5.10.

Thanks for the info. Is it possible to test
https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@xxxxxxxxxx/ ?
If that patch does not help, then we either have to optimize rstat
flushing or further increase the update buffer, which is currently
nr_cpus * 32.