On Wed, Feb 23, 2022 at 11:28 AM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
> [...]
> > 2) Can you please use the similar bpf+kprobe tracing for the
> > memcg_rstat_updated() (or __mod_memcg_lruvec_state()) to find the
> > source of frequent stat updates.
>
> "memcg_rstat_updated" is "static inline".
>
> With the following:
>
> bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'
> [...]

Thanks, that is helpful. It seems like most of the stat updates are
happening on anon page faults, and based on the stack signatures, they
look like swap refaults.

> > 3) I am still pondering why disabling swap resolves the issue for you.
> > Is that only for a workload different from xfs read?
>
> My understanding is that any block IO (including swap) triggers new
> memcg accounting code. In our process we don't have any other IO than
> swap, so disabling swap removes the major (if not only) vector of
> triggering this issue.

Now I understand why disabling swap helps your case: the number of stat
updates is reduced drastically, so the rstat flush happens
asynchronously most of the time.

[...]
> I should mention that there are really two issues:
>
> 1. Expensive workingset_refault, which shows up on flamegraphs. We see
> it for our rocksdb based database, which persists data on xfs (local
> nvme).
> 2. Expensive workingset_refault that causes latency hiccups, but
> doesn't show up on flamegraphs. We see it in our nginx based proxy
> with swap enabled (either zram or regular file on xfs on local nvme).
>
> We solved the latter by disabling swap. I think the proper solution
> would be for workingset_refault to be fast enough to be invisible, in
> line with what was happening on Linux 5.10.

Thanks for the info. Is it possible to test
https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@xxxxxxxxxx/ ?
If that patch does not help, then we either have to optimize rstat
flushing or further increase the update buffer, which is currently
nr_cpus * 32.