> Can you please first confirm if you are running 5.15.19 kernel? If
> that is the case then it means the machine is seeing (nr_cpus * 32)
> stat updates.

Yes, we're on 5.15.19.

> Couple of questions:
> 1) Did you see any call stack other than xfs read where expensive
> mem_cgroup_flush_stats() is showing up?

Yes:

mem_cgroup_flush_stats: 7717 us
stack:
        mem_cgroup_flush_stats+1
        workingset_refault+296
        __read_swap_cache_async+512
        swapin_readahead+817
        do_swap_page+561
        __handle_mm_fault+2318
        handle_mm_fault+191
        do_user_addr_fault+428
        exc_page_fault+103
        asm_exc_page_fault+30
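In case a distribution is more useful than individual slow samples: a kprobe/kretprobe pair along the lines below collects a per-call latency histogram for mem_cgroup_flush_stats (it shows up as a real symbol in the stacks above, so it should be probeable). This is just a sketch, and the 1 ms cutoff for the extra slow-flush counter is an arbitrary value, not something taken from the traces:

bpftrace -e '
kprobe:mem_cgroup_flush_stats {
  // remember when this thread entered the flush
  @start[tid] = nsecs;
}
kretprobe:mem_cgroup_flush_stats /@start[tid]/ {
  $us = (nsecs - @start[tid]) / 1000;
  // per-call flush latency histogram, in microseconds
  @flush_us = hist($us);
  // separately count flushes that took 1 ms or longer
  if ($us >= 1000) { @slow_flushes = count(); }
  delete(@start[tid]);
}'

Under sustained load this makes it easy to tell whether multi-millisecond flushes like the 7717 us one above are rare outliers or the common case.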
> 2) Can you please use the similar bpf+kprobe tracing for the
> memcg_rstat_updated() (or __mod_memcg_lruvec_state()) to find the
> source of frequent stat updates.

memcg_rstat_updated() is static inline, so it can't be traced with a kprobe directly. With the following:

bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'

The top stacks are:

@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    clear_page_dirty_for_io+292
    write_cache_pages+323
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
    __writeback_inodes_wb+76
]: 36429
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    __test_set_page_writeback+421
    iomap_do_writepage+1241
    write_cache_pages+393
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
]: 36431
@stacks[
    __mod_memcg_lruvec_state+1
    mod_objcg_state+421
    memcg_slab_post_alloc_hook+345
    __kmalloc_node_track_caller+386
    __alloc_skb+139
    alloc_skb_with_frags+75
    sock_alloc_send_pskb+495
    unix_stream_sendmsg+599
    sock_sendmsg+94
    sock_write_iter+151
]: 39459
@stacks[
    __mod_memcg_lruvec_state+1
    release_pages+760
    tlb_finish_mmu+89
    zap_page_range+265
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54648
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_remove_rmap+85
    unmap_page_range+1413
    zap_page_range+224
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54719
@stacks[
    __mod_memcg_lruvec_state+1
    __pagevec_lru_add+256
    lru_cache_add+82
    __handle_mm_fault+4686
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 98103
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_add_new_anon_rmap+78
    __handle_mm_fault+4675
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 105071

> 3) I am still pondering why disabling swap resolves the issue for you.
> Is that only for a workload different from xfs read?

My understanding is that any block IO (including swap) triggers the new memcg accounting code. In this process we don't have any IO other than swap, so disabling swap removes the major (if not the only) vector for triggering the issue.

> 4) Can you please also trace cgroup_rstat_flush_locked()?

Stacks are:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @stacks[kstack(10)]++ }'
Attaching 1 probe...
^C

@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    shrink_node+163
    balance_pgdat+777
    kswapd+508
    kthread+295
    ret_from_fork+34
]: 1
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    __read_swap_cache_async+512
    swapin_readahead+817
    do_swap_page+561
    __handle_mm_fault+2318
    handle_mm_fault+191
    do_user_addr_fault+428
]: 5
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+569
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 18
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 31
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    force_page_cache_ra+226
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
]: 223

Let me know if you only need the slow ones.

I should mention that there are really two issues:

1. Expensive workingset_refault that shows up on flamegraphs. We see it for our RocksDB-based database, which persists data on xfs (local nvme).
2. Expensive workingset_refault that causes latency hiccups but doesn't show up on flamegraphs. We see it in our nginx-based proxy with swap enabled (either zram or a regular swap file on xfs on local nvme). We solved the latter by disabling swap.

I think the proper solution would be for workingset_refault to be fast enough to be invisible, in line with how it behaved on Linux 5.10.