On Wed, Feb 23, 2022 at 9:16 PM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
>
> Thanks for the info. Is it possible to test
> https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@xxxxxxxxxx/

We built the patch on top of 5.15.25 and deployed it with swap-on-zram
on some nodes. Test data follow:

1) Latency of workingset_refault over 60 seconds

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 4643     |**********************                  |
       512 -> 1023       : 4259     |********************                    |
      1024 -> 2047       : 1178     |*****                                   |
      2048 -> 4095       : 4044     |*******************                     |
      4096 -> 8191       : 1804     |********                                |
      8192 -> 16383      : 109      |                                        |
     16384 -> 32767      : 52       |                                        |
     32768 -> 65535      : 45       |                                        |
     65536 -> 131071     : 15       |                                        |
    131072 -> 262143     : 2        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 7        |                                        |
   1048576 -> 2097151    : 900      |****                                    |
   2097152 -> 4194303    : 8218     |****************************************|
   4194304 -> 8388607    : 967      |****                                    |
   8388608 -> 16777215   : 89       |                                        |

avg = 1229607 nsecs, total: 32378027676 nsecs, count: 26332

2) Latency of cgroup_rstat_flush_locked over 60 seconds

$ sudo /usr/share/bcc/tools/funclatency -d 60 cgroup_rstat_flush_locked
Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 36       |                                        |
      2048 -> 4095       : 73       |                                        |
      4096 -> 8191       : 151      |                                        |
      8192 -> 16383      : 606      |**                                      |
     16384 -> 32767      : 387      |*                                       |
     32768 -> 65535      : 49       |                                        |
     65536 -> 131071     : 26       |                                        |
    131072 -> 262143     : 6        |                                        |
    262144 -> 524287     : 3        |                                        |
    524288 -> 1048575    : 14       |                                        |
   1048576 -> 2097151    : 2085     |*********                               |
   2097152 -> 4194303    : 8580     |****************************************|
   4194304 -> 8388607    : 772      |***                                     |
   8388608 -> 16777215   : 117      |                                        |
  16777216 -> 33554431   : 1        |                                        |

avg = 2592894 nsecs, total: 33463898782 nsecs, count: 12906

3) Summary of stack traces when mem_cgroup_flush_stats is over 5ms

@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    __read_swap_cache_async+512
    swap_cluster_readahead+440
    shmem_swapin+164
    shmem_swapin_page+396
    shmem_getpage_gfp+1228
    shmem_fault+104
    __do_fault+55
    __handle_mm_fault+3102
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 1
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    __read_swap_cache_async+512
    swap_cluster_readahead+333
    shmem_swapin+164
    shmem_swapin_page+396
    shmem_getpage_gfp+1228
    shmem_fault+104
    __do_fault+55
    __handle_mm_fault+3649
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 1
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    __read_swap_cache_async+512
    swap_cluster_readahead+333
    shmem_swapin+164
    shmem_swapin_page+396
    shmem_getpage_gfp+1228
    shmem_fault+104
    __do_fault+55
    __handle_mm_fault+3102
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 4
@stackcount[
    mem_cgroup_flush_stats+1
    shrink_node+163
    balance_pgdat+791
    kswapd+508
    kthread+295
    ret_from_fork+34
]: 7
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+569
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
    new_sync_read+280
    vfs_read+242
    __x64_sys_pread64+137
    __noinstr_text_start+59
    entry_SYSCALL_64_after_hwframe+68
]: 10
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
    new_sync_read+280
    vfs_read+242
    __x64_sys_pread64+137
    __noinstr_text_start+59
    entry_SYSCALL_64_after_hwframe+68
]: 111
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    force_page_cache_ra+226
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
    new_sync_read+280
    vfs_read+242
    __x64_sys_pread64+137
    __noinstr_text_start+59
    entry_SYSCALL_64_after_hwframe+68
]: 225

This was collected with the following bpftrace script:

kprobe:mem_cgroup_flush_stats
{
	// Record entry timestamp and kernel stack for this thread.
	@start[tid] = nsecs;
	@stack[tid] = kstack;
}

kretprobe:mem_cgroup_flush_stats
/@start[tid]/
{
	$usecs = (nsecs - @start[tid]) / 1000;
	if ($usecs >= 5000) {
		// Count the stacks of calls that took 5ms or more.
		@stackcount[@stack[tid]]++;
	}
	delete(@start[tid]);
	delete(@stack[tid]);
}

interval:s:60
{
	print(@stackcount);
	exit();
}
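In case anyone wants to reproduce this, a variant of the same script
can break the latency down per command instead of per stack. This is
only a sketch: hist() and comm are standard bpftrace builtins, and the
5ms cutoff is dropped so the whole distribution is visible:

kprobe:mem_cgroup_flush_stats
{
	@start[tid] = nsecs;
}

kretprobe:mem_cgroup_flush_stats
/@start[tid]/
{
	// Latency histogram in microseconds, keyed by command name.
	@usecs[comm] = hist((nsecs - @start[tid]) / 1000);
	delete(@start[tid]);
}

interval:s:60
{
	// Maps are printed automatically when bpftrace exits.
	exit();
}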
>
> If that patch did not help then we either have to optimize rstat
> flushing or further increase the update buffer which is nr_cpus * 32.

I think overall, we don't see any significant improvement. Expensive
calls to mem_cgroup_flush_stats are still there: in the 60s sample
above, 10174 of 26332 workingset_refault calls (~39%) still took 1ms
or longer, and the average refault latency is still around 1.23ms.
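If it helps decide between those two options, we can also try to
measure how often the nr_cpus * 32 threshold actually trips, i.e. how
many mem_cgroup_flush_stats calls end up doing a real flush. A rough
sketch, assuming cgroup_rstat_flush_locked is still the common path
every flush funnels through (as the traces above suggest); note that
@flushes will also count flushes triggered by other rstat users, e.g.
cpu.stat readers:

kprobe:mem_cgroup_flush_stats { @calls = count(); }
kprobe:cgroup_rstat_flush_locked { @flushes = count(); }
// Both counters are printed automatically when bpftrace exits.
interval:s:60 { exit(); }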