On Thu 09-01-20 23:12:49, Chris Murphy wrote:
> On Thu, Jan 9, 2020 at 4:53 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > On Thu 09-01-20 12:51:48, Michal Hocko wrote:
> > > > I've updated the bug, attaching kernel messages and /proc/vmstat in
> > > > 1s increments, although quite often during the build multiple seconds
> > > > of sampling were just skipped as the system was under too much
> > > > pressure.
> > >
> > > I have a tool to reduce that problem (see attached).
> >
> > Now for real
>
> https://bugzilla.kernel.org/show_bug.cgi?id=206117#c6

Let me paste the comment here:

: Using Michal Hocko's read_vmstat.c to record ~12 minutes while
: running 'ninja' to compile webkitgtk. I killed ninja ~20s before
: killing read_vmstat. The system was essentially unusable for 7 minutes
: prior to killing ninja, which could only be done remotely (entering
: and executing the kill command took maybe half a minute; ssh was very
: sluggish). During those 7 minutes, swap was not even half full, but
: the GUI was frozen most of the time, including the mouse pointer.

file:stdout timeout:1.0s buffer_size:10485760
=S 1578635397

So you have redirected the output (stdout) to a file. This is less
effective than letting the tool write to a file directly, because the
program makes sure to preallocate and mlock the output file's data as
well. Anyway, let's have a look at what you managed to gather:

$ grep "=S" bug206117_read_vmstat.out | awk '{diff=$2-prev; if (prev && diff > 1) printf "%d %d\n", prev, diff; prev=$2}'
1578635742 2
1578635755 2
1578635776 2
1578635878 2
1578635943 2
1578635967 2
1578636027 2
1578636053 2
1578636111 3
1578636131 2
1578636138 2

So there are a few instances where the next snapshot was not gathered
the very next second. Most of the time the gap was 2s, which does not
necessarily indicate a problem: with the 1s timestamp resolution this
could simply be bad timing, with the real diff close to 1s. There is
one timestamp where the gap was 3s, though, and that looks like a real
stall (it shouldn't happen with the mlocked output file). Let's focus
on that timestamp (11 min after the start):

1578636111                     diff
pgalloc_dma             4          0
pgalloc_movable         0          0
pgalloc_dma32     6468509      14278
pgalloc_normal   40905599      37286

That is ~200M worth of allocations, which is not much.

pgsteal_kswapd    3384259      10763
pgscan_kswapd     5282874      21772
pgsteal_direct    3189026      30400
pgscan_direct     8372523      85974

That is ~160M reclaimed during that time (10763 + 30400 pages), and the
reclaim was not particularly effective: kswapd 49% and direct reclaim
35%.

nr_active_file      26652      -3564
nr_inactive_file    25731      -4772
nr_active_anon    1502798       6433
nr_inactive_anon   283561       4385

The file LRUs are quite small (~200M) and shrank somewhat, while the
anonymous LRUs eat almost 7G, with the actively referenced list 5 times
bigger than the inactive one (5.8G). That suggests that a large part of
the memory is heavily in use.

pswpin            1222809       4882
pswpout           3475546       7814

Around 30M has been swapped out, but around 60% of that amount has been
swapped back in, which suggests that we are actively refaulting from
swap. Please note that a large number of anonymous refaults on swap
entries might be satisfied from the swap cache; that information is
unfortunately not present in /proc/vmstat.

On the page cache front:

workingset_activate    661011    7566
workingset_refault    1894851   14868
workingset_restore     398375    4622

50% of the refaulted pages get activated right away and 30% are
recognized as part of the workingset (restored).
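As a side note, the per-interval reclaim efficiency could be tracked
over the whole run with a one-liner in the same spirit as the ones
below. This is only a sketch: it assumes that pgsteal_kswapd precedes
pgscan_kswapd within each snapshot, as in the excerpt above.

$ grep -E "=S|pgsteal_kswapd|pgscan_kswapd" bug206117_read_vmstat.out | \
  awk '/=S/{ts=$2}
       /pgsteal_kswapd/{steal=$2}
       /pgscan_kswapd/{scan=$2;
            # efficiency of the interval ending at this snapshot
            if (pscan && scan > pscan)
                printf "%d %.1f\n", ts, (steal-psteal)*100/(scan-pscan);
            psteal=steal; pscan=scan}'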
Now let's see how those numbers evolve over time:

$ grep pswp bug206117_read_vmstat.out | awk '/pswpin/{pswpin=$2} /pswpout/{pswpout=$2; if (pswpout) {print pswpin*100/pswpout}}' | calc_min_max.awk
min: 0.06 max: 35.28 avg: 22.49 std: 11.30 nr: 592

The swapin/swapout ratio builds up over time, which matches the
expectation that the memory demand builds up as well: the workload does
not fit into RAM, yet it still needs the memory that has been swapped
out.

The first swapout happens shortly before we start direct reclaim:

$ awk '/=S/{ts=$2} /allocstall/{if ($2) {print ts; exit}}' bug206117_read_vmstat.out
1578635582
$ awk '/=S/{ts=$2} /pswpout/{if ($2) {print ts; exit}}' bug206117_read_vmstat.out
1578635581

Page cache refaults show somewhat richer dynamics:

$ grep workingset bug206117_read_vmstat.out | awk '/workingset_refault/{workingset_refault=$2} /workingset_activate/{workingset_activate=$2; if (workingset_refault) {print workingset_activate*100/workingset_refault}}'
100
91.453
91.453
91.6201
91.6201
96.587
96.7213
96.7638
96.7777
97.1401
70.2157
[...]
22.4038
22.4447
22.3699
22.3929
22.3724
22.3897
22.3873
22.3846
22.3847
[...]
34.9863
34.9863
34.9847
34.9847
34.9847
34.9846
34.9239

which means that as soon as we start refaulting, almost everything that
comes back gets activated; the activation ratio then decreases for a
while, most likely because we start swapping out more, and it settles
at around 35% in the end.

All that being said, the numbers confirm my initial suspicion. The
workload you are trying to run simply does not fit into memory, and
therefore it has to refault both from disk for the page cache (likely
including executables) and from the swap space.

It is not really obvious whether changing the file vs. anonymous
reclaim balance would help much in this case. As we can see, swapout
starts before tasks get stalled in direct reclaim, and it is not really
clear whether swapping out more would relieve the pressure on the page
cache.

So to me it looks like the reclaim is not doing a terrible job here
considering the memory demand. It would be interesting to see whether
tuning vm.swappiness to 100 helps, but considering how large the
anonymous active list is, I would be very skeptical. So in the end it
is really hard to see what the kernel should have done better in this
overcommitted case. Killing memory hogs would likely kill an active
workload, which would lead to a better desktop experience, but I can
imagine setups which simply want to get the work done, albeit
sloooowly.
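In case you want to give vm.swappiness a try: it can be flipped at
runtime via the standard sysctl interface (as root; this is generic
and not persistent across reboots):

$ sysctl -w vm.swappiness=100

or equivalently

$ echo 100 > /proc/sys/vm/swappiness

--
Michal Hocko
SUSE Labs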