On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
[...]
> Both of these logs are attached in a tarball.

Thanks! First of all

$ grep pswp vmstat.1582318979
pswpin 0
pswpout 0

suggests that you do not have any swap storage, right? I will get back
to this later.

Now, let's have a look at the snapshots. We have regular 1s snapshots
initially, but then we have

vmstat.1582318734
vmstat.1582318736
vmstat.1582318758
vmstat.1582318763
vmstat.1582318768
[...]
vmstat.1582318965
vmstat.1582318975
vmstat.1582318976

That is a 242s period during which even a simple bash script was
struggling to write a snapshot of /proc/vmstat, something which by
itself shouldn't really depend on the system activity much.

Let's have a look at two randomly chosen consecutive snapshots from
this period, vmstat.1582318736 and vmstat.1582318758:

                            base      diff
allocstall_dma                 0         0
allocstall_dma32               0         0
allocstall_movable          5773         0
allocstall_normal            906         0

To my surprise there was no invocation of direct reclaim in this
period, even though I would have expected it considering the long
stall. So the source of the stall might be something other than direct
reclaim.

compact_stall                 13         1

Direct compaction has been invoked, but that shouldn't cause a major
stall for all processes.

nr_active_anon            133932       236
nr_inactive_anon            9350     -1179
nr_active_file               318       190
nr_inactive_file             673        56
nr_unevictable             51984         0

The amount of anonymous memory is not really high (133932 + 9350
pages, i.e. ~560MB with 4kB pages), but the file LRU is _really_ low
(991 pages, ~4MB) and the unevictable list is at ~200MB. That gets us
to ~760MB, which is 74% of the memory. Please note that your mem=2G
setup in fact gives you only 1G of memory (based on the zone_info you
have posted).

That by itself is not unusual, but such a tiny page cache is worrying
because I would expect heavy thrashing: most of the executables are
going to require major faults, and since anonymous memory obviously
cannot be swapped out, there is no option other than to refault
constantly.

pgscan_kswapd           64788716  14157035
pgsteal_kswapd          29378868   4393216
pswpin                         0         0
pswpout                        0         0
workingset_activate      3840226    169674
workingset_refault      29396942   4393013
workingset_restore       2883042    106358

And here we can see it clearly happening. Note how the number of
workingset refaults (4393013) matches the number of pages reclaimed by
kswapd (4393216) almost exactly: whatever kswapd frees is immediately
faulted back in. I would be really curious whether adding swap space
would help some.

Now to your patch and why it helps here. It seems quite obvious that
the only effectively reclaimable memory (the page cache) is not going
to satisfy the high watermark target:

Node 0, zone    DMA32
  pages free     87925
        min      11090
        low      13862
        high     16634

kswapd has a feedback mechanism to back off when the zone is hopeless
from the reclaim point of view AFAIR, but it seems to have failed in
this particular situation. It should have given up and deferred to
direct reclaim, which would eventually trigger the OOM killer. Your
patch works around this by bailing out of the kswapd reclaim early, so
the part of the page cache needed for the executables to make progress
stays resident and the system can move forward.

The proper fix should, however, check the amount of reclaimable pages
and back off if they cannot meet the target IMO. We cannot rely on
general reclaimability here, because that could really be just
thrashing. I am appending a few (untested) sketches below to
illustrate what I mean.

Thoughts?
--
Michal Hocko
SUSE Labs
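
P.S. A small tool like the following can automate the per-counter
diffing done above. This is an untested sketch, not anything that was
actually used here; a snapshot is just the "name value" lines of
/proc/vmstat:

/*
 * vmstat-diff: print base value and delta for every counter that
 * appears in both snapshots.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_COUNTERS 512

struct counter {
	char name[64];
	unsigned long long val;
};

static int read_snapshot(const char *path, struct counter *c, int max)
{
	FILE *f = fopen(path, "r");
	int n = 0;

	if (!f) {
		perror(path);
		exit(1);
	}
	while (n < max && fscanf(f, "%63s %llu", c[n].name, &c[n].val) == 2)
		n++;
	fclose(f);
	return n;
}

int main(int argc, char **argv)
{
	static struct counter base[MAX_COUNTERS], new[MAX_COUNTERS];
	int nb, nn, i, j;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <base> <new>\n", argv[0]);
		return 1;
	}
	nb = read_snapshot(argv[1], base, MAX_COUNTERS);
	nn = read_snapshot(argv[2], new, MAX_COUNTERS);

	printf("%-24s %12s %12s\n", "", "base", "diff");
	for (i = 0; i < nb; i++) {
		for (j = 0; j < nn; j++) {
			if (strcmp(base[i].name, new[j].name))
				continue;
			printf("%-24s %12llu %+12lld\n", base[i].name,
			       base[i].val,
			       (long long)(new[j].val - base[i].val));
			break;
		}
	}
	return 0;
}

$ ./vmstat-diff vmstat.1582318736 vmstat.1582318758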
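
The back-off mechanism I was referring to is the kswapd_failures logic
in mm/vmscan.c (balance_pgdat() bumps the counter on a no-progress
round, prepare_kswapd_sleep() gives up after MAX_RECLAIM_RETRIES of
them, and shrink_node() resets it on any progress). The following is a
userspace toy model of that logic as I remember it, not the kernel
code, but it shows why the detection never fires in your workload:

/*
 * Toy model of the kswapd back-off as I remember it: no-progress
 * rounds accumulate, MAX_RECLAIM_RETRIES of them make kswapd give up,
 * but _any_ progress resets the counter.
 */
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16

static int kswapd_failures;

/* one reclaim round; returns the number of pages "reclaimed" */
static unsigned long reclaim_round(unsigned long reclaimable)
{
	/* thrashing cache: pages are freed and immediately refaulted,
	 * so every round there is *something* to steal again */
	return reclaimable;
}

int main(void)
{
	unsigned long round, reclaimed;

	for (round = 0; round < 100; round++) {
		reclaimed = reclaim_round(991);	/* ~4MB of file LRU */
		if (reclaimed)
			kswapd_failures = 0;	/* shrink_node() */
		else
			kswapd_failures++;	/* balance_pgdat() */

		if (kswapd_failures >= MAX_RECLAIM_RETRIES) {
			puts("kswapd gives up -> direct reclaim/OOM");
			return 0;
		}
	}
	puts("kswapd never backed off: every round reclaimed something");
	return 0;
}

That would explain the failure here: kswapd _is_ making nominal
progress every round (pgsteal_kswapd grew by ~4.4M pages) because the
thrashing page cache is always reclaimable on paper, so the failure
counter keeps getting reset and the zone never looks hopeless.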
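
And to illustrate the direction I mean for the proper fix, again as a
completely untested userspace toy: back off when even reclaiming every
reclaimable page cannot close the gap to the high watermark. In the
kernel this would map onto zone_reclaimable_pages(), zone_page_state()
and high_wmark_pages(); where exactly to hook the check
(pgdat_balanced()?) is hand-waved, and the free-page value below is an
assumption (the check only matters once free has dipped under high):

#include <stdio.h>
#include <stdbool.h>

struct zone_model {
	unsigned long free;
	unsigned long reclaimable;	/* file LRU; no swap, so no anon */
	unsigned long high_wmark;
};

/* the target is unreachable if reclaiming everything is not enough */
static bool target_unreachable(const struct zone_model *z)
{
	return z->free + z->reclaimable < z->high_wmark;
}

int main(void)
{
	/* roughly your report: ~1000 file pages, high watermark 16634;
	 * free is an assumed value below the watermark */
	struct zone_model z = {
		.free = 5000,
		.reclaimable = 991,
		.high_wmark = 16634,
	};

	if (target_unreachable(&z))
		puts("back off: leave it to direct reclaim / OOM killer");
	else
		puts("keep reclaiming towards the high watermark");
	return 0;
}

Note the caveat from above applies: a plain zone_reclaimable_pages()
count would happily include the thrashing page cache, so a real fix
would probably have to factor in the refault (workingset) information
one way or another.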