On Mon, Aug 05, 2019 at 03:31:19PM +0200, Michal Hocko wrote:
> On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > Hello,
> > >
> > > There's this bug which has been bugging many people for many years
> > > already and which is reproducible in less than a few minutes under
> > > the latest and greatest kernel, 5.2.6. All the kernel parameters
> > > are set to defaults.
> > >
> > > Steps to reproduce:
> > >
> > > 1) Boot with mem=4G
> > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > > 4) Start opening tabs in either of them and watch your free RAM
> > >    decrease
> > >
> > > Once you hit a situation when opening a new tab requires more RAM
> > > than is currently available, the system will stall hard. You will
> > > barely be able to move the mouse pointer. Your disk LED will be
> > > flashing incessantly (I'm not entirely sure why). You will not be
> > > able to run new applications or close currently running ones.
> > >
> > > This little crisis may continue for minutes or even longer. I
> > > think that's not how the system should behave in this situation.
> > > I believe something must be done about that to avoid this stall.
> >
> > Yeah, that's a known problem, made worse by SSDs in fact, as they
> > are able to keep refaulting the last remaining file pages fast
> > enough, so there is still apparent progress in reclaim and OOM
> > doesn't kick in.
> >
> > At this point, the likely solution will probably be based on
> > pressure stall monitoring (PSI). I don't know how far we are from a
> > built-in monitor with reasonable defaults for a desktop workload,
> > so CCing relevant folks.
>
> Another potential approach would be to consider the refault
> information we already have for file-backed pages. Once we start
> reclaiming only workingset pages, then we should be thrashing, right?
> It cannot be as precise as the cost model which can be defined around
> PSI, but it might give us at least a fallback measure.

NAK, this does *not* work. Not even as a fallback.

There is no amount of refaults for which you can say whether they are
a problem or not. It depends on the disk speed (obvious) but also on
the workload's memory access patterns (somewhat less obvious).

For example, we have workloads whose cache set doesn't quite fit into
memory, but everything else is pretty much statically allocated and it
rarely touches any new or one-off filesystem data. So there is always
a steady rate of mostly uninterrupted refaults; however, most data
accesses are hitting the cache! And we have fast SSDs that compensate
for the refaults that do occur. The workload runs *completely fine*.

If the cache hit rate were lower and refaults made up a bigger share
of overall page accesses, or if there were a spinning disk in that
machine, the machine would be completely livelocked - with the exact
same number of refaults and the same amount of RAM!

That's not just an approximation error that we could compensate for.
The same rate of refaults in a system could mean anything from 0%
memory pressure (all refaults are readahead, and the IO is done before
the workload notices) to 100% memory pressure (all refaults are cache
misses and the workload is fully serialized on the pages in question)
- and anything in between (a subset of the workload's threads wait for
a subset of the refaults). The refault rate by itself carries no
signal about workload progress.
This is the whole reason why psi was developed: to compare the time
you spend on refaults (which encodes IO speed and readahead
efficiency) with the time you spend being productive (which encodes
refaults as a share of the workload's overall memory accesses).
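
For anyone who wants to experiment with this from userspace today: a
monitor can register a psi trigger on /proc/pressure/memory and get
woken up whenever stall time crosses a threshold within a time window.
Below is a minimal sketch along the lines of the example in
Documentation/accounting/psi.rst; the "150ms of stall per 1s window"
trigger is just a placeholder that a real desktop monitor would have
to tune, and the response is only a printf where a real monitor would
actually do something (kill a task, notify the user, etc.):

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Wake us up when tasks stall on memory for a total of 150ms
	 * within any 1s window ("some" = at least one task stalled;
	 * placeholder threshold, tune for your workload).
	 */
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;
	int n;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		fprintf(stderr, "open: %s\n", strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	/* Register the trigger by writing it to the pressure file. */
	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		fprintf(stderr, "write: %s\n", strerror(errno));
		return 1;
	}

	while (1) {
		/* The kernel signals threshold crossings via POLLPRI. */
		n = poll(&fds, 1, -1);
		if (n < 0) {
			fprintf(stderr, "poll: %s\n", strerror(errno));
			return 1;
		}
		if (fds.revents & POLLERR) {
			fprintf(stderr, "event source is gone\n");
			return 0;
		}
		if (fds.revents & POLLPRI)
			printf("memory pressure threshold crossed\n");
	}

	return 0;
}

If polling is overkill, simply reading /proc/pressure/memory gives the
running avg10/avg60/avg300 percentages and the cumulative stall time,
which is enough for a periodic check.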