On Thu, Jan 9, 2020 at 2:49 PM Pavel Machek <pavel@xxxxxx> wrote:
>
> Hi!
>
> > > > > Do we agree that OOM killer should have reacted way sooner?
> > > >
> > > > This is impossible to answer without knowing what was going on at the
> > > > time. Was the system threshing over page cache/swap? In other words, is
> > > > the system completely out of memory or refaulting the working set all
> > > > the time because it doesn't fit into memory?
> > >
> > > Swap was full, so "completely out of memory", I guess. Chromium does
> > > that fairly often :-(.
> >
> > The oom heuristic is based on the reclaim failure. If the reclaim makes
> > some progress then the oom killer is not hit. Have a look at
> > should_reclaim_retry for more details.
>
> Thanks for pointer.
>
> I guess setting MAX_RECLAIM_RETRIES to 1 is not something you'd
> recommend? :-).
>
> > > PSI is completely different system, but I guess
> > > I should attempt to tweak the existing one first...
> >
> > PSI is measuring the cost of the allocation (among other things) and
> > that can give you some idea on how much time is spent to get memory.
> > Userspace can implement a policy based on that and act. The kernel oom
> > killer is the last resort when there is really no memory to
> > allocate.
>
> So what I'm seeing is system that is unresponsive, easily for an hour.
>
> Sometimes, I'm able to log in. When I could do that, system was
> absurdly slow, like ps printing at more than 10 seconds per line.
> ps on my system takes 300msec, estimate in the slow case would be 2000
> seconds, that is slowdown by factor of 6000x. That would be X terminal
> opening in like two hours... that's not really usable.
>
> DRAM is in 100nsec range, disk is in 10msec range; so worst case
> slowdown is somewhere in 100000x range. (Actually, in the worst case
> userland will do no progress at all, since you can need at 4+ pages in
> single CPU instruction, right?)
>
> But kernel is happy; system is unusable and will stay unusable for
> hour or more, and there's not much user can do. (Besides sysrq, thanks
> for the hint).
>
> Can we do better? This is equivalent of system crash, and it is _way_
> too easy to trigger. Should we do better by default?
>
> Dunno. If user moved the mouse, and cursor did not move for 10
> seconds, perhaps it is time for oom kill?
>
> Or should I add more swap? Is it terrible to place swap on SSD?
>

What's the kernel version? How much memory is anon and file pages?
What's your swap to DRAM ratio? Are you using in-memory compression
based swap? Have you tried to disable swap completely?

Shakeel
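
For anyone wanting to collect the numbers asked for above in one go, here is a minimal sketch. It is not from this thread; it assumes a Linux /proc/meminfo with the usual field names (AnonPages, Active(file), Inactive(file), MemTotal, SwapTotal, SwapFree) and, for the pressure figures, a kernel built with PSI so that /proc/pressure/memory exists. The helper names parse_meminfo, summarize and parse_psi are made up for illustration.

```python
import platform
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Most fields are reported in kB; a few (HugePages_*) are plain counts.
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def summarize(info):
    """Anon vs. file-backed memory, swap in use, and swap:DRAM ratio."""
    mem_total = info.get("MemTotal", 0)
    swap_total = info.get("SwapTotal", 0)
    return {
        "anon_kb": info.get("AnonPages", 0),
        "file_kb": info.get("Active(file)", 0) + info.get("Inactive(file)", 0),
        "swap_used_kb": swap_total - info.get("SwapFree", 0),
        "swap_to_dram": swap_total / mem_total if mem_total else 0.0,
    }


def parse_psi(text):
    """Parse /proc/pressure/memory text into {"some": {...}, "full": {...}}."""
    psi = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        psi[parts[0]] = {
            key: float(value)
            for key, value in (item.split("=", 1) for item in parts[1:])
        }
    return psi


if __name__ == "__main__":
    print("kernel:", platform.release())
    meminfo_path = Path("/proc/meminfo")
    if meminfo_path.exists():
        print(summarize(parse_meminfo(meminfo_path.read_text())))
    psi_path = Path("/proc/pressure/memory")  # absent on kernels without PSI
    if psi_path.exists():
        print(parse_psi(psi_path.read_text()))
```

Running it once while the machine is healthy and once when it starts to crawl (if a shell is still reachable) would give comparable figures. It does not detect whether swap is zram or zswap backed; that question still needs a separate answer.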