On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote:
> On Tue, Aug 6, 2019 at 7:36 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > On Tue 06-08-19 10:27:28, Johannes Weiner wrote:
> > > On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote:
> > > > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote:
> > > > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
> > > > >>         return 0;
> > > > >>  }
> > > > >>  module_init(psi_proc_init);
> > > > >> +
> > > > >> +#define OOM_PRESSURE_LEVEL 80
> > > > >> +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC)
> > > > >
> > > > > 80% of the last 10 seconds spent in full stall would definitely be a
> > > > > problem. If the system was already low on memory (which it probably
> > > > > is, or we would not be reclaiming so hard and registering such a big
> > > > > stall) then oom-killer would probably kill something before 8 seconds
> > > > > are passed.
> > > >
> > > > If oom killer can act faster, than great! On small embedded systems you probably
> > > > don't enable PSI anyway?
>
> We use PSI triggers with 1 sec tracking window. PSI averages are less
> useful on such systems because in 10 secs (which is the shortest PSI
> averaging window) memory conditions can change drastically.
>
> > > > > If my line of thinking is correct, then do we really
> > > > > benefit from such additional protection mechanism? I might be wrong
> > > > > here because my experience is limited to embedded systems with
> > > > > relatively small amounts of memory.
> > > >
> > > > Well, Artem in his original mail describes a minutes long stall. Things are
> > > > really different on a fast desktop/laptop with SSD. I have experienced this as
> > > > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than
> > > > 8GB in the laptop). IMHO the default limit should be set so that the user
> > > > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10
> > > > seconds should be fine.
> > >
> > > That's exactly what I have experienced in the past, and this was also
> > > the consistent story in the bug reports we have had.
> > >
> > > I suspect it requires a certain combination of RAM size, CPU speed,
> > > and IO capacity: the OOM killer kicks in when reclaim fails, which
> > > happens when all scanned LRU pages were locked and under IO. So IO
> > > needs to be slow enough, or RAM small enough, that the CPU can scan
> > > all LRU pages while they are temporarily unreclaimable (page lock).
> > >
> > > It may well be that on phones the RAM is small enough relative to CPU
> > > size.
> > >
> > > But on desktops/servers, we frequently see that there is a wider
> > > window of memory consumption in which reclaim efficiency doesn't drop
> > > low enough for the OOM killer to kick in. In the time it takes the CPU
> > > to scan through RAM, enough pages will have *just* finished reading
> > > for reclaim to free them again and continue to make "progress".
> > >
> > > We do know that the OOM killer might not kick in for at least 20-25
> > > minutes while the system is entirely unresponsive. People usually
> > > don't wait this long before forcibly rebooting. In a managed fleet,
> > > ssh heartbeat tests eventually fail and force a reboot.
>
> Got it. Thanks for the explanation.
>
> > > I'm not sure 10s is the perfect value here, but I do think the kernel
> > > should try to get out of such a state, where interacting with the
> > > system is impossible, within a reasonable amount of time.
> > >
> > > It could be a little too short for non-interactive number-crunching
> > > systems...
> >
> > Would it be possible to have a module with tunning knobs as parameters
> > and hook into the PSI infrastructure? People can play with the setting
> > to their need, we wouldn't really have think about the user visible API
> > for the tuning and this could be easily adopted as an opt-in mechanism
> > without a risk of regressions.

It's relatively easy to trigger a livelock that disables the entire
system for good, as a regular user. It's a little weird to make the
bug fix for that an opt-in with an extensive configuration interface.

This isn't like the hung task watch dog, where it's likely some kind
of kernel issue, right? This can happen on any current kernel.

What I would like to have is a way of self-recovery from a
livelock. I don't mind making it opt-out in case we make mistakes, but
the kernel should provide minimal self-protection out of the box, IMO.

> PSI averages stalls over 10, 60 and 300 seconds, so implementing 3
> corresponding thresholds would be easy. The patch Johannes posted can
> be extended to support 3 thresholds instead of 1. I can take a stab at
> it if Johannes is busy.
> If we want more flexibility we could use PSI triggers with
> configurable tracking window but that's more complex and probably not
> worth it.

This goes into quality-of-service for workloads territory again. I'm
not quite convinced yet we want to go there.
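
For illustration only, here is a rough userspace sketch of what a
check with three thresholds over the 10, 60 and 300 second averages
could look like. It is not the posted patch and not kernel code. It
assumes the "full avg10=... avg60=... avg300=... total=..." line
format exported in /proc/pressure/memory. The struct and function
names are hypothetical, and only the 80% / 10s pair comes from the
patch quoted above; any avg60 and avg300 limits would be made-up
placeholders.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical limits: percentage of each averaging window spent in
 * full stall. Not part of any kernel interface.
 */
struct psi_limits {
	double avg10;
	double avg60;
	double avg300;
};

/*
 * Parse one "full ..." line as found in /proc/pressure/memory.
 * Returns true only if all three averages were read; a "some ..."
 * line does not match and yields false.
 */
static bool psi_parse_full(const char *line, double *a10, double *a60,
			   double *a300)
{
	assert(line && a10 && a60 && a300);
	return sscanf(line, "full avg10=%lf avg60=%lf avg300=%lf",
		      a10, a60, a300) == 3;
}

/* True if any of the three averages has reached its limit. */
static bool psi_over_limit(const char *line, const struct psi_limits *lim)
{
	double a10, a60, a300;

	if (!psi_parse_full(line, &a10, &a60, &a300))
		return false;

	return a10 >= lim->avg10 || a60 >= lim->avg60 ||
	       a300 >= lim->avg300;
}
```

A caller would read the "full" line from /proc/pressure/memory and
pass it to psi_over_limit() together with its chosen limits. Whether
crossing any one limit should be enough, or all three should be
required, is a policy question this sketch does not try to answer.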