On Mon, Dec 22, 2014 at 03:24:35PM +0100, Michal Hocko wrote: > On Sat 20-12-14 17:18:24, Vladimir Davydov wrote: > > On Sat, Dec 20, 2014 at 11:47:46AM +0100, Michal Hocko wrote: > > > On Fri 19-12-14 21:28:15, Vladimir Davydov wrote: > > > > So AFAIU the problem does exist. However, I think it could be fixed by > > > > simply waking up all processes waiting on pfmemalloc_wait before putting > > > > kswapd to sleep: > > > > > > I think that a simple cond_resched() in kswapd_try_to_sleep should be > > > sufficient and less risky fix, so basically what Vlastimil was proposing > > > in the beginning. > > > > With such a solution we implicitly rely upon the scheduler > > implementation, which AFAIU is wrong. > > But this is a scheduling problem, isn't it? !PREEMPT kernel with a > kernel thread looping without a scheduling point which prevents the > woken task to run due to cpu affinity... I wouldn't say so. To me it looks more like an abuse of the workqueue API. AFAIU an abstract scheduler algorithm isn't supposed to guarantee that an arbitrary process will ever get scheduled unless the CPU is idle. Of course, my example below looks contrived. Nobody will raise kswapd priority for it to preempt other processes. However, IMO we shouldn't rely on that in the mm subsys, which is rather orthogonal to the sched. > > > E.g. suppose processes are > > governed by FIFO and kswapd happens to have a higher prio than the > > process killed by OOM. Then after cond_resched kswapd will be picked for > > execution again, and the killing process won't have a chance to remove > > itself from the wait queue. > > Except that kswapd runs as SCHED_NORMAL with 0 priority. > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > index 744e2b491527..2a123634c220 100644 > > > > --- a/mm/vmscan.c > > > > +++ b/mm/vmscan.c > > > > @@ -2984,6 +2984,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, > > > > if (remaining) > > > > return false; > > > > > > > > + if (!pgdat_balanced(pgdat, order, classzone_idx)) > > > > + return false; > > > > + > > > > > > What would be consequences of not waking up pfmemalloc waiters while the > > > node is not balanced? > > > > They will get woken up a bit later in balanced_pgdat. This might result > > in latency spikes though. In order not to change the original behaviour > > we could always wake all pfmemalloc waiters no matter if we are going to > > sleep or not: > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 744e2b491527..a21e0bd563c3 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2993,10 +2993,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, > > * so wake them now if necessary. If necessary, processes will wake > > * kswapd and get throttled again > > */ > > - if (waitqueue_active(&pgdat->pfmemalloc_wait)) { > > - wake_up(&pgdat->pfmemalloc_wait); > > - return false; > > - } > > + wake_up_all(&pgdat->pfmemalloc_wait); > > > > return pgdat_balanced(pgdat, order, classzone_idx); > > So you are relying on scheduling points somewhere down the > balance_pgdat. That should be sufficient. I am still quite surprised > that we have an OOM victim still on the queue and balanced pgdat here > because OOM victim didn't have chance to free memory. So somebody else > must have released a lot of memory after OOM. > > This patch seems better than the one from Vlastimil. Care to post it > with the full changelog, please? Attached below (merged with 2/2). I haven't checked that it does fix the issue, because I don't have the reproducer, so it should be committed only if Vlastimil approves it. From: Vlastimil Babka <vbabka@xxxxxxx> Subject: [PATCH] mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing to sleep. Their analysis found the cause to be a combination of several factors: 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait 2. The process has been killed (by OOM in this case), but has not yet been scheduled to remove itself from the waitqueue and die. 3. kswapd checks for throttled processes in prepare_kswapd_sleep() and do not put itself to sleep if there are any: if (waitqueue_active(&pgdat->pfmemalloc_wait)) { wake_up(&pgdat->pfmemalloc_wait); return false; } However, for a process that was already killed, wake_up() does not remove the process from the waitqueue, since try_to_wake_up() checks its state first and returns false when the process is no longer waiting. 4. kswapd is running on the same CPU as the only CPU that the process is allowed to run on (through cpus_allowed, or possibly single-cpu system). 5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd encounters no voluntary preemption points and repeatedly fails prepare_kswapd_sleep(), blocking the process from running and removing itself from the waitqueue, which would let kswapd sleep. So, the source of the problem is that we prevent kswapd from going to sleep until there are processes waiting on the pfmemalloc_wait queue, and a process waiting on a queue is guaranteed to be removed from the queue only when it gets scheduled. This was done to avoid the race between kswapd checking pfmemalloc_wait and a process getting throttled as the comment in prepare_kswapd_sleep() explains. However, it isn't necessary to postpone kswapd sleep until the pfmemalloc_wait queue empties. To eliminate the race, it's actually enough to guarantee that all processes waiting on pfmemalloc_wait queue have been woken up by the time we put kswapd to sleep. This patch therefore fixes this issue by substituting 'wake_up' with 'wake_up_all' and removing 'return false' in the code snippet from prepare_kswapd_sleep() above. Also, it replaces wake_up with wake_up_all in balance_pgdat(), because: - using wake_up there might leave processes waiting for longer than necessary, until the check is reached in the next loop iteration; - processes might also be left waiting even if zone was fully balanced in single iteration; - the comment says "wake them" so the author of the commit that introduced pfmemalloc_wait seemed to mean wake_up_all; - this corresponds to how we wake processes waiting on pfmemalloc_wait in prepare_kswapd_sleep. Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage") Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx> Signed-off-by: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx> Cc: <stable@xxxxxxxxxxxxxxx> # v3.6+ Cc: Mel Gorman <mgorman@xxxxxxx> Cc: Johannes Weiner <hannes@xxxxxxxxxxx> Cc: Michal Hocko <mhocko@xxxxxxx> Cc: Rik van Riel <riel@xxxxxxxxxx> --- Changes in v2: - instead of introducing yet another schedule() point in kswapd_try_to_sleep(), allow kswapd to sleep even if the pfmemalloc_wait queue is active, waking *all* throttled processes before going to sleep mm/vmscan.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 5e8772b2b9ef..65287944b2cf 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2961,10 +2961,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, * so wake them now if necessary. If necessary, processes will wake * kswapd and get throttled again */ - if (waitqueue_active(&pgdat->pfmemalloc_wait)) { - wake_up(&pgdat->pfmemalloc_wait); - return false; - } + if (waitqueue_active(&pgdat->pfmemalloc_wait)) + wake_up_all(&pgdat->pfmemalloc_wait); return pgdat_balanced(pgdat, order, classzone_idx); } @@ -3205,7 +3203,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, */ if (waitqueue_active(&pgdat->pfmemalloc_wait) && pfmemalloc_watermark_ok(pgdat)) - wake_up(&pgdat->pfmemalloc_wait); + wake_up_all(&pgdat->pfmemalloc_wait); /* * Fragmentation may mean that the system cannot be rebalanced -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html