Hello, Jan. Assuming that you will find a better solution, can we apply this patch for now to stop bleeding? This problem frequently prevents me from testing OOM livelock condition. Jan Kara wrote: > On Tue 29-03-16 10:54:35, Michal Hocko wrote: > > On Thu 24-03-16 14:17:14, Andrew Morton wrote: > > > On Thu, 24 Mar 2016 23:03:16 +0900 Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote: > > > > > > > Andrew, can you take this patch? > > > > > > Tejun. > > > > > > > ---------------------------------------- > > > > >From 5d43acbc5849a63494a732e39374692822145923 Mon Sep 17 00:00:00 2001 > > > > From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> > > > > Date: Sun, 13 Mar 2016 23:03:05 +0900 > > > > Subject: [PATCH] mm,writeback: Don't use memory reserves for > > > > wb_start_writeback > > > > > > > > When writeback operation cannot make forward progress because memory > > > > allocation requests needed for doing I/O cannot be satisfied (e.g. > > > > under OOM-livelock situation), we can observe flood of order-0 page > > > > allocation failure messages caused by complete depletion of memory > > > > reserves. > > > > > > > > This is caused by unconditionally allocating "struct wb_writeback_work" > > > > objects using GFP_ATOMIC from PF_MEMALLOC context. > > > > > > > > __alloc_pages_nodemask() { > > > > __alloc_pages_slowpath() { > > > > __alloc_pages_direct_reclaim() { > > > > __perform_reclaim() { > > > > current->flags |= PF_MEMALLOC; > > > > try_to_free_pages() { > > > > do_try_to_free_pages() { > > > > wakeup_flusher_threads() { > > > > wb_start_writeback() { > > > > kzalloc(sizeof(*work), GFP_ATOMIC) { > > > > /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */ > > > > } > > > > } > > > > } > > > > } > > > > } > > > > current->flags &= ~PF_MEMALLOC; > > > > } > > > > } > > > > } > > > > } > > > > > > > > Since I/O is stalling, allocating writeback requests forever shall deplete > > > > memory reserves. Fortunately, since wb_start_writeback() can fall back to > > > > wb_wakeup() when allocating "struct wb_writeback_work" failed, we don't > > > > need to allow wb_start_writeback() to use memory reserves. > > > > > > > > ... > > > > > > > > --- a/fs/fs-writeback.c > > > > +++ b/fs/fs-writeback.c > > > > @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, > > > > * This is WB_SYNC_NONE writeback, so if allocation fails just > > > > * wakeup the thread for old dirty data writeback > > > > */ > > > > - work = kzalloc(sizeof(*work), GFP_ATOMIC); > > > > + work = kzalloc(sizeof(*work), > > > > + GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN); > > > > if (!work) { > > > > trace_writeback_nowork(wb); > > > > wb_wakeup(wb); > > > > > > Oh geeze. fs/fs-writeback.c has grown waaay too many GFP_ATOMICs :( > > > > > > How does this actually all work? > > > > Jack has explained it a bit > > http://lkml.kernel.org/r/20160318131136.GE7152@xxxxxxxxxxxxx > > > > > afaict if we fail this > > > wb_writeback_work allocation, wb_workfn->wb_do_writeback will later say > > > "hey, there are no work items!" and will do nothing at all. Or does > > > wb_workfn() fall into write-1024-pages-anyway mode and if so, how did > > > it know how to do that? > > We will end up in wb_do_writeback() which finds there's no work item so it > falls back to doing default background writeback (i.e., write out until > number of dirty pages is below background_dirty_limit). > > > > If we had (say) a mempool of wb_writeback_work's (at least for for > > > wb_start_writeback), would that help anything? Or would writeback > > > simply fail shortly afterwards for other reasons? > > Not sure mempools would significantly improve the situation. Writeback code > is able to deal with the failed allocation so I think the issue remains > more with writeback code mostly pointlessly exhausting memory reserves with > atomic allocations. > > I think it is somewhat dumb from do_try_to_free_pages() that it calls > wakeup_flusher_threads() so often (I guess it can quickly end up asking to > write more than it is ever sensible to ask). Admittedly it is also dumb from > the writeback code that it is not able to merge requests for writeback - we > could easily merge items created by wb_start_writeback() with matching > 'reason' and 'range_cyclic'. > > I'm not sure how easy it is to fix the first thing, I think improving the > second one may be worth it and I can have a look at that. > > Honza > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>