Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback

Jan Kara <jack@xxxxxxx> · Tue, 29 Mar 2016 18:49:42 +0200

On Tue 29-03-16 10:54:35, Michal Hocko wrote:
> On Thu 24-03-16 14:17:14, Andrew Morton wrote:
> > On Thu, 24 Mar 2016 23:03:16 +0900 Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> > 
> > > Andrew, can you take this patch?
> > 
> > Tejun.
> > 
> > > ----------------------------------------
> > > >From 5d43acbc5849a63494a732e39374692822145923 Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> > > Date: Sun, 13 Mar 2016 23:03:05 +0900
> > > Subject: [PATCH] mm,writeback: Don't use memory reserves for
> > >  wb_start_writeback
> > > 
> > > When writeback operation cannot make forward progress because memory
> > > allocation requests needed for doing I/O cannot be satisfied (e.g.
> > > under OOM-livelock situation), we can observe flood of order-0 page
> > > allocation failure messages caused by complete depletion of memory
> > > reserves.
> > > 
> > > This is caused by unconditionally allocating "struct wb_writeback_work"
> > > objects using GFP_ATOMIC from PF_MEMALLOC context.
> > > 
> > > __alloc_pages_nodemask() {
> > >   __alloc_pages_slowpath() {
> > >     __alloc_pages_direct_reclaim() {
> > >       __perform_reclaim() {
> > >         current->flags |= PF_MEMALLOC;
> > >         try_to_free_pages() {
> > >           do_try_to_free_pages() {
> > >             wakeup_flusher_threads() {
> > >               wb_start_writeback() {
> > >                 kzalloc(sizeof(*work), GFP_ATOMIC) {
> > >                   /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
> > >                 }
> > >               }
> > >             }
> > >           }
> > >         }
> > >         current->flags &= ~PF_MEMALLOC;
> > >       }
> > >     }
> > >   }
> > > }
> > > 
> > > Since I/O is stalling, allocating writeback requests forever shall deplete
> > > memory reserves. Fortunately, since wb_start_writeback() can fall back to
> > > wb_wakeup() when allocating "struct wb_writeback_work" failed, we don't
> > > need to allow wb_start_writeback() to use memory reserves.
> > > 
> > > ...
> > >
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> > >  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> > >  	 * wakeup the thread for old dirty data writeback
> > >  	 */
> > > -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> > > +	work = kzalloc(sizeof(*work),
> > > +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > >  	if (!work) {
> > >  		trace_writeback_nowork(wb);
> > >  		wb_wakeup(wb);
> > 
> > Oh geeze.  fs/fs-writeback.c has grown waaay too many GFP_ATOMICs :(
> > 
> > How does this actually all work?
> 
> Jack has explained it a bit
> http://lkml.kernel.org/r/20160318131136.GE7152@xxxxxxxxxxxxx
> 
> > afaict if we fail this
> > wb_writeback_work allocation, wb_workfn->wb_do_writeback will later say
> > "hey, there are no work items!" and will do nothing at all.  Or does
> > wb_workfn() fall into write-1024-pages-anyway mode and if so, how did
> > it know how to do that?

We will end up in wb_do_writeback() which finds there's no work item so it
falls back to doing default background writeback (i.e., write out until
number of dirty pages is below background_dirty_limit).

> > If we had (say) a mempool of wb_writeback_work's (at least for for
> > wb_start_writeback), would that help anything?  Or would writeback
> > simply fail shortly afterwards for other reasons?

Not sure mempools would significantly improve the situation. Writeback code
is able to deal with the failed allocation so I think the issue remains
more with writeback code mostly pointlessly exhausting memory reserves with
atomic allocations.

I think it is somewhat dumb from do_try_to_free_pages() that it calls
wakeup_flusher_threads() so often (I guess it can quickly end up asking to
write more than it is ever sensible to ask). Admittedly it is also dumb from
the writeback code that it is not able to merge requests for writeback - we
could easily merge items created by wb_start_writeback() with matching
'reason' and 'range_cyclic'.

I'm not sure how easy it is to fix the first thing, I think improving the
second one may be worth it and I can have a look at that.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>