Andrew Morton wrote:
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> >  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> >  	 * wakeup the thread for old dirty data writeback
> >  	 */
> > -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> > +	work = kzalloc(sizeof(*work),
> > +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> >  	if (!work) {
> >  		trace_writeback_nowork(wb);
> >  		wb_wakeup(wb);
>
> Oh geeze. fs/fs-writeback.c has grown waaay too many GFP_ATOMICs :(
>
> How does this actually all work? afaict if we fail this
> wb_writeback_work allocation, wb_workfn->wb_do_writeback will later say
> "hey, there are no work items!" and will do nothing at all. Or does
> wb_workfn() fall into write-1024-pages-anyway mode and if so, how did
> it know how to do that?
>
> If we had (say) a mempool of wb_writeback_work's (at least for
> wb_start_writeback), would that help anything? Or would writeback
> simply fail shortly afterwards for other reasons?

I tried http://lkml.kernel.org/r/20160318133417.GB30225@xxxxxxxxxxxxxx
which would reduce the number of wb_writeback_work allocations compared to
this patch, and I got
http://lkml.kernel.org/r/201603172035.CJH95337.SOJOFFFHMLOQVt@xxxxxxxxxxxxxxxxxxx
where wb_workfn() got stuck after all once we started using memory reserves.

Having a mempool for wb_writeback_work is not sufficient: there are
further allocations after wb_workfn() is called, and every allocation
(GFP_NOFS or GFP_NOIO) needed to carry out the writeback operation is
expected to be satisfied. If we let GFP_NOFS and GFP_NOIO allocations
fail, rather than selecting the next OOM victim via the OOM killer when
the page allocator declares OOM, we will lose data that was supposed to
be flushed asynchronously.
Who is happy with buffered writes that discard data (and cause
filesystem errors such as remounting read-only, followed by killing
almost all processes, much like SysRq-i, because userspace programs can
no longer write to the filesystem) simply because the system happened to
be OOM at that moment? Basically, any allocation (GFP_NOFS or GFP_NOIO)
needed for a writeback operation is effectively __GFP_NOFAIL, because
failing to flush data should not occur unless a power failure, kernel
panic, kernel oops, or hardware trouble intervenes. I hate failing to
flush data simply because the system was OOM at that moment; selecting
the next OOM victim would kill fewer processes than the consequences of
filesystem errors would. I expect this patch merely to stop the bleeding
once we have started using memory reserves, nothing more. We will still
need to solve the OOM-livelock situation that arises after we start
using memory reserves, by killing more processes via the OOM killer.