Andrew Morton wrote:
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> >  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> >  	 * wakeup the thread for old dirty data writeback
> >  	 */
> > -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> > +	work = kzalloc(sizeof(*work),
> > +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> >  	if (!work) {
> >  		trace_writeback_nowork(wb);
> >  		wb_wakeup(wb);
>
> Oh geeze. fs/fs-writeback.c has grown waaay too many GFP_ATOMICs :(
>
> How does this actually all work? afaict if we fail this
> wb_writeback_work allocation, wb_workfn->wb_do_writeback will later say
> "hey, there are no work items!" and will do nothing at all. Or does
> wb_workfn() fall into write-1024-pages-anyway mode and if so, how did
> it know how to do that?
>
> If we had (say) a mempool of wb_writeback_work's (at least for
> wb_start_writeback), would that help anything? Or would writeback
> simply fail shortly afterwards for other reasons?

I tried http://lkml.kernel.org/r/20160318133417.GB30225@xxxxxxxxxxxxxx
which would reduce the number of wb_writeback_work allocations compared to
this patch, and I got
http://lkml.kernel.org/r/201603172035.CJH95337.SOJOFFFHMLOQVt@xxxxxxxxxxxxxxxxxxx
where wb_workfn() got stuck after all once we started using memory reserves.

Having a mempool for wb_writeback_work is not sufficient: there are
further allocations after wb_workfn() is called, and every allocation
(GFP_NOFS or GFP_NOIO) needed to carry out the writeback operation is
expected to be satisfied. If we let GFP_NOFS and GFP_NOIO allocations
fail, rather than selecting the next OOM victim via the OOM killer when
the page allocator declares OOM, we will lose data that was supposed to
be flushed asynchronously.
Who is happy with buffered writes that discard data (and cause
filesystem errors such as remounting read-only, followed by killing
almost all processes, much like SysRq-i, because userspace programs can
no longer write to the filesystem) simply because the system happened to
be OOM at that moment? Basically, any allocation (GFP_NOFS or GFP_NOIO)
needed for a writeback operation is effectively __GFP_NOFAIL, because
failing to flush data should not occur unless a power failure, kernel
panic, kernel oops, or hardware trouble intervenes. I hate failing to
flush data simply because the system was OOM at that moment; selecting
the next OOM victim would kill fewer processes than the consequences of
filesystem errors would. I expect this patch merely to stop the bleeding
once we have started using memory reserves, nothing more. We will still
need to solve the OOM-livelock situation that arises after we start
using memory reserves, by killing more processes via the OOM killer.