Re: [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages

Michal Hocko <mhocko@xxxxxxxxxx> · Thu, 17 Mar 2016 13:01:27 +0100

On Thu 17-03-16 20:35:23, Tetsuo Handa wrote:
[...]
> But what I felt strange is what should_reclaim_retry() is doing.
> 
> Michal Hocko wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f77e283fb8c6..b2de8c8761ad 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3044,8 +3045,37 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> >  		 */
> >  		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> >  				ac->high_zoneidx, alloc_flags, available)) {
> > -			/* Wait for some write requests to complete then retry */
> > -			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > +			unsigned long writeback;
> > +			unsigned long dirty;
> > +
> > +			writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
> > +			dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
> > +
> > +			/*
> > +			 * If we didn't make any progress and have a lot of
> > +			 * dirty + writeback pages then we should wait for
> > +			 * an IO to complete to slow down the reclaim and
> > +			 * prevent from pre mature OOM
> > +			 */
> > +			if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> > +				congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +				return true;
> > +			}
> 
> writeback and dirty are used only when did_some_progress == 0. Thus, we don't
> need to calculate writeback and dirty using zone_page_state_snapshot() unless
> did_some_progress == 0.

OK, I will move this into if !did_some_progress.

> But, does it make sense to take writeback and dirty into account when
> disk_events_workfn (trace shown above) is doing GFP_NOIO allocation and
> wb_workfn (trace shown above) is doing (presumably) GFP_NOFS allocation?
> Shouldn't we use different threshold for GFP_NOIO / GFP_NOFS / GFP_KERNEL?

I have considered skiping the throttling part for GFP_NOFS/GFP_NOIO
previously but I couldn't have convinced myself it would make any
difference. We know there was no progress in the reclaim and even if the
current context is doing FS/IO allocation potentially then it obviously
cannot get its memory so it cannot proceed. So now we are in the state
where we either busy loop or sleep for a while. So I ended up not
complicating the code even more. If you have a use case where busy
waiting makes a difference then I would vote for a separate patch with a
clear description.

> > +
> > +			/*
> > +			 * Memory allocation/reclaim might be called from a WQ
> > +			 * context and the current implementation of the WQ
> > +			 * concurrency control doesn't recognize that
> > +			 * a particular WQ is congested if the worker thread is
> > +			 * looping without ever sleeping. Therefore we have to
> > +			 * do a short sleep here rather than calling
> > +			 * cond_resched().
> > +			 */
> > +			if (current->flags & PF_WQ_WORKER)
> > +				schedule_timeout(1);
> 
> This schedule_timeout(1) does not sleep. You lost the fix as of next-20160317.
> Please update.

Yeah, I have that updated in my local patch already.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>