On Thu 19-11-15 15:12:39, David Rientjes wrote: > On Wed, 18 Nov 2015, Michal Hocko wrote: > > > From: Michal Hocko <mhocko@xxxxxxxx> > > > > wait_iff_congested has been used to throttle allocator before it retried > > another round of direct reclaim to allow the writeback to make some > > progress and prevent reclaim from looping over dirty/writeback pages > > without making any progress. We used to do congestion_wait before > > 0e093d99763e ("writeback: do not sleep on the congestion queue if > > there are no congested BDIs or if significant congestion is not being > > encountered in the current zone") but that led to undesirable stalls > > and sleeping for the full timeout even when the BDI wasn't congested. > > Hence wait_iff_congested was used instead. But it seems that even > > wait_iff_congested doesn't work as expected. We might have a small file > > LRU list with all pages dirty/writeback and yet the bdi is not congested > > so this is just a cond_resched in the end and can end up triggering pre > > mature OOM. > > > > This patch replaces the unconditional wait_iff_congested by > > congestion_wait which is executed only if we _know_ that the last round > > of direct reclaim didn't make any progress and dirty+writeback pages are > > more than a half of the reclaimable pages on the zone which might be > > usable for our target allocation. This shouldn't reintroduce stalls > > fixed by 0e093d99763e because congestion_wait is called only when we > > are getting hopeless when sleeping is a better choice than OOM with many > > pages under IO. > > > > Why HZ/10 instead of HZ/50? My idea was to give the writeback more time. As we only wait when there is a lot of dirty/writeback data it shouldn't stall pointlessly. > > Signed-off-by: Michal Hocko <mhocko@xxxxxxxx> > > --- > > mm/page_alloc.c | 16 ++++++++++++++-- > > 1 file changed, 14 insertions(+), 2 deletions(-) > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 020c005c5bc0..e6271bc19e6a 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -3212,8 +3212,20 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > */ > > if (__zone_watermark_ok(zone, order, min_wmark_pages(zone), > > ac->high_zoneidx, alloc_flags, target)) { > > - /* Wait for some write requests to complete then retry */ > > - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50); > > + unsigned long writeback = zone_page_state(zone, NR_WRITEBACK), > > + dirty = zone_page_state(zone, NR_FILE_DIRTY); > > + > > + /* > > + * If we didn't make any progress and have a lot of > > + * dirty + writeback pages then we should wait for > > + * an IO to complete to slow down the reclaim and > > + * prevent from pre mature OOM > > + */ > > + if (!did_some_progress && 2*(writeback + dirty) > reclaimable) > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > The purpose of the heuristic seems logical, but I'm concerned about the > threshold for determining when to wait and when to just resched and retry > again. > > This triggers for environments without swap when > > 2 * (NR_WRITEBACK + NR_DIRTY) > (NR_ACTIVE_FILE + NR_INACTIVE_FILE + > NR_ISOLATED_FILE + NR_ISOLATED_ANON) > > [ The use of NR_ISOLATED_ANON in swapless is asked about in patch 1. ] > > How exactly was this chosen? Why not when the two sides equal each other? The idea was to stall when the to-be-flushed pages form at least half of the reclaimable memory which sounds like an easy concept to start with. This worked reasonably well in my OOM stress tests but I am opened to suggestions. Ideally this should be a function of the writeback speed and maybe we will get there one day but I would like to start with something simple which works most of the time. > > > + else > > + cond_resched(); > > + > > goto retry; > > } > > } Thanks -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>