On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote:
> > > I guess this new patch is more problem oriented and acceptable:
> > >
> > > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > > +++ linux-next/mm/vmscan.c	2010-07-22 16:39:57.000000000 +0800
> > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis
> > >  			count_vm_events(PGDEACTIVATE, nr_active);
> > >
> > >  			nr_freed += shrink_page_list(&page_list, sc,
> > > -							PAGEOUT_IO_SYNC);
> > > +					priority < DEF_PRIORITY / 3 ?
> > > +					PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC);
> > >  		}
> > >
> > >  		nr_reclaimed += nr_freed;
> >
> > This one looks better:
> > ---
> > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
> >
> > Fix "system goes totally unresponsive with many dirty/writeback pages"
> > problem:
> >
> > 	http://lkml.org/lkml/2010/4/4/86
> >
> > The root cause is, wait_on_page_writeback() is called too early in the
> > direct reclaim path, which blocks many random/unrelated processes when
> > some slow (USB stick) writeback is on the way.
> >
>
> So, what's the bet if lumpy reclaim is a factor that it's
> high-order-but-low-cost such as fork() that are getting caught by this since
> [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC]
> was introduced?

Sorry I'm a bit confused by your wording..

> That could manifest to the user as stalls creating new processes when under
> heavy IO. I would be surprised it would freeze the entire system but certainly
> any new work would feel very slow.
>
> > A simple dd can easily create a big range of dirty pages in the LRU
> > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a
> > typical desktop, which triggers the lumpy reclaim mode and hence
> > wait_on_page_writeback().
> >
>
> which triggers the lumpy reclaim mode for high-order allocations.

Exactly. Changelog updated.

> lumpy reclaim mode is not something that is triggered just because priority
> is high.

Right.

> I think there is a second possibility for causing stalls as well that is
> unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may
> also result in stalls. If it is taking a long time to writeback dirty data,
> random processes could be getting stalled just because they happened to dirty
> data at the wrong time. This would be the case if the main dirtying process
> (e.g. dd) is not calling sync and dropping pages it's no longer using.

The dirty_limit throttling will slow down the dirtying process to the
writeback throughput. If a process is dirtying files on sda (HDD),
it will be throttled at 80MB/s. If another process is dirtying files
on sdb (USB 1.1), it will be throttled at 1MB/s.

So dirty throttling will slow things down. However the slowdown
should be smooth (a series of 100ms stalls instead of a sudden 10s
stall), and won't impact random processes (which do no read/write IO
at all).

> > In Andreas' case, 512MB/1024 = 512KB, this is way too low compared to
> > the 22MB writeback and 190MB dirty pages. There can easily be a
> > continuous range of 512KB dirty/writeback pages in the LRU, which will
> > trigger the wait logic.
> >
> > To make it worse, when there are 50MB writeback pages and USB 1.1 is
> > writing them at 1MB/s, wait_on_page_writeback() may be stuck for up to
> > 50 seconds.
> >
> > So only enter sync write&wait when priority goes below DEF_PRIORITY/3,
> > or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait
> > will hardly be triggered by pure dirty pages.
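
For anyone who wants to check the numbers in the changelog above, here
is a rough userspace sketch of the arithmetic. It assumes the simplified
model that one reclaim pass at a given priority scans lru_size >> priority,
and that DEF_PRIORITY is 12. The helper names (scan_window_kb,
scan_window_percent, worst_case_wait_secs) are made up for illustration
and are not kernel functions.

```c
#include <assert.h>

/* DEF_PRIORITY as defined in kernels of this era (assumed to be 12 here). */
#define DEF_PRIORITY 12

/*
 * Simplified model, not kernel code: a reclaim pass at the given
 * priority looks at roughly lru_kb >> priority of the LRU list.
 */
static unsigned long scan_window_kb(unsigned long lru_kb, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return lru_kb >> priority;
}

/* The same window expressed as a percentage of the whole LRU. */
static double scan_window_percent(int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return 100.0 / (double)(1UL << priority);
}

/*
 * Upper bound on how long a synchronous wait could take if every
 * writeback page has to reach the device first (hypothetical helper).
 */
static unsigned long worst_case_wait_secs(unsigned long writeback_mb,
					  unsigned long mb_per_sec)
{
	assert(mb_per_sec > 0);
	return writeback_mb / mb_per_sec;
}
```

With the figures quoted above, scan_window_kb(512UL * 1024, DEF_PRIORITY - 2)
gives 512 (KB), scan_window_percent(DEF_PRIORITY / 3) gives 6.25, and
worst_case_wait_secs(50, 1) gives 50.
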
> >
> > Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> > ---
> >  mm/vmscan.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > --- linux-next.orig/mm/vmscan.c	2010-07-22 16:36:58.000000000 +0800
> > +++ linux-next/mm/vmscan.c	2010-07-22 17:03:47.000000000 +0800
> > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis
> >  		 * but that should be acceptable to the caller
> >  		 */
> >  		if (nr_freed < nr_taken && !current_is_kswapd() &&
> > -		    sc->lumpy_reclaim_mode) {
> > +		    sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) {
> >  			congestion_wait(BLK_RW_ASYNC, HZ/10);
> >
>
> This will also delay waiting on congestion for really high-order
> allocations such as huge pages, some video decoder and the like which
> really should be stalling.

I absolutely agree that high order allocators should be somehow throttled.

However given that one can easily create a large _continuous_ range of
dirty LRU pages, letting someone bump all the way through the range
sounds a bit cruel..

> How about the following compile-tested diff?
> It takes the cost of the high-order allocation into account and the
> priority when deciding whether to synchronously wait or not.

Very nice patch. Thanks!

Cheers,
Fengguang

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c7e57c..d652e0c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1110,6 +1110,48 @@ static int too_many_isolated(struct zone *zone, int file,
>  }
>
>  /*
> + * Returns true if the caller should stall on congestion and retry to clean
> + * the list of pages synchronously.
> + *
> + * If we are direct reclaiming for contiguous pages and we do not reclaim
> + * everything in the list, try again and wait for IO to complete. This
> + * will stall high-order allocations but that should be acceptable to
> + * the caller
> + */
> +static inline bool should_reclaim_stall(unsigned long nr_taken,
> +					unsigned long nr_freed,
> +					int priority,
> +					struct scan_control *sc)
> +{
> +	int lumpy_stall_priority;
> +
> +	/* kswapd should not stall on sync IO */
> +	if (current_is_kswapd())
> +		return false;
> +
> +	/* Only stall on lumpy reclaim */
> +	if (!sc->lumpy_reclaim_mode)
> +		return false;
> +
> +	/* If we have reclaimed everything on the isolated list, no stall */
> +	if (nr_freed == nr_taken)
> +		return false;
> +
> +	/*
> +	 * For high-order allocations, there are two stall thresholds.
> +	 * High-cost allocations stall immediately whereas lower
> +	 * order allocations such as stacks require the scanning
> +	 * priority to be much higher before stalling
> +	 */
> +	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> +		lumpy_stall_priority = DEF_PRIORITY;
> +	else
> +		lumpy_stall_priority = DEF_PRIORITY / 3;
> +
> +	return priority <= lumpy_stall_priority;
> +}
> +
> +/*
>   * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
>   * of reclaimed pages
>   */
> @@ -1199,14 +1241,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
>  		nr_scanned += nr_scan;
>  		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
>
> -		/*
> -		 * If we are direct reclaiming for contiguous pages and we do
> -		 * not reclaim everything in the list, try again and wait
> -		 * for IO to complete. This will stall high-order allocations
> -		 * but that should be acceptable to the caller
> -		 */
> -		if (nr_freed < nr_taken && !current_is_kswapd() &&
> -					sc->lumpy_reclaim_mode) {
> +		/* Check if we should synchronously wait for writeback */
> +		if (should_reclaim_stall(nr_taken, nr_freed, priority, sc)) {
>  			congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>  			/*
>
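
To see which callers the patch above would stall, the decision can be
modelled outside the kernel. The following is only a sketch under
stated assumptions: DEF_PRIORITY is taken as 12 and
PAGE_ALLOC_COSTLY_ORDER as 3, struct reclaim_ctx is a made-up stand-in
for the two scan_control fields the helper reads, and the kswapd check
is passed in as a plain flag because current_is_kswapd() has no
userspace equivalent.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed kernel constants for this era. */
#define DEF_PRIORITY		12
#define PAGE_ALLOC_COSTLY_ORDER	3

/* Hypothetical stand-in for the scan_control fields used by the helper. */
struct reclaim_ctx {
	int order;			/* allocation order being reclaimed for */
	bool lumpy_reclaim_mode;	/* true when lumpy reclaim is active */
};

/*
 * Userspace model of should_reclaim_stall() from the diff above.
 * is_kswapd replaces the current_is_kswapd() call.
 */
static bool should_reclaim_stall_model(unsigned long nr_taken,
				       unsigned long nr_freed,
				       int priority,
				       bool is_kswapd,
				       const struct reclaim_ctx *ctx)
{
	int lumpy_stall_priority;

	assert(nr_freed <= nr_taken);

	/* kswapd should not stall on sync IO */
	if (is_kswapd)
		return false;

	/* Only stall on lumpy reclaim */
	if (!ctx->lumpy_reclaim_mode)
		return false;

	/* Everything on the isolated list was reclaimed: no stall */
	if (nr_freed == nr_taken)
		return false;

	/* Costly orders stall at any priority, cheaper ones only late */
	if (ctx->order > PAGE_ALLOC_COSTLY_ORDER)
		lumpy_stall_priority = DEF_PRIORITY;
	else
		lumpy_stall_priority = DEF_PRIORITY / 3;

	return priority <= lumpy_stall_priority;
}
```

In this model an order-1 allocation (e.g. a stack for fork()) at
priority 10 does not stall, while the same allocation at priority 4
does, and an order-9 allocation stalls already at priority 12. Note
that the comparison here is "<=", so the low-order threshold is one
priority level earlier than with the "priority < DEF_PRIORITY / 3" test
in the earlier patch.
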