On Sat 15-10-11 00:28:07, Wu Fengguang wrote:
> On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> > On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > > +	long pause = 1;
> > > > > > +	long max_pause = dirty_writeback_interval ?
> > > > > > +			 msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > > +			 HZ;
> > > > >
> > > > > It's better not to put the flusher to sleep for more than 10ms, so
> > > > > that when the condition changes, we don't risk making the storage
> > > > > idle for too long.
> > > >
> > > > Yeah, the one big regression case
> > > >
> > > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+
> > > > ------------------------  ------------------------
> > > >                    47.07       -15.5%       39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > >
> > > > is exactly caused by the large sleep: the attached graphs show one
> > > > period of no progress in the number of written pages.
> > > Thanks for the tests! Interesting. Do you have a trace file from that
> > > run? I see the writeback stalled for 20s or so, which is more than
> > > dirty_writeback_centisecs, so I think something more complicated must
> > > have happened.
> >
> > I noticed that
> >
> > 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
> >    the dd tasks are hard blocked in balance_dirty_pages().
> >
> >    flush-8:0-1170 [004] 211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> >
> > 2) the flusher thread is not woken up because we test writeback_in_progress()
> >    in balance_dirty_pages().
> >
> >    if (unlikely(!writeback_in_progress(bdi)))
> >            bdi_start_background_writeback(bdi);
> >
> > Thus the flusher thread waits and waits, as in the trace below.
> >
> > flush-8:0-1170 [004] 211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > flush-8:0-1170 [004] 211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > flush-8:0-1170 [004] 211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
> > flush-8:0-1170 [004] 211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> >
> > flush-8:0-1170 [004] 213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > flush-8:0-1170 [004] 213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > flush-8:0-1170 [004] 213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
> > flush-8:0-1170 [004] 213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> >
> > flush-8:0-1170 [004] 217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > flush-8:0-1170 [004] 217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > flush-8:0-1170 [004] 217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
> > flush-8:0-1170 [004] 217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>
> It's still puzzling why dirty pages remain at 286 and do not get
> cleaned by either of the flusher threads (local XFS and NFSROOT) for
> so long..

I was looking at this as well. So the reason why pages were not cleaned
by the flusher thread is that there were 2 dirty inodes, and the inode
with dirty pages had dirtied_when newer than the time when we started
this background writeback. Thus the running background writeback work
always included only the other inode, which had no dirty pages but had
I_DIRTY_SYNC set. Apparently XFS is stubborn and refuses to write that
inode although we try rather hard. That is probably because the dd
writing to it is stuck in balance_dirty_pages() while holding the ilock
- which is a bit unfortunate behavior, but what can we do...
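To see why the freshly dirtied inode keeps being excluded, it helps to
look at where the cut-off happens: the expired-inode scan in
fs/fs-writeback.c. The snippet below is a simplified sketch of the
3.1-era move_expired_inodes() logic (per-superblock grouping and the
return value are omitted), an illustration rather than the verbatim
kernel code:

/*
 * Sketch: b_dirty is kept oldest-first, so the scan stops at the
 * first inode whose dirtied_when is after *older_than_this.  An
 * inode redirtied after the background work sampled its timestamp
 * is therefore never moved to b_io and never written.
 */
static void move_expired_inodes(struct list_head *delaying_queue,
				struct list_head *dispatch_queue,
				unsigned long *older_than_this)
{
	struct inode *inode;

	while (!list_empty(delaying_queue)) {
		inode = wb_inode(delaying_queue->prev);
		if (older_than_this &&
		    inode_dirtied_after(inode, *older_than_this))
			break;	/* too fresh - and so is everything newer */
		list_move(&inode->i_wb_list, dispatch_queue);
	}
}

Since the list is ordered oldest-first, an inode redirtied after the
work's timestamp was taken stays parked on b_dirty for as long as the
work keeps running - exactly the situation in the trace above.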
I think the patch you suggest in the other email does not fix the above
scenario (although it is useful for reducing latency, so I'll include it
- thanks for it!). Probably you were just lucky enough not to hit it in
your next run.

What I'd suggest instead is to refresh oldest_jif in wb_writeback() when
we do not make any progress with writeback. That way freshly dirtied
inodes are allowed to be queued when we cannot make progress with the
current set of inodes. The resulting patch is attached.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
>From b85b7cdaf5fafec2d850304c1d0b1813c1c122a3 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Thu, 8 Sep 2011 01:05:25 +0200
Subject: [PATCH 1/2] writeback: Improve busyloop prevention

Writeback of an inode can be stalled by things like internal fs locks being
held. So in case we didn't write anything during a pass through the b_io
list, just wait for a moment and try again. Also allow newly dirtied inodes
to be queued during the retry. When retrying is fruitless for a long time,
or when we have some other work to do, we just stop the current work to
avoid blocking the flusher thread.

CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Reviewed-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
Signed-off-by: Jan Kara <jack@xxxxxxx>
---
 fs/fs-writeback.c |   61 +++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..6e909a9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	struct inode *inode;
 	long progress;
+	long pause = 1;
+	long max_pause = dirty_writeback_interval ?
+			 msecs_to_jiffies(dirty_writeback_interval * 10) :
+			 HZ;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -730,11 +733,19 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		/*
+		 * Refresh oldest timestamp. In kupdate style we include
+		 * freshly expired inodes in each round. For other types of
+		 * writeback we don't want to include newly dirtied inodes
+		 * to avoid livelocking. But in case we made no progress in
+		 * the last writeback round, allow new inodes to be queued
+		 * so that we don't block flusher thread unnecessarily.
+		 */
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			work->older_than_this = &oldest_jif;
-		}
+		} else if (pause > 1)
+			oldest_jif = jiffies;
 
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
@@ -755,25 +766,49 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			pause = 1;
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
 		/*
-		 * Nothing written. Wait for some inode to
-		 * become available for writeback. Otherwise
-		 * we'll just busyloop.
+		 * Nothing written (some internal fs locks were unavailable or
+		 * inode was under writeback from balance_dirty_pages() or
+		 * similar conditions).
+		 *
+		 * Wait for a while to avoid busylooping unless we waited for
+		 * so long it does not make sense to retry anymore.
 		 */
-		if (!list_empty(&wb->b_more_io)) {
-			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
+		if (pause > max_pause)
+			break;
+		/*
+		 * Set state here to prevent races with someone waking us up
+		 * (because new work is queued or because background limit is
+		 * exceeded).
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* If there's some other work to do, proceed with it... */
+		if (!list_empty(&wb->bdi->work_list) ||
+		    (!work->for_background && over_bground_thresh())) {
+			__set_current_state(TASK_RUNNING);
+			break;
 		}
+		trace_writeback_wait(wb->bdi, work);
+		spin_unlock(&wb->list_lock);
+		/*
+		 * Clear writeback_running so that we properly indicate that
+		 * writeback is currently stalled
+		 */
+		clear_bit(BDI_writeback_running, &wb->bdi->state);
+		schedule_timeout(pause);
+		set_bit(BDI_writeback_running, &wb->bdi->state);
+		if (pause < max_pause)
+			pause <<= 1;
+		spin_lock(&wb->list_lock);
 	}
 	spin_unlock(&wb->list_lock);
 
-- 
1.7.1
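
As a sanity check of the timing the patch implies: the retry is a capped
exponential backoff. Below is a small standalone C sketch of just that
policy - dirty_writeback_interval and HZ are hardcoded to the common
defaults of 500 centisecs and 1000, which is an assumption made for
illustration, not a value taken from the traces above:

#include <stdio.h>

int main(void)
{
	/* assumed defaults: dirty_writeback_interval=500 centisecs, HZ=1000 */
	const long hz = 1000;
	const long dirty_writeback_interval = 500;
	/* mirrors: max_pause = msecs_to_jiffies(dirty_writeback_interval * 10) */
	long max_pause = dirty_writeback_interval * 10 * hz / 1000;
	long pause, total = 0;

	/*
	 * Same pause sequence as wb_writeback() above: sleep, then double
	 * the pause, and give up once the next pause would exceed max_pause.
	 */
	for (pause = 1; pause <= max_pause; pause <<= 1) {
		total += pause;	/* stands in for schedule_timeout(pause) */
		printf("pause=%ld jiffies, slept so far=%ld\n", pause, total);
	}
	printf("giving up: slept %ld jiffies total (max_pause=%ld)\n",
	       total, max_pause);
	return 0;
}

With these numbers the flusher sleeps 1+2+...+4096 = 8191 jiffies (about
8.2s) before bailing out, so the total fruitless wait is bounded by a bit
under 2 * max_pause, after which wb_writeback() finishes the work item and
the normal periodic wakeups take over.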