On Fri, Oct 07, 2011 at 10:22:01PM +0800, Jan Kara wrote:
> On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> > > Great, thanks for review! I'll resend the two patches to Christoph so
> > > that he can try them.
> >
> > Jan, I'd like to test out your updated patches with my stupid dd
> > workloads. Would you (re)send them publicly?
> Ah, I resent them publicly on Wednesday
> (http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
> apparently does not include emails from Acked-by into list of recipients so
> you didn't get them. Sorry for that. The patches are attached for your
> convenience.

OK thanks. I only checked the linux-fsdevel list before asking..
The results should be ready tomorrow.

Thanks,
Fengguang

> From a042c2a839ad3cf89d8ee158b2bb4b94b573f578 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@xxxxxxx>
> Date: Thu, 8 Sep 2011 01:05:25 +0200
> Subject: [PATCH 1/2] writeback: Improve busyloop prevention
>
> Writeback of an inode can be stalled by things like internal fs locks being
> held. So in case we didn't write anything during a pass through b_io list,
> just wait for a moment and try again.
>
> CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Reviewed-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  fs/fs-writeback.c | 26 ++++++++++++++------------
>  1 files changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 04cf3b9..bdeb26a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -699,8 +699,8 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
>  	unsigned long oldest_jif;
> -	struct inode *inode;
>  	long progress;
> +	long pause = 1;
>
>  	oldest_jif = jiffies;
>  	work->older_than_this = &oldest_jif;
> @@ -755,25 +755,27 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * mean the overall work is done. So we keep looping as long
>  		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (progress)
> +		if (progress) {
> +			pause = 1;
>  			continue;
> +		}
>  		/*
>  		 * No more inodes for IO, bail
>  		 */
>  		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
> -		 * Nothing written. Wait for some inode to
> -		 * become available for writeback. Otherwise
> -		 * we'll just busyloop.
> +		 * Nothing written (some internal fs locks were unavailable or
> +		 * inode was under writeback from balance_dirty_pages() or
> +		 * similar conditions). Wait for a while to avoid busylooping.
>  		 */
> -		if (!list_empty(&wb->b_more_io)) {
> -			trace_writeback_wait(wb->bdi, work);
> -			inode = wb_inode(wb->b_more_io.prev);
> -			spin_lock(&inode->i_lock);
> -			inode_wait_for_writeback(inode, wb);
> -			spin_unlock(&inode->i_lock);
> -		}
> +		trace_writeback_wait(wb->bdi, work);
> +		spin_unlock(&wb->list_lock);
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +		schedule_timeout(pause);
> +		if (pause < HZ / 10)
> +			pause <<= 1;
> +		spin_lock(&wb->list_lock);
>  	}
>  	spin_unlock(&wb->list_lock);
>
> --
> 1.7.1
>

> From 0a4a2cb4d5432f5446215b1e6e44f7d83032dba3 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@xxxxxxx>
> Date: Thu, 8 Sep 2011 01:46:42 +0200
> Subject: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
>
> Calling redirty_tail() can put off inode writeback for upto 30 seconds (or
> whatever dirty_expire_centisecs is). This is unnecessarily big delay in some
> cases and in other cases it is a really bad thing. In particular XFS tries to
> be nice to writeback and when ->write_inode is called for an inode with locked
> ilock, it just redirties the inode and returns EAGAIN. That currently causes
> writeback_single_inode() to redirty_tail() the inode. As contended ilock is
> common thing with XFS while extending files the result can be that inode
> writeout is put off for a really long time.
>
> Now that we have more robust busyloop prevention in wb_writeback() we can
> call requeue_io() in cases where quick retry is required without fear of
> raising CPU consumption too much.
>
> CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Acked-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  fs/fs-writeback.c | 61 ++++++++++++++++++++++++----------------------------
>  1 files changed, 28 insertions(+), 33 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index bdeb26a..c786023 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	long nr_to_write = wbc->nr_to_write;
>  	unsigned dirty;
>  	int ret;
> +	bool inode_written = false;
>
>  	assert_spin_locked(&wb->list_lock);
>  	assert_spin_locked(&inode->i_lock);
> @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>  		int err = write_inode(inode, wbc);
> +		if (!err)
> +			inode_written = true;
>  		if (ret == 0)
>  			ret = err;
>  	}
> @@ -430,42 +433,39 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	if (!(inode->i_state & I_FREEING)) {
>  		/*
>  		 * Sync livelock prevention. Each inode is tagged and synced in
> -		 * one shot. If still dirty, it will be redirty_tail()'ed below.
> -		 * Update the dirty time to prevent enqueue and sync it again.
> +		 * one shot. If still dirty, update dirty time and put it back
> +		 * to dirty list to prevent enqueue and syncing it again.
>  		 */
>  		if ((inode->i_state & I_DIRTY) &&
> -		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
> +		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
>  			inode->dirtied_when = jiffies;
> -
> -		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +			redirty_tail(inode, wb);
> +		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
> -			 * We didn't write back all the pages. nfs_writepages()
> -			 * sometimes bales out without doing anything.
> +			 * We didn't write back all the pages. nfs_writepages()
> +			 * sometimes bales out without doing anything or we
> +			 * just run our of our writeback slice.
>  			 */
>  			inode->i_state |= I_DIRTY_PAGES;
> -			if (wbc->nr_to_write <= 0) {
> -				/*
> -				 * slice used up: queue for next turn
> -				 */
> -				requeue_io(inode, wb);
> -			} else {
> -				/*
> -				 * Writeback blocked by something other than
> -				 * congestion. Delay the inode for some time to
> -				 * avoid spinning on the CPU (100% iowait)
> -				 * retrying writeback of the dirty page/inode
> -				 * that cannot be performed immediately.
> -				 */
> -				redirty_tail(inode, wb);
> -			}
> +			requeue_io(inode, wb);
>  		} else if (inode->i_state & I_DIRTY) {
>  			/*
>  			 * Filesystems can dirty the inode during writeback
>  			 * operations, such as delayed allocation during
>  			 * submission or metadata updates after data IO
> -			 * completion.
> +			 * completion. Also inode could have been dirtied by
> +			 * some process aggressively touching metadata.
> +			 * Finally, filesystem could just fail to write the
> +			 * inode for some reason. We have to distinguish the
> +			 * last case from the previous ones - in the last case
> +			 * we want to give the inode quick retry, in the
> +			 * other cases we want to put it back to the dirty list
> +			 * to avoid livelocking of writeback.
>  			 */
> -			redirty_tail(inode, wb);
> +			if (inode_written)
> +				redirty_tail(inode, wb);
> +			else
> +				requeue_io(inode, wb);
>  		} else {
>  			/*
>  			 * The inode is clean. At this point we either have
> @@ -583,10 +583,10 @@ static long writeback_sb_inodes(struct super_block *sb,
>  			wrote++;
>  		if (wbc.pages_skipped) {
>  			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers. Skip this inode for now.
> +			 * Writeback is not making progress due to unavailable
> +			 * fs locks or similar condition. Retry in next round.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);
> @@ -618,12 +618,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  		struct super_block *sb = inode->i_sb;
>
>  		if (!grab_super_passive(sb)) {
> -			/*
> -			 * grab_super_passive() may fail consistently due to
> -			 * s_umount being grabbed by someone else. Don't use
> -			 * requeue_io() to avoid busy retrying the inode/sb.
> -			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  			continue;
>  		}
>  		wrote += writeback_sb_inodes(sb, wb, work);
> --
> 1.7.1
>
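
For anyone reading along, the pause handling that patch 1/2 adds to wb_writeback() can be tried out in userspace. What follows is only a rough sketch of the arithmetic, not kernel code: SKETCH_HZ, next_pause() and total_stall() are made-up names for illustration, and HZ=1000 is an assumption (the real value depends on the kernel config).

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's HZ (timer ticks per second). */
#define SKETCH_HZ 1000L

/*
 * Pause (in ticks) to use after one writeback pass.
 * progress != 0: something was written, so fall back to the minimum pause.
 * progress == 0: nothing was written, so double the pause unless it has
 * already reached HZ / 10 (same test as "if (pause < HZ / 10)" in the patch).
 */
static long next_pause(long pause, long progress)
{
	assert(pause >= 1);
	if (progress)
		return 1;
	if (pause < SKETCH_HZ / 10)
		pause <<= 1;
	return pause;
}

/* Total ticks slept over n consecutive passes that made no progress. */
static long total_stall(int n)
{
	long pause = 1, total = 0;
	int i;

	for (i = 0; i < n; i++) {
		total += pause;			/* schedule_timeout(pause) in the patch */
		pause = next_pause(pause, 0);	/* then grow the pause */
	}
	return total;
}
```

Under that HZ=1000 assumption the pause goes 1, 2, 4, ..., 64, 128 ticks and then stays at 128: the doubling test is made before the shift, so the pause settles on the first power of two at or above HZ / 10 rather than on HZ / 10 itself. Any pass that makes progress resets it to 1.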