Hi Ian,

On Tue, Mar 24, 2009 at 11:04:11PM +0800, Ian Kent wrote:
> Jeff Layton wrote:
> > On Tue, 24 Mar 2009 10:28:06 -0400
> > Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> >> On Tue, 24 Mar 2009 21:57:20 +0800
> >> Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> >>
> >>> Hi Jeff,
> >>>
> >>> On Mon, Mar 23, 2009 at 04:30:33PM -0400, Jeff Layton wrote:
> >>>> This may be a problem on other filesystems too, but the reproducer I
> >>>> have involves NFS.
> >>>>
> >>>> On NFS, the __mark_inode_dirty() call after writing back the inode is
> >>>> done in the rpc_release handler for COMMIT calls. This call is done
> >>>> asynchronously after the call completes.
> >>>>
> >>>> Because there's no real coordination between __mark_inode_dirty() and
> >>>> __sync_single_inode(), it's often the case that these two calls will
> >>>> race and __mark_inode_dirty() will get called while I_SYNC is still set.
> >>>> When this happens, __sync_single_inode() should detect that the inode
> >>>> was redirtied while we were flushing it and call redirty_tail() to put
> >>>> it back on the s_dirty list.
> >>>>
> >>>> When redirty_tail() puts it back on the list, it only resets the
> >>>> dirtied_when value if it's necessary to maintain the list order. Given
> >>>> the right situation (the right I/O patterns and a lot of luck), this
> >>>> could result in dirtied_when never getting updated on an inode that's
> >>>> constantly being redirtied while pdflush is writing it back.
> >>>>
> >>>> Since dirtied_when is based on jiffies, it's possible for it to persist
> >>>> across 2 sign-bit flips of jiffies. When that happens, the time_after()
> >>>> check in sync_sb_inodes no longer works correctly and writeouts by
> >>>> pdflush of this inode and any inodes after it on the list stop.
> >>>>
> >>>> This patch fixes this by resetting the dirtied_when value on an inode
> >>>> when we're adding it back onto an empty s_dirty list.
> >>>> Since we generally
> >>>> write inodes from oldest to newest dirtied_when values, this has the
> >>>> effect of making it so that these inodes don't end up with dirtied_when
> >>>> values that are frozen.
> >>>>
> >>>> I've also taken the liberty of fixing up the comments a bit and changed
> >>>> the !time_after_eq() check in redirty_tail to be time_before(). That
> >>>> should be functionally equivalent but I think it's more readable.
> >>>>
> >>>> I wish this were just a theoretical problem, but we've had a customer
> >>>> hit a variant of it in an older kernel. Newer upstream kernels have a
> >>>> number of changes that make this problem less likely. As best I can tell
> >>>> though, there is nothing that really prevents it.
> >>>>
> >>>> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> >>>> ---
> >>>>  fs/fs-writeback.c |   22 +++++++++++++++++-----
> >>>>  1 files changed, 17 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >>>> index e3fe991..bd2a7ff 100644
> >>>> --- a/fs/fs-writeback.c
> >>>> +++ b/fs/fs-writeback.c
> >>>> @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
> >>>>   * furthest end of its superblock's dirty-inode list.
> >>>>   *
> >>>>   * Before stamping the inode's ->dirtied_when, we check to see whether it is
> >>>> - * already the most-recently-dirtied inode on the s_dirty list. If that is
> >>>> - * the case then the inode must have been redirtied while it was being written
> >>>> - * out and we don't reset its dirtied_when.
> >>>> + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
> >>>> + * list. If that is the case then we don't need to restamp it to maintain the
> >>>> + * order of the list.
> >>>> + *
> >>>> + * If s_dirty is empty however, then we need to go ahead and update
> >>>> + * dirtied_when for the inode.
> >>>> + * Not doing so will mean that inodes that are
> >>>> + * constantly being redirtied can end up with "stuck" dirtied_when values if
> >>>> + * they happen to consistently be the first one to go back on the list.
> >>>> + *
> >>>> + * Since we're using jiffies values in that field, letting dirtied_when grow
> >>>> + * too old will be problematic if jiffies wraps. It may also be causing
> >>>> + * pdflush to flush the inode too often since it'll always look like it was
> >>>> + * dirtied a long time ago.
> >>>>   */
> >>>>  static void redirty_tail(struct inode *inode)
> >>>>  {
> >>>>  	struct super_block *sb = inode->i_sb;
> >>>>
> >>>> -	if (!list_empty(&sb->s_dirty)) {
> >>>> +	if (list_empty(&sb->s_dirty)) {
> >>>> +		inode->dirtied_when = jiffies;
> >>>> +	} else {
> >>>>  		struct inode *tail_inode;
> >>>>
> >>>>  		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
> >>>> -		if (!time_after_eq(inode->dirtied_when,
> >>>> +		if (time_before(inode->dirtied_when,
> >>>>  				tail_inode->dirtied_when))
> >>>>  			inode->dirtied_when = jiffies;
> >>>>  	}
> >>>
> >>> I'm afraid your patch is equivalent to the following one.
> >>> Because once the first inode's dirtied_when is set to jiffies,
> >>> in order to keep the list in order, the following ones (mostly)
> >>> will also be updated. A domino effect.
> >>>
> >>> Thanks,
> >>> Fengguang
> >>>
> >> Good point. One of our other engineers proposed a similar patch
> >> originally. I considered it but wasn't clear whether there could be a
> >> situation where unconditionally resetting dirtied_when would be a
> >> problem. Now that I think about it though, I think you're right...
> >>
> >> So maybe something like the patch below is the right thing to do? Or,
> >> maybe when we believe that the inode was fully cleaned and then
> >> redirtied, we'd just unconditionally stamp dirtied_when. Something like
> >> this maybe?
> >>
> >> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >> index bd2a7ff..596c96e 100644
> >> --- a/fs/fs-writeback.c
> >> +++ b/fs/fs-writeback.c
> >> @@ -364,7 +364,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
> >>  		 * Someone redirtied the inode while were writing back
> >>  		 * the pages.
> >>  		 */
> >> -		redirty_tail(inode);
> >> +		inode->dirtied_when = jiffies;
> >> +		list_move(&inode->i_list, &sb->s_dirty);
> >>  	} else if (atomic_read(&inode->i_count)) {
> >>  		/*
> >>  		 * The inode is clean, inuse
> >
> > Hmm...though it is still possible that you could consistently race in
> > such a way that after writepages(), I_DIRTY is never set but the
> > PAGECACHE_TAG_DIRTY is still set on the mapping. And then we'd be back
> > to the same problem of a stuck dirtied_when value.
> >
> > So maybe someone can explain to me why we take such great pains to
> > preserve the dirtied_when value when we're putting the inode back on
> > the tail of s_dirty? Why not just unconditionally reset it?
>
> I think that redirty_tail() is the best place for this as it is a
> central location where dirtied_when can be updated. Then all we have to
> worry about is making sure it is called from all the locations needed.
>
> I'm not sure that removing the comment is a good idea (the Wu Fengguang
> patch) but it probably needs to be revised to explain why dirtied_when
> is forcing a rewrite of the list entry times.

The comment basically says "we do not want to reset dirtied_when for
inodes redirtied while being written out". I guess there are two
intentions:

- to retry its writeback as soon as possible and avoid long 30s delays;
- to keep a faithful dirtied_when.

The first one is best effort anyway, so changing it should not create
new bugs. Nor should changing the second, due to the very limited use
of dirtied_when.

However, we do have another cheap solution that can retain both of the
two original intentions.
The main idea is to introduce a new s_more_io_wait queue, and convert
the current redirty_tail() calls to either requeue_io_wait() or some
completely_dirty_inode().

Thanks,
Fengguang
---
(a not-up-to-date patch)

writeback: introduce super_block.s_more_io_wait

Introduce super_block.s_more_io_wait to park inodes that for some reason
cannot be synced immediately. They will be revisited in the next s_io
enqueue time (<=5s).

The new data flow after this patchset:

	s_dirty --> s_io --> s_more_io/s_more_io_wait --+
	             ^                                  |
	             |                                  |
	             +----------------------------------+

- to fill s_io:
	s_more_io + s_dirty(expired) + s_more_io_wait ---> s_io
- to drain s_io:
	s_io -+--> clean inodes goto inode_in_use/inode_unused
	      |
	      +--> s_more_io
	      |
	      +--> s_more_io_wait

Obviously there're no ordering or starvation problems in the queues:
- s_dirty is now a strict FIFO queue
- inode.dirtied_when is only set when made dirty
- once expired, the dirty inode will stay in s_*io* queues until made clean
- the dirty inodes in s_*io* will be revisited in order, hence small files
  won't be starved by big dirty files.

Cc: David Chinner <dgc@xxxxxxx>
Cc: Michael Rubin <mrubin@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Signed-off-by: Fengguang Wu <wfg@xxxxxxxxxxxxxxxx>
---
 fs/fs-writeback.c  |   19 +++++++++++++++----
 fs/super.c         |    1 +
 include/linux/fs.h |    1 +
 3 files changed, 17 insertions(+), 4 deletions(-)

--- linux-mm.orig/fs/fs-writeback.c
+++ linux-mm/fs/fs-writeback.c
@@ -172,6 +172,14 @@ static void requeue_io(struct inode *ino
 	list_move(&inode->i_list, &inode->i_sb->s_more_io);
 }

+/*
+ * The inode should be retried after _sleeping_ for a while.
+ */
+static void requeue_io_wait(struct inode *inode)
+{
+	list_move(&inode->i_list, &inode->i_sb->s_more_io_wait);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -200,7 +208,8 @@ static void move_expired_inodes(struct l

 /*
  * Queue all expired dirty inodes for io, eldest first:
- * (entrance) => s_dirty inodes
+ * (entrance) => s_more_io_wait inodes
+ *            => s_dirty inodes
  *            => s_more_io inodes
  *            => remaining inodes in s_io => (dequeue for sync)
  */
@@ -209,13 +218,15 @@ static void queue_io(struct super_block
 {
 	list_splice_init(&sb->s_more_io, &sb->s_io);
 	move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+	list_splice_init(&sb->s_more_io_wait, &sb->s_io);
 }

 int sb_has_dirty_inodes(struct super_block *sb)
 {
-	return !list_empty(&sb->s_dirty) ||
-	       !list_empty(&sb->s_io) ||
-	       !list_empty(&sb->s_more_io);
+	return !list_empty(&sb->s_dirty) ||
+	       !list_empty(&sb->s_io) ||
+	       !list_empty(&sb->s_more_io) ||
+	       !list_empty(&sb->s_more_io_wait);
 }
 EXPORT_SYMBOL(sb_has_dirty_inodes);

--- linux-mm.orig/fs/super.c
+++ linux-mm/fs/super.c
@@ -64,6 +64,7 @@ static struct super_block *alloc_super(s
 	INIT_LIST_HEAD(&s->s_dirty);
 	INIT_LIST_HEAD(&s->s_io);
 	INIT_LIST_HEAD(&s->s_more_io);
+	INIT_LIST_HEAD(&s->s_more_io_wait);
 	INIT_LIST_HEAD(&s->s_files);
 	INIT_LIST_HEAD(&s->s_instances);
 	INIT_HLIST_HEAD(&s->s_anon);

--- linux-mm.orig/include/linux/fs.h
+++ linux-mm/include/linux/fs.h
@@ -1012,6 +1012,7 @@ struct super_block {
 	struct list_head	s_dirty;	/* dirty inodes */
 	struct list_head	s_io;		/* parked for writeback */
 	struct list_head	s_more_io;	/* parked for more writeback */
+	struct list_head	s_more_io_wait;	/* parked for sleep-then-retry */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_files;
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html