On Fri, Aug 01, 2014 at 12:00:53AM +0200, Jan Kara wrote: > Switch inode dirty tracking lists to be per superblock instead of per > bdi. This is a major step towards filesystems being able to do their > own dirty tracking and selection of inodes for writeback if they desire > so (e.g. because they journal or COW data and need to writeback inodes > & pages in a specific order unknown to generic writeback code). > > Per superblock dirty lists also make selecting inodes for writeback > somewhat simpler because we don't have to search for inodes from a > particular superblock for some kinds of writeback (OTOH we pay for this > by having to iterate through superblocks for all-bdi type of writeback) > and this simplification will allow for an easier switch to a better > scaling data structure for dirty inodes. I think the WB_STATE_STALLED code is buggy w.r.t. unmount. > @@ -672,6 +670,15 @@ static long writeback_inodes(struct bdi_writeback *wb, > break; > } > } > + > + /* > + * In case we made no progress in current IO batch and there are no > + * inodes postponed for further writeback, set WB_STATE_STALLED > + * so that flusher doesn't busyloop in case no dirty inodes can be > + * written. > + */ > + if (!wrote && list_empty(&wb->b_more_io)) > + wb->state |= WB_STATE_STALLED; > spin_unlock(&wb->list_lock); Last background writeback ends with WB_STATE_STALLED. > @@ -771,26 +778,47 @@ static long bdi_writeback(struct backing_dev_info *bdi, > } else if (work->for_background) > oldest_jif = jiffies; > > + /* > + * If we made some progress, clear stalled state to retry other > + * writeback queues as well. > + */ > + if (progress) { > + spin_lock_bh(&bdi->wb_lock); > + list_for_each_entry(wb, &bdi->wq_list, bdi_list) { > + wb->state &= ~WB_STATE_STALLED; > + } > + spin_unlock_bh(&bdi->wb_lock); > + } First time through we clear the stalled state by walking &bdi->wq_list, but.... > + > + if (work->sb) { > + wb = &work->sb->s_dirty_inodes; > + if (wb->state & WB_STATE_STALLED) > + wb = NULL; if the sb state is stalled we don't do writeback, and .... > @@ -1015,6 +1017,13 @@ void kill_block_super(struct super_block *sb) > struct block_device *bdev = sb->s_bdev; > fmode_t mode = sb->s_mode; > > + /* > + * Unregister superblock from periodic writeback. There may be > + * writeback still running for it but we call sync_filesystem() later > + * and that will execute only after any background writeback is stopped. > + * This guarantees flusher won't touch sb that's going away. > + */ > + bdi_writeback_queue_unregister(&sb->s_dirty_inodes); > bdev->bd_super = NULL; > generic_shutdown_super(sb); We unregister the writeback queue from the BDI before unmount runs sync_filesystem() from geneic_shutdown_super(sb), and .... > +/* > + * Unregister writeback queue from BDI. No further background writeback will be > + * started against this superblock. However note that there may be writeback > + * still running for the sb. > + */ > +void bdi_writeback_queue_unregister(struct bdi_writeback *wb_queue) > +{ > + struct backing_dev_info *bdi = wb_bdi(wb_queue); > + > + /* Make sure flusher cannot find the superblock any longer */ > + spin_lock_bh(&bdi->wb_lock); > + list_del_init(&wb_queue->bdi_list); > + spin_unlock_bh(&bdi->wb_lock); > } Unregistering the BDI removes it from the BDI list and hence bdi_writeback will never clear the WB_STATE_STALLED bit on superblocks trying to do writeback in unmount. I'm not sure I really like this code very much - it seems to be muchmore complex than it needs to be because writeback is still managed on a per-bdi basis and the sb iteration is pretty clunky. If we are moving to per-sb inode tracking, we should also move all the writeback management to per-sb as well. IMO, there's no good reason for keeping flusher threads per-bdi and then having to iterate per-sb just to do background/periodic writeback, and then have special cases for sb specific writeback that avoids the bdi per-sb looping. i.e. per-sb flush work executed by a bdi flusher thread makes a lot more sense than per-bdi flush work that iterates superblocks. So for the moment, I think this patch makes things worse rather than better. I'd much prefer to see a single series that moves from per-bdi tracking/writeback to per-sb tracking/writeback than to split the tracking/writeback changes and then have to support an weird, temporary, intermediate code base like this... Ignoring that, the hack below makes this patch work for me. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx writeback: clear WB_STATE_STALLED for sb specific writeback From: Dave Chinner <dchinner@xxxxxxxxxx> During unmount, the superblock has been removed from the bdi writeback list, and so never has the WB_STATE_STALLED flag cleared before writeback is attempted. hence it never does writeback because it sees this flag. Fix this by unconditionally clearing the flag if work->sb is set rather than iterating the bdi.... Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> --- fs/fs-writeback.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index e80d1b9..6d9cd0c 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -780,12 +780,20 @@ static long bdi_writeback(struct backing_dev_info *bdi, /* * If we made some progress, clear stalled state to retry other - * writeback queues as well. + * writeback queues as well. Note that unmount can remove the + * wbqueue from the bdi before we get here, in which case we'll + * be flushing a specific superblock and hence we have to + * specifically clear the superblock stalled state. */ if (progress) { spin_lock_bh(&bdi->wb_lock); - list_for_each_entry(wb, &bdi->wq_list, bdi_list) { + if (work->sb) { + wb = &work->sb->s_dirty_inodes; wb->state &= ~WB_STATE_STALLED; + } else { + list_for_each_entry(wb, &bdi->wq_list, bdi_list) { + wb->state &= ~WB_STATE_STALLED; + } } spin_unlock_bh(&bdi->wb_lock); } -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html