Re: [PATCH 14/14] writeback: Per-sb dirty tracking

On Wed 06-08-14 09:44:16, Dave Chinner wrote:
> On Fri, Aug 01, 2014 at 12:00:53AM +0200, Jan Kara wrote:
> > Switch inode dirty tracking lists to be per superblock instead of per
> > bdi. This is a major step towards filesystems being able to do their
> > own dirty tracking and selection of inodes for writeback if they desire
> > so (e.g. because they journal or COW data and need to writeback inodes
> > & pages in a specific order unknown to generic writeback code).
> > 
> > Per superblock dirty lists also make selecting inodes for writeback
> > somewhat simpler because we don't have to search for inodes from a
> > particular superblock for some kinds of writeback (OTOH we pay for this
> > by having to iterate through superblocks for all-bdi type of writeback)
> > and this simplification will allow for an easier switch to a better
> > scaling data structure for dirty inodes.
> 
> I think the WB_STATE_STALLED code is buggy w.r.t. unmount.
> 
> > @@ -672,6 +670,15 @@ static long writeback_inodes(struct bdi_writeback *wb,
> >  				break;
> >  		}
> >  	}
> > +
> > +	/*
> > +	 * In case we made no progress in current IO batch and there are no
> > +	 * inodes postponed for further writeback, set WB_STATE_STALLED
> > +	 * so that flusher doesn't busyloop in case no dirty inodes can be
> > +	 * written.
> > +	 */
> > +	if (!wrote && list_empty(&wb->b_more_io))
> > +		wb->state |= WB_STATE_STALLED;
> >  	spin_unlock(&wb->list_lock);
> 
> Last background writeback ends with WB_STATE_STALLED.
> 
> > @@ -771,26 +778,47 @@ static long bdi_writeback(struct backing_dev_info *bdi,
> >  		} else if (work->for_background)
> >  			oldest_jif = jiffies;
> >  
> > +		/*
> > +		 * If we made some progress, clear stalled state to retry other
> > +		 * writeback queues as well.
> > +		 */
> > +		if (progress) {
> > +			spin_lock_bh(&bdi->wb_lock);
> > +			list_for_each_entry(wb, &bdi->wq_list, bdi_list) {
> > +				wb->state &= ~WB_STATE_STALLED;
> > +			}
> > +			spin_unlock_bh(&bdi->wb_lock);
> > +		}
> 
> First time through we clear the stalled state by walking
> &bdi->wq_list, but....
> 
> > +
> > +		if (work->sb) {
> > +			wb = &work->sb->s_dirty_inodes;
> > +			if (wb->state & WB_STATE_STALLED)
> > +				wb = NULL;
> 
> if the sb state is stalled we don't do writeback, and ....
> 
> > @@ -1015,6 +1017,13 @@ void kill_block_super(struct super_block *sb)
> >  	struct block_device *bdev = sb->s_bdev;
> >  	fmode_t mode = sb->s_mode;
> >  
> > +	/*
> > +	 * Unregister superblock from periodic writeback. There may be
> > +	 * writeback still running for it but we call sync_filesystem() later
> > +	 * and that will execute only after any background writeback is stopped.
> > +	 * This guarantees flusher won't touch sb that's going away.
> > +	 */
> > +	bdi_writeback_queue_unregister(&sb->s_dirty_inodes);
> >  	bdev->bd_super = NULL;
> >  	generic_shutdown_super(sb);
> 
> We unregister the writeback queue from the BDI before unmount runs
> sync_filesystem() from generic_shutdown_super(sb), and ....
> 
> > +/*
> > + * Unregister writeback queue from BDI. No further background writeback will be
> > + * started against this superblock. However note that there may be writeback
> > + * still running for the sb.
> > + */
> > +void bdi_writeback_queue_unregister(struct bdi_writeback *wb_queue)
> > +{
> > +	struct backing_dev_info *bdi = wb_bdi(wb_queue);
> > +
> > +	/* Make sure flusher cannot find the superblock any longer */
> > +	spin_lock_bh(&bdi->wb_lock);
> > +	list_del_init(&wb_queue->bdi_list);
> > +	spin_unlock_bh(&bdi->wb_lock);
> >  }
> 
> Unregistering the BDI removes it from the BDI list and hence
> bdi_writeback will never clear the WB_STATE_STALLED bit on
> superblocks trying to do writeback in unmount.
  Ah, well spotted!
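
One possible way out (an untested sketch against the structures in this
patch; whether for_background and for_kupdate are the right work flags to
key off is my guess) would be to honour the stalled state only for
background-style work, so that explicitly queued per-sb writeback - like
the sync_filesystem() call during unmount - cannot be skipped because of a
WB_STATE_STALLED bit that nobody can clear once the queue has been
unregistered:

	if (work->sb) {
		wb = &work->sb->s_dirty_inodes;
		/*
		 * Only background / periodic work backs off when the queue
		 * is stalled. Explicit per-sb work must still run, otherwise
		 * a WB_STATE_STALLED left over from the last background pass
		 * blocks writeback during unmount.
		 */
		if ((wb->state & WB_STATE_STALLED) &&
		    (work->for_background || work->for_kupdate))
			wb = NULL;
	}

Alternatively bdi_writeback_queue_unregister() could clear WB_STATE_STALLED
when it drops the queue from bdi->wq_list, but that would still race with a
background pass that is already running and sets the bit again afterwards.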
 
> I'm not sure I really like this code very much - it seems to be
> much more complex than it needs to be because writeback is still
> managed on a per-bdi basis and the sb iteration is pretty clunky.
> If we are moving to per-sb inode tracking, we should also move all
> the writeback management to per-sb as well.
> 
> IMO, there's no good reason for keeping flusher threads per-bdi and
> then having to iterate per-sb just to do background/periodic
> writeback, and then have special cases for sb specific writeback
> that avoids the bdi per-sb looping. i.e. per-sb flush work executed
> by a bdi flusher thread makes a lot more sense than per-bdi
> flush work that iterates superblocks.
> 
> So for the moment, I think this patch makes things worse rather than
> better. I'd much prefer to see a single series that moves from per-bdi
> tracking/writeback to per-sb tracking/writeback than to split the
> tracking/writeback changes and then have to support a weird,
> temporary, intermediate code base like this...
  So when writing this series I was considering both possibilities, i.e.
keeping per-bdi flusher threads or switching to per-sb threads. In the end
I decided to start with keeping per-bdi threads and see how things work
out.

Regarding per-sb threads I have two unresolved questions:

1) How to handle block device inodes - when a filesystem is mounted on a
block device, it is natural to write back the block device inode together
with the rest of that filesystem's inodes. When no filesystem is mounted
on the block device, there is no superblock to attach the inode to, so we
would need some flusher for the virtual blkdev superblock that handles the
specifics of block device inodes.

2) How to deal with multiple superblocks per device - here I'm convinced
we must not regress writeback performance for the case where a disk is
split into several partitions, and I think that will require some kind of
synchronization between per-sb flusher threads working against the same
device (a rough sketch of what I mean is below).
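
To make 2) a bit more concrete, what I have in mind is roughly the
following (purely illustrative, all names are made up): the IO-submitting
part of per-sb flush work would take a per-device lock so that flushers
for different partitions of one disk do not interleave their writeback and
seek-thrash the spindle:

#include <linux/fs.h>
#include <linux/mutex.h>

/* Hypothetical glue object, one instance per backing device. */
struct bdi_flush_coord {
	struct mutex issue_lock;
};

/* Hypothetical helper: push one batch of dirty inodes of @sb to disk. */
static long writeback_inodes_sb_batch(struct super_block *sb, long nr_pages);

static long sb_flush_batch(struct super_block *sb, long nr_pages,
			   struct bdi_flush_coord *coord)
{
	long wrote;

	/*
	 * Serialize IO submission from superblocks sharing the device;
	 * selection of dirty inodes itself stays per-sb.
	 */
	mutex_lock(&coord->issue_lock);
	wrote = writeback_inodes_sb_batch(sb, nr_pages);
	mutex_unlock(&coord->issue_lock);
	return wrote;
}

Whether a simple mutex is enough or we need something fairness-aware is
exactly the part I don't have a good answer for yet.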

So overall I'm not convinced that per-sb threads will end up being simpler
than per-bdi threads. But we can try...

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR