On Tue 12-01-16 13:35:38, Brian Foster wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> wait_sb_inodes() currently does a walk of all inodes in the
> filesystem to find dirty ones to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
>
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the sb when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
>
> Define a couple helpers to add/remove an inode from the writeback
> list and call them when the overall mapping is tagged for or cleared
> from writeback. Update wait_sb_inodes() to walk only the inodes
> under writeback due to the sync.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Signed-off-by: Josef Bacik <jbacik@xxxxxx>
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
...
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> +	 * the inode and then deref bdev->bd_disk, which at this point has been
> +	 * set to NULL, so we would panic. At the point we are dropping our
> +	 * bd_inode we won't have any pages under writeback on the device so
> +	 * this is safe. But just in case we'll assert to make sure we don't
> +	 * screw this up.
> +	 */
> +	if (!sb_is_blkdev_sb(inode->i_sb))

The condition and the comment can go away now. We no longer need to call
inode_to_bdi()...

> +		sb_clear_inode_writeback(inode);
> +	BUG_ON(!list_empty(&inode->i_wb_list));
>  }
>
>  /*
> @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	LIST_HEAD(sync_list);
>
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> +	/*
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out. In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * s_inode_wblist_lock protects the wb list and is irq-safe as it is
> +	 * acquired inside of the mapping lock by __test_set_page_writeback().
> +	 * We cannot acquire i_lock while the wblist lock is held without
> +	 * introducing irq inversion issues. Since s_inodes_wb is a subset of
> +	 * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
> +	 * until we have a reference. Note that s_inode_wblist_lock protects the
> +	 * local sync_list as well because inodes can be dropped from either
> +	 * list by writeback completion.
> +	 */
>  	mutex_lock(&sb->s_sync_lock);
> +
>  	spin_lock(&sb->s_inode_list_lock);

So I'm not very happy that we have to hold s_inode_list_lock here. After
all, reducing contention on that lock was one of the main motivations of
this patch. That being said, I think the locking is correct now, which is
a good start :), and we can work from here.

As a future improvement I think we could use RCU to grab an inode
reference safely: hold rcu_read_lock() instead of s_inode_list_lock. That
guarantees that the memory for the inode we have a reference to is not
freed, so we can then do:

	spin_unlock_irq(&sb->s_inode_wblist_lock);
	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW))
		avoid touching inode
	__iget(inode);
	spin_unlock(&inode->i_lock);
	rcu_read_unlock();

But I'd do this as a separate patch and run this through Al to verify...
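Fleshed out a little, the RCU variant of the walk could look something
like the sketch below. It is completely untested; I'm assuming the elided
wait in the loop body is filemap_fdatawait_keep_errors(), and I've left
out the put-back of inodes that are still tagged for writeback (discussed
further down), so treat it as a shape of the idea rather than the patch:

	rcu_read_lock();
	spin_lock_irq(&sb->s_inode_wblist_lock);
	list_splice_init(&sb->s_inodes_wb, &sync_list);
	while (!list_empty(&sync_list)) {
		struct inode *inode = list_first_entry(&sync_list,
						struct inode, i_wb_list);

		list_del_init(&inode->i_wb_list);
		spin_unlock_irq(&sb->s_inode_wblist_lock);

		/*
		 * rcu_read_lock() keeps the inode memory valid, so taking
		 * i_lock and checking whether the inode is still live is
		 * safe even without holding s_inode_list_lock.
		 */
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
			spin_unlock(&inode->i_lock);
			spin_lock_irq(&sb->s_inode_wblist_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();

		/* We hold a real reference now, so it is safe to block. */
		filemap_fdatawait_keep_errors(inode->i_mapping);
		cond_resched();
		iput(inode);

		rcu_read_lock();
		spin_lock_irq(&sb->s_inode_wblist_lock);
	}
	spin_unlock_irq(&sb->s_inode_wblist_lock);
	rcu_read_unlock();

Note that iput() can drop the last reference and sleep, which is why the
loop leaves the RCU read section before waiting and re-enters it only
when going back to the list.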
> +	spin_lock_irq(&sb->s_inode_wblist_lock);
> +	list_splice_init(&sb->s_inodes_wb, &sync_list);
>
> -	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> -	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>
> +		list_del_init(&inode->i_wb_list);

What if we just did:

	list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);

here instead of list_del_init()? That way we would always maintain the
invariant:

	PAGECACHE_TAG_WRITEBACK set iff !list_empty(&inode->i_wb_list)

under s_inode_wblist_lock, and thus you could just delete that list
shuffling below...

> +		/*
> +		 * The inode has been written back. Check whether we still have
> +		 * pages under I/O and move the inode back to the primary list
> +		 * if necessary. sb_mark_inode_writeback() might not have done
> +		 * anything if the writeback tag hadn't been cleared from the
> +		 * mapping by the time more wb had started.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			spin_lock(&sb->s_inode_wblist_lock);
> +			if (list_empty(&inode->i_wb_list))
> +				list_add_tail(&inode->i_wb_list,
> +					      &sb->s_inodes_wb);
> +			spin_unlock(&sb->s_inode_wblist_lock);
> +		} else
> +			WARN_ON(!list_empty(&inode->i_wb_list));
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		iput(inode);
> +
>  		spin_lock(&sb->s_inode_list_lock);
> +		spin_lock_irq(&sb->s_inode_wblist_lock);
>  	}
> +
> +	spin_unlock_irq(&sb->s_inode_wblist_lock);
>  	spin_unlock(&sb->s_inode_list_lock);
> -	iput(old_inode);
> +
>  	mutex_unlock(&sb->s_sync_lock);
>  }
>
...
> +
> +	/*
> +	 * we can come through here when swapping anonymous
> +	 * pages, so we don't necessarily have an inode to
> +	 * track for sync. Inodes are removed lazily, so there is
> +	 * no equivalent in test_clear_page_writeback().

The comment about lazy removal is not true anymore...

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR