Re: [RFC PATCH v5 1/2] sb: add a new writeback list for sync

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jan 12, 2016 at 01:35:38PM -0500, Brian Foster wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> wait_sb_inodes() currently does a walk of all inodes in the
> filesystem to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
> 
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the sb when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
> 
> Define a couple helpers to add/remove an inode from the writeback
> list and call them when the overall mapping is tagged for or cleared
> from writeback. Update wait_sb_inodes() to walk only the inodes
> under writeback due to the sync.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Signed-off-by: Josef Bacik <jbacik@xxxxxx>
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
>  fs/block_dev.c            |   1 +
>  fs/fs-writeback.c         | 139 +++++++++++++++++++++++++++++++++++++---------
>  fs/inode.c                |   1 +
>  fs/super.c                |   2 +
>  include/linux/fs.h        |   4 ++
>  include/linux/writeback.h |   3 +
>  mm/page-writeback.c       |  19 +++++++
>  7 files changed, 144 insertions(+), 25 deletions(-)
> 
...
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 023f6a1..6821347 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
...
> @@ -1118,13 +1149,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
>  }
>  
>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> +	 * the inode and then deref bdev->bd_disk, which at this point has been
> +	 * set to NULL, so we would panic.  At the point we are dropping our
> +	 * bd_inode we won't have any pages under writeback on the device so
> +	 * this is safe.  But just in case we'll assert to make sure we don't
> +	 * screw this up.
> +	 */

While the comment above is no longer relevant now that this doesn't
touch the bdi, I'm not totally sure whether the following few lines are
still necessary. I think this was associated with the lazy list removal
(to remove the inode on eviction), but I'd have to dig into that a bit
more to be sure...

Brian

> +	if (!sb_is_blkdev_sb(inode->i_sb))
> +		sb_clear_inode_writeback(inode);
> +	BUG_ON(!list_empty(&inode->i_wb_list));
>  }
>  
>  /*
> @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	LIST_HEAD(sync_list);
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> +	/*
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out.  In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * s_inode_wblist_lock protects the wb list and is irq-safe as it is
> +	 * acquired inside of the mapping lock by __test_set_page_writeback().
> +	 * We cannot acquire i_lock while the wblist lock is held without
> +	 * introducing irq inversion issues. Since s_inodes_wb is a subset of
> +	 * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
> +	 * until we have a reference. Note that s_inode_wblist_lock protects the
> +	 * local sync_list as well because inodes can be dropped from either
> +	 * list by writeback completion.
> +	 */
>  	mutex_lock(&sb->s_sync_lock);
> +
>  	spin_lock(&sb->s_inode_list_lock);
> +	spin_lock_irq(&sb->s_inode_wblist_lock);
> +	list_splice_init(&sb->s_inodes_wb, &sync_list);
>  
> -	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> -	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> +		list_del_init(&inode->i_wb_list);
> +
> +		/*
> +		 * The mapping can be cleaned before we wait on the inode since
> +		 * we do not have the mapping lock.
> +		 */
> +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
> +			continue;
> +
> +		spin_unlock_irq(&sb->s_inode_wblist_lock);
> +
>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>  			spin_unlock(&inode->i_lock);
> +
> +			spin_lock_irq(&sb->s_inode_wblist_lock);
>  			continue;
>  		}
>  		__iget(inode);
> @@ -2140,29 +2219,39 @@ static void wait_sb_inodes(struct super_block *sb)
>  		spin_unlock(&sb->s_inode_list_lock);
>  
>  		/*
> -		 * We hold a reference to 'inode' so it couldn't have been
> -		 * removed from s_inodes list while we dropped the
> -		 * s_inode_list_lock.  We cannot iput the inode now as we can
> -		 * be holding the last reference and we cannot iput it under
> -		 * s_inode_list_lock. So we keep the reference and iput it
> -		 * later.
> -		 */
> -		iput(old_inode);
> -		old_inode = inode;
> -
> -		/*
>  		 * We keep the error status of individual mapping so that
>  		 * applications can catch the writeback error using fsync(2).
>  		 * See filemap_fdatawait_keep_errors() for details.
>  		 */
>  		filemap_fdatawait_keep_errors(mapping);
> -
>  		cond_resched();
>  
> +		/*
> +		 * The inode has been written back. Check whether we still have
> +		 * pages under I/O and move the inode back to the primary list
> +		 * if necessary. sb_mark_inode_writeback() might not have done
> +		 * anything if the writeback tag hadn't been cleared from the
> +		 * mapping by the time more wb had started.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			spin_lock(&sb->s_inode_wblist_lock);
> +			if (list_empty(&inode->i_wb_list))
> +				list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> +			spin_unlock(&sb->s_inode_wblist_lock);
> +		} else
> +			WARN_ON(!list_empty(&inode->i_wb_list));
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		iput(inode);
> +
>  		spin_lock(&sb->s_inode_list_lock);
> +		spin_lock_irq(&sb->s_inode_wblist_lock);
>  	}
> +
> +	spin_unlock_irq(&sb->s_inode_wblist_lock);
>  	spin_unlock(&sb->s_inode_list_lock);
> -	iput(old_inode);
> +
>  	mutex_unlock(&sb->s_sync_lock);
>  }
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index 5bb85a0..c7a9581 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -358,6 +358,7 @@ void inode_init_once(struct inode *inode)
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
>  	INIT_LIST_HEAD(&inode->i_io_list);
> +	INIT_LIST_HEAD(&inode->i_wb_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
>  	address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
> diff --git a/fs/super.c b/fs/super.c
> index 954aeb8..86822b8 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -206,6 +206,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  	mutex_init(&s->s_sync_lock);
>  	INIT_LIST_HEAD(&s->s_inodes);
>  	spin_lock_init(&s->s_inode_list_lock);
> +	INIT_LIST_HEAD(&s->s_inodes_wb);
> +	spin_lock_init(&s->s_inode_wblist_lock);
>  
>  	if (list_lru_init_memcg(&s->s_dentry_lru))
>  		goto fail;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ef3cd36..aef01c7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -648,6 +648,7 @@ struct inode {
>  #endif
>  	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
> +	struct list_head	i_wb_list;	/* backing dev writeback list */
>  	union {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
> @@ -1374,6 +1375,9 @@ struct super_block {
>  	/* s_inode_list_lock protects s_inodes */
>  	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
>  	struct list_head	s_inodes;	/* all inodes */
> +
> +	spinlock_t		s_inode_wblist_lock;
> +	struct list_head	s_inodes_wb;	/* writeback inodes */
>  };
>  
>  extern struct timespec current_fs_time(struct super_block *sb);
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index b333c94..90a380c 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -379,4 +379,7 @@ void tag_pages_for_writeback(struct address_space *mapping,
>  
>  void account_page_redirty(struct page *page);
>  
> +void sb_mark_inode_writeback(struct inode *inode);
> +void sb_clear_inode_writeback(struct inode *inode);
> +
>  #endif		/* WRITEBACK_H */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d15d88c..dcd4e2b 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2735,6 +2735,11 @@ int test_clear_page_writeback(struct page *page)
>  				__wb_writeout_inc(wb);
>  			}
>  		}
> +
> +		if (mapping->host && !mapping_tagged(mapping,
> +						     PAGECACHE_TAG_WRITEBACK))
> +			sb_clear_inode_writeback(mapping->host);
> +
>  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
>  	} else {
>  		ret = TestClearPageWriteback(page);
> @@ -2763,11 +2768,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>  		spin_lock_irqsave(&mapping->tree_lock, flags);
>  		ret = TestSetPageWriteback(page);
>  		if (!ret) {
> +			bool on_wblist;
> +
> +			on_wblist = mapping_tagged(mapping,
> +						   PAGECACHE_TAG_WRITEBACK);
> +
>  			radix_tree_tag_set(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
>  			if (bdi_cap_account_writeback(bdi))
>  				__inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
> +
> +			/*
> +			 * we can come through here when swapping anonymous
> +			 * pages, so we don't necessarily have an inode to
> +			 * track for sync. Inodes are remove lazily, so there is
> +			 * no equivalent in test_clear_page_writeback().
> +			 */
> +			if (!on_wblist && mapping->host)
> +				sb_mark_inode_writeback(mapping->host);
>  		}
>  		if (!PageDirty(page))
>  			radix_tree_tag_clear(&mapping->page_tree,
> -- 
> 2.4.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux