On Tue, Jan 12, 2016 at 01:35:38PM -0500, Brian Foster wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> wait_sb_inodes() currently does a walk of all inodes in the
> filesystem to find dirty ones to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
>
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the sb when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
>
> Define a couple of helpers to add/remove an inode from the writeback
> list and call them when the overall mapping is tagged for or cleared
> from writeback. Update wait_sb_inodes() to walk only the inodes
> under writeback due to the sync.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Signed-off-by: Josef Bacik <jbacik@xxxxxx>
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
>  fs/block_dev.c            |   1 +
>  fs/fs-writeback.c         | 139 +++++++++++++++++++++++++++++++++++++---------
>  fs/inode.c                |   1 +
>  fs/super.c                |   2 +
>  include/linux/fs.h        |   4 ++
>  include/linux/writeback.h |   3 +
>  mm/page-writeback.c       |  19 +++++++
>  7 files changed, 144 insertions(+), 25 deletions(-)
>
...
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 023f6a1..6821347 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
...
> @@ -1118,13 +1149,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
>  }
>
>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> +	 * the inode and then deref bdev->bd_disk, which at this point has been
> +	 * set to NULL, so we would panic. At the point we are dropping our
> +	 * bd_inode we won't have any pages under writeback on the device so
> +	 * this is safe. But just in case we'll assert to make sure we don't
> +	 * screw this up.
> +	 */

While the comment above is no longer relevant now that this doesn't
touch the bdi, I'm not totally sure whether the following few lines are
still necessary. I think this was associated with the lazy list removal
(to remove the inode on eviction), but I'd have to dig into that a bit
more to be sure...

Brian

> +	if (!sb_is_blkdev_sb(inode->i_sb))
> +		sb_clear_inode_writeback(inode);
> +	BUG_ON(!list_empty(&inode->i_wb_list));
>  }
>
>  /*
> @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	LIST_HEAD(sync_list);
>
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> +	/*
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out. In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * s_inode_wblist_lock protects the wb list and is irq-safe as it is
> +	 * acquired inside of the mapping lock by __test_set_page_writeback().
> +	 * We cannot acquire i_lock while the wblist lock is held without
> +	 * introducing irq inversion issues. Since s_inodes_wb is a subset of
> +	 * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
> +	 * until we have a reference. Note that s_inode_wblist_lock protects the
> +	 * local sync_list as well because inodes can be dropped from either
> +	 * list by writeback completion.
> +	 */
>  	mutex_lock(&sb->s_sync_lock);
> +
>  	spin_lock(&sb->s_inode_list_lock);
> +	spin_lock_irq(&sb->s_inode_wblist_lock);
> +	list_splice_init(&sb->s_inodes_wb, &sync_list);
>
> -	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> -	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>
> +		list_del_init(&inode->i_wb_list);
> +
> +		/*
> +		 * The mapping can be cleaned before we wait on the inode since
> +		 * we do not have the mapping lock.
> +		 */
> +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
> +			continue;
> +
> +		spin_unlock_irq(&sb->s_inode_wblist_lock);
> +
>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>  			spin_unlock(&inode->i_lock);
> +
> +			spin_lock_irq(&sb->s_inode_wblist_lock);
>  			continue;
>  		}
>  		__iget(inode);
> @@ -2140,29 +2219,39 @@ static void wait_sb_inodes(struct super_block *sb)
>  		spin_unlock(&sb->s_inode_list_lock);
>
>  		/*
> -		 * We hold a reference to 'inode' so it couldn't have been
> -		 * removed from s_inodes list while we dropped the
> -		 * s_inode_list_lock. We cannot iput the inode now as we can
> -		 * be holding the last reference and we cannot iput it under
> -		 * s_inode_list_lock. So we keep the reference and iput it
> -		 * later.
> -		 */
> -		iput(old_inode);
> -		old_inode = inode;
> -
> -		/*
>  		 * We keep the error status of individual mapping so that
>  		 * applications can catch the writeback error using fsync(2).
>  		 * See filemap_fdatawait_keep_errors() for details.
>  		 */
>  		filemap_fdatawait_keep_errors(mapping);
> -
>  		cond_resched();
>
> +		/*
> +		 * The inode has been written back. Check whether we still have
> +		 * pages under I/O and move the inode back to the primary list
> +		 * if necessary. sb_mark_inode_writeback() might not have done
> +		 * anything if the writeback tag hadn't been cleared from the
> +		 * mapping by the time more wb had started.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			spin_lock(&sb->s_inode_wblist_lock);
> +			if (list_empty(&inode->i_wb_list))
> +				list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> +			spin_unlock(&sb->s_inode_wblist_lock);
> +		} else
> +			WARN_ON(!list_empty(&inode->i_wb_list));
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		iput(inode);
> +
>  		spin_lock(&sb->s_inode_list_lock);
> +		spin_lock_irq(&sb->s_inode_wblist_lock);
>  	}
> +
> +	spin_unlock_irq(&sb->s_inode_wblist_lock);
>  	spin_unlock(&sb->s_inode_list_lock);
> -	iput(old_inode);
> +
>  	mutex_unlock(&sb->s_sync_lock);
>  }
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 5bb85a0..c7a9581 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -358,6 +358,7 @@ void inode_init_once(struct inode *inode)
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
>  	INIT_LIST_HEAD(&inode->i_io_list);
> +	INIT_LIST_HEAD(&inode->i_wb_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
>  	address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
> diff --git a/fs/super.c b/fs/super.c
> index 954aeb8..86822b8 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -206,6 +206,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  	mutex_init(&s->s_sync_lock);
>  	INIT_LIST_HEAD(&s->s_inodes);
>  	spin_lock_init(&s->s_inode_list_lock);
> +	INIT_LIST_HEAD(&s->s_inodes_wb);
> +	spin_lock_init(&s->s_inode_wblist_lock);
>
>  	if (list_lru_init_memcg(&s->s_dentry_lru))
>  		goto fail;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ef3cd36..aef01c7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -648,6 +648,7 @@ struct inode {
>  #endif
>  	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
> +	struct list_head	i_wb_list;	/* backing dev writeback list */
>  	union {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
> @@ -1374,6 +1375,9 @@ struct super_block {
>  	/* s_inode_list_lock protects s_inodes */
>  	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
>  	struct list_head	s_inodes;	/* all inodes */
> +
> +	spinlock_t		s_inode_wblist_lock;
> +	struct list_head	s_inodes_wb;	/* writeback inodes */
>  };
>
>  extern struct timespec current_fs_time(struct super_block *sb);
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index b333c94..90a380c 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -379,4 +379,7 @@ void tag_pages_for_writeback(struct address_space *mapping,
>
>  void account_page_redirty(struct page *page);
>
> +void sb_mark_inode_writeback(struct inode *inode);
> +void sb_clear_inode_writeback(struct inode *inode);
> +
>  #endif		/* WRITEBACK_H */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d15d88c..dcd4e2b 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2735,6 +2735,11 @@ int test_clear_page_writeback(struct page *page)
>  				__wb_writeout_inc(wb);
>  			}
>  		}
> +
> +		if (mapping->host && !mapping_tagged(mapping,
> +						     PAGECACHE_TAG_WRITEBACK))
> +			sb_clear_inode_writeback(mapping->host);
> +
>  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
>  	} else {
>  		ret = TestClearPageWriteback(page);
> @@ -2763,11 +2768,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>  		spin_lock_irqsave(&mapping->tree_lock, flags);
>  		ret = TestSetPageWriteback(page);
>  		if (!ret) {
> +			bool on_wblist;
> +
> +			on_wblist = mapping_tagged(mapping,
> +						   PAGECACHE_TAG_WRITEBACK);
> +
>  			radix_tree_tag_set(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
>  			if (bdi_cap_account_writeback(bdi))
>  				__inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
> +
> +			/*
> +			 * We can come through here when swapping anonymous
> +			 * pages, so we don't necessarily have an inode to
> +			 * track for sync. Inodes are removed lazily, so there
> +			 * is no equivalent in test_clear_page_writeback().
> +			 */
> +			if (!on_wblist && mapping->host)
> +				sb_mark_inode_writeback(mapping->host);
>  		}
>  		if (!PageDirty(page))
>  			radix_tree_tag_clear(&mapping->page_tree,
> --
> 2.4.3
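
As an aside for anyone reading along: the sb_mark_inode_writeback() and
sb_clear_inode_writeback() helpers themselves fall in the elided portion
of the fs/fs-writeback.c diff above, so only their declarations are
visible here. Going by the commit message and the s_inode_wblist_lock
usage shown in wait_sb_inodes(), they presumably look something like the
sketch below; the double-check-under-lock pattern and the irqsave locking
are my assumptions, not the patch's verbatim code:

/*
 * Sketch only (assumed implementation, not quoted from the patch): link an
 * inode onto the per-sb writeback list when its mapping first gains the
 * PAGECACHE_TAG_WRITEBACK tag. The caller may hold mapping->tree_lock with
 * interrupts disabled, hence the irqsave variant on s_inode_wblist_lock.
 */
void sb_mark_inode_writeback(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	unsigned long flags;

	/* unlocked pre-check; recheck under the lock before linking */
	if (list_empty(&inode->i_wb_list)) {
		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
		if (list_empty(&inode->i_wb_list))
			list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
	}
}

/*
 * Sketch only: unlink an inode from the per-sb writeback list once the last
 * page under writeback in its mapping completes.
 */
void sb_clear_inode_writeback(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	unsigned long flags;

	if (!list_empty(&inode->i_wb_list)) {
		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
		if (!list_empty(&inode->i_wb_list))
			list_del_init(&inode->i_wb_list);
		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
	}
}

The unlocked list_empty() pre-check would keep inodes that are already on
(or already off) the list from touching s_inode_wblist_lock at all, which
matters because these hooks sit on the page writeback fast path in
__test_set_page_writeback() and test_clear_page_writeback().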