On Tue 12-01-16 13:35:38, Brian Foster wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> wait_sb_inodes() currently does a walk of all inodes in the
> filesystem to find dirty ones to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
>
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the sb when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
>
> Define a couple helpers to add/remove an inode from the writeback
> list and call them when the overall mapping is tagged for or cleared
> from writeback. Update wait_sb_inodes() to walk only the inodes
> under writeback due to the sync.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Signed-off-by: Josef Bacik <jbacik@xxxxxx>
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
...
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> +	 * the inode and then deref bdev->bd_disk, which at this point has been
> +	 * set to NULL, so we would panic. At the point we are dropping our
> +	 * bd_inode we won't have any pages under writeback on the device so
> +	 * this is safe. But just in case we'll assert to make sure we don't
> +	 * screw this up.
> +	 */
> +	if (!sb_is_blkdev_sb(inode->i_sb))

The condition and the comment can go away now. We no longer need to call
inode_to_bdi()...

> +		sb_clear_inode_writeback(inode);
> +	BUG_ON(!list_empty(&inode->i_wb_list));
>  }
>
>  /*
> @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	LIST_HEAD(sync_list);
>
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> +	/*
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out. In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * s_inode_wblist_lock protects the wb list and is irq-safe as it is
> +	 * acquired inside of the mapping lock by __test_set_page_writeback().
> +	 * We cannot acquire i_lock while the wblist lock is held without
> +	 * introducing irq inversion issues. Since s_inodes_wb is a subset of
> +	 * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
> +	 * until we have a reference. Note that s_inode_wblist_lock protects the
> +	 * local sync_list as well because inodes can be dropped from either
> +	 * list by writeback completion.
> +	 */
>  	mutex_lock(&sb->s_sync_lock);
> +
>  	spin_lock(&sb->s_inode_list_lock);

So I'm not very happy that we have to hold s_inode_list_lock here. After
all, reducing contention on that lock was one of the main motivations of
this patch. That being said, I think the locking is correct now, which is
a good start :), and we can work from here.

As a future improvement I think we could use RCU to grab an inode
reference safely: hold rcu_read_lock() instead of s_inode_list_lock. That
guarantees that the memory for the inode we have a reference to is not
freed, so we can then do:

	spin_unlock_irq(&sb->s_inode_wblist_lock);
	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW))
		avoid touching inode
	__iget(inode);
	spin_unlock(&inode->i_lock);
	rcu_read_unlock();

But I'd do this as a separate patch and run this through Al to verify...
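Fleshed out a little, the RCU variant of the walk could look something
like the sketch below. It is completely untested; I'm assuming the elided
wait in the loop body is filemap_fdatawait_keep_errors(), and I've left
out the put-back of inodes that are still tagged for writeback (discussed
further down), so treat it as a shape of the idea rather than the patch:

	rcu_read_lock();
	spin_lock_irq(&sb->s_inode_wblist_lock);
	list_splice_init(&sb->s_inodes_wb, &sync_list);
	while (!list_empty(&sync_list)) {
		struct inode *inode = list_first_entry(&sync_list,
						struct inode, i_wb_list);

		list_del_init(&inode->i_wb_list);
		spin_unlock_irq(&sb->s_inode_wblist_lock);

		/*
		 * rcu_read_lock() keeps the inode memory valid, so taking
		 * i_lock and checking whether the inode is still live is
		 * safe even without holding s_inode_list_lock.
		 */
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
			spin_unlock(&inode->i_lock);
			spin_lock_irq(&sb->s_inode_wblist_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();

		/* We hold a real reference now, so it is safe to block. */
		filemap_fdatawait_keep_errors(inode->i_mapping);
		cond_resched();
		iput(inode);

		rcu_read_lock();
		spin_lock_irq(&sb->s_inode_wblist_lock);
	}
	spin_unlock_irq(&sb->s_inode_wblist_lock);
	rcu_read_unlock();

Note that iput() can drop the last reference and sleep, which is why the
loop leaves the RCU read section before waiting and re-enters it only
when going back to the list.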
> +	spin_lock_irq(&sb->s_inode_wblist_lock);
> +	list_splice_init(&sb->s_inodes_wb, &sync_list);
>
> -	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> -	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>
> +		list_del_init(&inode->i_wb_list);

What if we just did:

	list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);

here instead of list_del_init()? That way we would always maintain the
invariant:

	PAGECACHE_TAG_WRITEBACK set iff !list_empty(&inode->i_wb_list)

under s_inode_wblist_lock, and thus you could just delete that list
shuffling below...

> +		/*
> +		 * The inode has been written back. Check whether we still have
> +		 * pages under I/O and move the inode back to the primary list
> +		 * if necessary. sb_mark_inode_writeback() might not have done
> +		 * anything if the writeback tag hadn't been cleared from the
> +		 * mapping by the time more wb had started.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			spin_lock(&sb->s_inode_wblist_lock);
> +			if (list_empty(&inode->i_wb_list))
> +				list_add_tail(&inode->i_wb_list,
> +					      &sb->s_inodes_wb);
> +			spin_unlock(&sb->s_inode_wblist_lock);
> +		} else
> +			WARN_ON(!list_empty(&inode->i_wb_list));
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		iput(inode);
> +
>  		spin_lock(&sb->s_inode_list_lock);
> +		spin_lock_irq(&sb->s_inode_wblist_lock);
>  	}
> +
> +	spin_unlock_irq(&sb->s_inode_wblist_lock);
>  	spin_unlock(&sb->s_inode_list_lock);
> -	iput(old_inode);
> +
>  	mutex_unlock(&sb->s_sync_lock);
>  }
>
...
> +
> +	/*
> +	 * we can come through here when swapping anonymous
> +	 * pages, so we don't necessarily have an inode to
> +	 * track for sync. Inodes are removed lazily, so there is
> +	 * no equivalent in test_clear_page_writeback().

The comment about lazy removal is not true anymore...

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR