On Wed 13-01-16 09:33:05, Brian Foster wrote:
> On Wed, Jan 13, 2016 at 11:43:02AM +0100, Jan Kara wrote:
> > On Tue 12-01-16 13:35:38, Brian Foster wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
> > > wait_sb_inodes() currently does a walk of all inodes in the
> > > filesystem to find dirty ones to wait on during sync. This is highly
> > > inefficient and wastes a lot of CPU when there are lots of clean
> > > cached inodes that we don't need to wait on.
> > >
> > > To avoid this "all inode" walk, we need to track inodes that are
> > > currently under writeback that we need to wait for. We do this by
> > > adding inodes to a writeback list on the sb when the mapping is
> > > first tagged as having pages under writeback. wait_sb_inodes() can
> > > then walk this list of "inodes under IO" and wait specifically just
> > > for the inodes that the current sync(2) needs to wait for.
> > >
> > > Define a couple of helpers to add/remove an inode from the writeback
> > > list and call them when the overall mapping is tagged for or cleared
> > > from writeback. Update wait_sb_inodes() to walk only the inodes
> > > under writeback due to the sync.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > Signed-off-by: Josef Bacik <jbacik@xxxxxx>
> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> >
> > ...
> >
> > >  void inode_wait_for_writeback(struct inode *inode)
> > >  {
> > > +	BUG_ON(!(inode->i_state & I_FREEING));
> > > +
> > >  	spin_lock(&inode->i_lock);
> > >  	__inode_wait_for_writeback(inode);
> > >  	spin_unlock(&inode->i_lock);
> > > +
> > > +	/*
> > > +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> > > +	 * the inode and then deref bdev->bd_disk, which at this point has been
> > > +	 * set to NULL, so we would panic. At the point we are dropping our
> > > +	 * bd_inode we won't have any pages under writeback on the device so
> > > +	 * this is safe. But just in case we'll assert to make sure we don't
> > > +	 * screw this up.
> > > +	 */
> > > +	if (!sb_is_blkdev_sb(inode->i_sb))
> >
> > The condition and the comment can go away now. We no longer need to call
> > inode_to_bdi()...
>
> Yeah, I noted that in my follow-on mail. I think this whole hunk along
> with the change to kill_bdev() can both actually get dropped.
> kill_bdev() does truncate_inode_pages(), which eventually calls
> wait_on_page_writeback() for every page in the mapping. This function
> clearly already waits for inode writeback.
>
> Unless I'm misunderstanding something, the original changes in both
> places look like they were just there to ensure that the inode is
> removed from the wb list when wb completion didn't do so. That is no
> longer necessary now that wb completion takes care of that. I'll do some
> more testing of course, but any objections if I drop these hunks?

I agree with the removal.

> Alternatively, we could retain only the BUG_ON() bits here to make sure
> everything works as expected. Thoughts?

Just BUG_ON(!list_empty(&inode->i_wb_list)) in clear_inode() should be
enough...

> > > +		sb_clear_inode_writeback(inode);
> > > +	BUG_ON(!list_empty(&inode->i_wb_list));
> > >  }
> > >
> > >  /*
> > > @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> > >   */
> > >  static void wait_sb_inodes(struct super_block *sb)
> > >  {
> > > -	struct inode *inode, *old_inode = NULL;
> > > +	LIST_HEAD(sync_list);
> > >
> > >  	/*
> > >  	 * We need to be protected against the filesystem going from
> > > @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
> > >  	 */
> > >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > >
> > > +	/*
> > > +	 * Data integrity sync. Must wait for all pages under writeback, because
> > > +	 * there may have been pages dirtied before our sync call, but which had
> > > +	 * writeout started before we write it out. In which case, the inode
> > > +	 * may not be on the dirty list, but we still have to wait for that
> > > +	 * writeout.
> > > +	 *
> > > +	 * To avoid syncing inodes put under IO after we have started here,
> > > +	 * splice the io list to a temporary list head and walk that. Newly
> > > +	 * dirtied inodes will go onto the primary list so we won't wait for
> > > +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> > > +	 * so we'll complete processing the complete list before the next
> > > +	 * sync operation repeats the splice-and-walk process.
> > > +	 *
> > > +	 * s_inode_wblist_lock protects the wb list and is irq-safe as it is
> > > +	 * acquired inside of the mapping lock by __test_set_page_writeback().
> > > +	 * We cannot acquire i_lock while the wblist lock is held without
> > > +	 * introducing irq inversion issues. Since s_inodes_wb is a subset of
> > > +	 * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
> > > +	 * until we have a reference. Note that s_inode_wblist_lock protects the
> > > +	 * local sync_list as well because inodes can be dropped from either
> > > +	 * list by writeback completion.
> > > +	 */
> > >  	mutex_lock(&sb->s_sync_lock);
> > > +
> > >  	spin_lock(&sb->s_inode_list_lock);
> >
> > So I'm not very happy that we have to hold s_inode_list_lock here. After
> > all, reducing contention on that lock was one of the main motivations of
> > this patch. That being said, I think the locking is correct now, which is
> > a good start :), and we can work from here.
>
> Thanks.. I was shooting for correct and as simple as possible. ;) FWIW,
> the primary motivation that I'm aware of for this patch is the current
> behavior of unnecessarily walking a massive inode list when the inode
> cache is very large (more so than lock contention). I suspect lock
> contention in this path would more likely involve the wblist_lock at
> this point, since it protects the wblist.

Well, that likely depends on the workload.
Under some workloads the contention on sb_list_lock is heavy, because
a) sync uses it heavily, b) inode reclaim needs it, and c) loading of new
inodes into memory needs it. Now your patch shortens the lock hold times
for sync, which should help, but not having to touch it at all would be
even better :).

> > > +	spin_lock_irq(&sb->s_inode_wblist_lock);
> > > +	list_splice_init(&sb->s_inodes_wb, &sync_list);
> > >
> > > -	/*
> > > -	 * Data integrity sync. Must wait for all pages under writeback,
> > > -	 * because there may have been pages dirtied before our sync
> > > -	 * call, but which had writeout started before we write it out.
> > > -	 * In which case, the inode may not be on the dirty list, but
> > > -	 * we still have to wait for that writeout.
> > > -	 */
> > > -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > > +	while (!list_empty(&sync_list)) {
> > > +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> > > +						       i_wb_list);
> > >  		struct address_space *mapping = inode->i_mapping;
> > >
> > > +		list_del_init(&inode->i_wb_list);
> >
> > What if we just did:
> >
> >	list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> >
> > here instead of list_del_init()? That way we would always maintain
> > consistency:
> >
> >	PAGECACHE_TAG_WRITEBACK set iff !list_empty(&inode->i_wb_list)
> >
> > under s_inode_wblist_lock, and thus you could just delete that list
> > shuffling below...
>
> I assume you mean list_move_tail(), but yeah, I think you're right. We

Yes, I meant list_move_tail().

> can still use the mapping tag check optimization without removing the
> inode, since there's no risk of leaving it on a local list after the
> function returns. Beyond that, once the lock is dropped, the list state
> is simply dependent on whether more wb starts after the immediately
> previous wait. I'll stare at it some more, but it sounds like a nice
> cleanup to me.

Exactly.
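[Editor's note: the splice-and-walk pattern and the list_move_tail() variant
discussed above can be sketched in userspace C. This is a minimal sketch with
a toy intrusive list standing in for the kernel's <linux/list.h>; the names
(node, wb_list, sync_list, walk_wb_list) are illustrative, not the kernel's,
and the page-wait step is replaced by a counter.]

```c
#include <assert.h>

/* Toy intrusive doubly-linked list, modelled on the kernel's struct list_head. */
struct node { struct node *prev, *next; };

static void list_init(struct node *h) { h->prev = h->next = h; }
static int  list_empty(const struct node *h) { return h->next == h; }

static void list_add_tail(struct node *n, struct node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Move every entry of @from onto the tail of @to, leaving @from empty. */
static void list_splice_init(struct node *from, struct node *to)
{
	while (!list_empty(from)) {
		struct node *n = from->next;
		list_del_init(n);
		list_add_tail(n, to);
	}
}

/*
 * Splice-and-walk: grab a private snapshot of the writeback list so that
 * entries added after we start are not waited on. Each walked entry is
 * moved back to the tail of the primary list (the list_move_tail() variant
 * suggested in the review), so nothing is left on the local list when we
 * return. Returns the number of entries "waited" on.
 */
static int walk_wb_list(struct node *wb_list)
{
	struct node sync_list;
	int waited = 0;

	list_init(&sync_list);
	list_splice_init(wb_list, &sync_list);

	while (!list_empty(&sync_list)) {
		struct node *n = sync_list.next;

		/* list_move_tail(): back onto the primary list, not dropped */
		list_del_init(n);
		list_add_tail(n, wb_list);
		waited++;	/* stand-in for the per-page writeback wait */
	}
	return waited;
}
```

In the kernel the walk would additionally recheck the mapping's writeback tag
and take the locks discussed above; the sketch only shows the list discipline.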
								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR