On Jan 11, 2018, at 6:07 PM, Jan Kara <jack@xxxxxxx> wrote:
>
> On Thu 11-01-18 05:30:33, Amir Goldstein wrote:
>> [CC: Jan Kara]
>>
>> On Thu, Jan 11, 2018 at 3:48 AM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>>> Currently a syncfs(2) call on overlayfs simply calls sync_filesystem()
>>> on the upper_sb to synchronize all dirty inodes in the upper
>>> filesystem, regardless of which overlayfs instance the inodes belong
>>> to. This has the obvious shortcomings below.
>>>
>>> (1) Performance
>>> Synchronization is probably heavy in most cases, especially when the
>>> upper filesystem is not dedicated to the target overlayfs.
>>>
>>> (2) Interference
>>> Unplanned synchronization will probably impact the IO performance of
>>> processes that use other overlayfs instances on the same upper
>>> filesystem, or that use the upper filesystem directly.
>>>
>>> This patch iterates over the overlay inodes and syncs only the target
>>> dirty inodes in the upper filesystem. By doing this, it reduces the
>>> cost of synchronization and does not seriously impact the IO
>>> performance of unrelated processes.
>>>
>>> Signed-off-by: Chengguang Xu <cgxu519@xxxxxxxxxx>
>>
>> Looks good.
>>
>> Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx>
>>
>> Jan,
>>
>> Would you mind taking a quick look at this?
>>
>> Overlayfs inodes have no dirty mapping, only the underlying upper
>> inodes do, so sync_inodes_sb() is not enough for overlay syncfs().
>>
>> This code creates the list of upper inodes to sync, syncs them and
>> then calls the underlying upper fs sync_fs().
>>
>> Does this look reasonable?
>
> If I understand the original changelog correctly, this is a
> performance-motivated change. As such, it should have some measurements
> accompanying it to justify the change. I.e., how much does this help?
> What is the exact setup where this matters? Do users really care?
>
> More technical comments inline.
>
>>> @@ -80,6 +83,8 @@ struct ovl_inode {
>>>
>>>  	/* synchronize copy up and more */
>>>  	struct mutex lock;
>>> +	/* ovl inode sync list */
>>> +	struct list_head sync_list;
>>>  };
>
> Additional overhead of 16 bytes per inode - not a huge deal since this
> is overlay specific, but it's something to keep in mind. Couldn't we
> reuse i_io_list of the overlay inode instead?

Hi Jan,

Thanks for your review and comments.

Actually, I considered using i_io_list in my very first version;
however, being on that list implies the inode is attached to a proper
bdi_writeback, which overlayfs does not have, so it might be
incompatible in some contexts (for example, evict). Maybe we can reuse
i_wb_list instead.

>
>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>>> index 76440fe..947bb83 100644
>>> --- a/fs/overlayfs/super.c
>>> +++ b/fs/overlayfs/super.c
> ...
>>> +static void ovl_sync_inodes(struct super_block *sb)
>>> +{
>>> +	struct ovl_fs *ofs = sb->s_fs_info;
>>> +	struct inode *inode, *i_next;
>>> +	struct inode *upper_inode;
>>> +	struct blk_plug plug;
>>> +
>>> +	struct writeback_control wbc = {
>>> +		.sync_mode	= WB_SYNC_ALL,
>>> +		.for_sync	= 1,
>>> +		.range_start	= 0,
>>> +		.range_end	= LLONG_MAX,
>>> +		.nr_to_write	= LONG_MAX,
>>> +	};
>>> +
>>> +	blk_start_plug(&plug);
>>> +	spin_lock(&sb->s_inode_list_lock);
>>> +	list_for_each_entry_safe(inode, i_next, &sb->s_inodes, i_sb_list) {
>
> Why a _safe iteration variant here? If you hope it allows you to safely
> resume iteration after dropping sb->s_inode_list_lock, you are wrong,
> because the i_next inode can be removed from the list once you drop
> that lock as well. You need to be more careful when iterating inodes.
> See how e.g. fs/notify/fsnotify.c:fsnotify_unmount_inodes() plays
> tricks with always keeping one inode reference to be able to resume
> iteration of the inode list.
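For reference, the trick Jan points to looks roughly like this - a
minimal sketch modeled on fsnotify_unmount_inodes() from kernels of
this era, with the per-inode sync work elided; the function name is
made up for illustration and this is not the actual overlayfs patch:

static void ovl_sync_inodes_sketch(struct super_block *sb)
{
	struct inode *inode, *iput_inode = NULL;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		/* Pin the inode so it cannot be evicted off s_inodes. */
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		/* Drop the previous round's reference outside the locks. */
		iput(iput_inode);
		iput_inode = inode;

		/* ... sync the underlying upper inode here ... */

		/*
		 * Safe to resume: we still hold a reference on "inode",
		 * so it is guaranteed to remain on the i_sb_list.
		 */
		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(iput_inode);	/* iput(NULL) is a no-op */
}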
> Another issue is that this will iterate all inodes on the overlayfs
> list, both clean and dirty ones (well, those whose underlying upper
> inode is dirty). When there are only a few underlying dirty inodes,
> this may be considerably slower than the original behavior. So the
> changelog needs to explain why the use case you care about is more
> important than the use case this regresses. You might think that this
> only adds an iteration of an in-memory list, but on large machines
> this list may have millions of entries, so its iteration is not
> exactly fast.
>
>>> +		upper_inode = ovl_inode_upper(inode);
>>> +		if (!upper_inode)
>>> +			continue;
>>> +
>>> +		spin_lock(&upper_inode->i_lock);
>>> +		if (upper_inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE) ||
>>> +		    list_empty(&upper_inode->i_io_list)) {
>>> +			spin_unlock(&upper_inode->i_lock);
>>> +			continue;
>>> +		}
>>> +		spin_unlock(&upper_inode->i_lock);
>>> +
>>> +		if (!igrab(inode))
>>> +			continue;
>>> +
>>> +		if (!igrab(upper_inode)) {
>>> +			iput(inode);
>
> iput() under a spinlock? Not good; that always needs a very good
> explanation of why this cannot be the last reference to drop.
> Preferably avoid this altogether.

Hi Amir,

I reckon the upper inode won't be deleted before the overlay inode, so
we can just hold a reference on the overlay inode here, am I right?

Thanks,
Chengguang.
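For illustration, if that lifetime assumption holds - and it is an
assumption this thread does not verify - the grab/put pairing in the
loop body could shrink to something like the following untested
fragment, which reuses the deferred iput_inode trick from the sketch
above and keeps iput() away from sb->s_inode_list_lock:

		/*
		 * Sketch only, assuming the overlay inode holds a
		 * reference on its upper inode for its whole lifetime:
		 * pinning the overlay inode is then enough to keep
		 * upper_inode stable, so the igrab(upper_inode) and the
		 * iput() under sb->s_inode_list_lock both go away.
		 */
		if (!igrab(inode))
			continue;
		spin_unlock(&sb->s_inode_list_lock);

		/* previous round's reference, dropped with no locks held */
		iput(iput_inode);
		iput_inode = inode;

		/* ... write back upper_inode's dirty pages here ... */

		spin_lock(&sb->s_inode_list_lock);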