Re: [PATCH v6] ovl: Improving syncfs efficiency

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> 
> 在 2018年1月22日,下午7:27,Amir Goldstein <amir73il@xxxxxxxxx> 写道:
> 
> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@xxxxxxxxxx> wrote:
>> Currently syncfs(2) call on overlayfs just simply call sync_filesystem()
>> on upper_sb to synchronize whole dirty inodes in upper filesystem
>> regardless of the overlay ownership of the inode. In the use case of
>> container, when multiple containers using the same underlying upper
>> filesystem, it has some shortcomings as below.
>> 
>> (1) Performance
>> Synchronization is probably heavy because it actually syncs unnecessary
>> inodes for target overlayfs.
>> 
>> (2) Interference
>> Unplanned synchronization will probably impact IO performance of
>> unrelated container processes on the other overlayfs.
>> 
>> This patch iterates upper inode list in overlayfs to only sync target
>> dirty inodes and wait for completion. By doing this, It is able to reduce
>> cost of synchronization and will not seriously impact IO performance of
>> unrelated processes. In special case, when having very few dirty inodes
>> and a lot of clean upper inodes in overlayfs, then iteration may waste
>> time so that synchronization is slower than before, but we think the
>> potential for performance improvement far surpass the potential for
>> performance regression in most cases.
>> 
>> Signed-off-by: Chengguang Xu <cgxu519@xxxxxxxxxx>
>> ---
>> Changes since v5:
>> - Move sync related functions to new file sync.c.
>> - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
>> - If ovl indoe flag is I_WILL_FREE or I_FREEING then syncfs
> 
> typo 'indo' and in comment below as well.
> 
>> will wait until OVL_EVICT_PENDING to be cleared.
>> - Move inode sync operation into evict_inode from drop_inode.
>> - Call upper_sb->sync_fs with no-wait after sync dirty upper
>> inodes.
>> - Hold s_inode_wblist_lock until deleting item from list.
>> - Remove write_inode fuction because no more used.
>> 
>> Changes since v4:
>> - Add syncing operation and deleting from upper_inodes list
>> during ovl inode destruction.
>> - Export symbol of inode_wait_for_writeback() for waiting
>> writeback on ovl inode.
>> - Reuse I_SYNC flag to avoid race conditions between syncfs
>> and drop_inode.
>> 
>> Changes since v3:
>> - Introduce upper_inode list to organize ovl indoe which has upper inode.
>> - Avoid iput operation inside spin lock.
>> - Change list iteration method for safety.
>> - Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
>> - Modify commit log and add more comments to the code.
>> 
>> Changes since v2:
>> - Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
>> overlayfs.
>> - Factoring out sync operation to different helpers for easy understanding.
>> 
>> Changes since v1:
>> - If upper filesystem is readonly mode then skip synchronization.
>> - Introduce global wait list to replace temporary wait list for
>> concurrent synchronization.
>> 
>> fs/overlayfs/Makefile    |   2 +-
>> fs/overlayfs/overlayfs.h |   8 ++
>> fs/overlayfs/ovl_entry.h |   5 +
>> fs/overlayfs/super.c     |  44 +++-----
>> fs/overlayfs/sync.c      | 256 +++++++++++++++++++++++++++++++++++++++++++++++
>> fs/overlayfs/util.c      |  23 ++++-
>> 6 files changed, 305 insertions(+), 33 deletions(-)
>> create mode 100644 fs/overlayfs/sync.c
>> 
>> diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
>> index 99373bb..6e92c0e 100644
>> --- a/fs/overlayfs/Makefile
>> +++ b/fs/overlayfs/Makefile
>> @@ -4,4 +4,4 @@
>> 
>> obj-$(CONFIG_OVERLAY_FS) += overlay.o
>> 
>> -overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o
>> +overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o sync.o
>> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
>> index b489099..c6a9ecd 100644
>> --- a/fs/overlayfs/overlayfs.h
>> +++ b/fs/overlayfs/overlayfs.h
>> @@ -34,6 +34,8 @@ enum ovl_flag {
>>        /* Non-merge dir that may contain whiteout entries */
>>        OVL_WHITEOUTS,
>>        OVL_INDEX,
>> +       /* Set when ovl inode is about to be freed */
>> +       OVL_EVICT_PENDING,
>> };
>> 
>> /*
>> @@ -241,6 +243,7 @@ int ovl_check_setxattr(struct dentry *dentry, struct dentry *upperdentry,
>> int ovl_nlink_start(struct dentry *dentry, bool *locked);
>> void ovl_nlink_end(struct dentry *dentry, bool locked);
>> int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
>> +void ovl_detach_upper_inodes_list(struct inode *inode);
>> 
>> static inline bool ovl_is_impuredir(struct dentry *dentry)
>> {
>> @@ -322,3 +325,8 @@ int ovl_create_real(struct inode *dir, struct dentry *newdentry,
>> int ovl_copy_xattr(struct dentry *old, struct dentry *new);
>> int ovl_set_attr(struct dentry *upper, struct kstat *stat);
>> struct ovl_fh *ovl_encode_fh(struct dentry *lower, bool is_upper);
>> +
>> +/* sync.c */
>> +int ovl_write_inode(struct inode *inode, struct writeback_control *wbc);
> 
> leftover
> 
>> +void ovl_evict_inode(struct inode *inode);
>> +int ovl_sync_fs(struct super_block *sb, int wait);
>> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
>> index 9d0bc03..53909ba 100644
>> --- a/fs/overlayfs/ovl_entry.h
>> +++ b/fs/overlayfs/ovl_entry.h
>> @@ -52,6 +52,9 @@ struct ovl_fs {
>>        /* Did we take the inuse lock? */
>>        bool upperdir_locked;
>>        bool workdir_locked;
>> +       /* upper inode list and lock */
>> +       spinlock_t upper_inodes_lock;
>> +       struct list_head upper_inodes;
>> };
>> 
>> /* private information held for every overlayfs dentry */
>> @@ -80,6 +83,8 @@ struct ovl_inode {
>> 
>>        /* synchronize copy up and more */
>>        struct mutex lock;
>> +       /* upper inodes list */
>> +       struct list_head upper_inodes_list;
>> };
>> 
>> static inline struct ovl_inode *OVL_I(struct inode *inode)
>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>> index 76440fe..9315155 100644
>> --- a/fs/overlayfs/super.c
>> +++ b/fs/overlayfs/super.c
>> @@ -195,6 +195,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
>>        oi->__upperdentry = NULL;
>>        oi->lower = NULL;
>>        mutex_init(&oi->lock);
>> +       INIT_LIST_HEAD(&oi->upper_inodes_list);
>> 
>>        return &oi->vfs_inode;
>> }
>> @@ -252,36 +253,6 @@ static void ovl_put_super(struct super_block *sb)
>>        ovl_free_fs(ofs);
>> }
>> 
>> -/* Sync real dirty inodes in upper filesystem (if it exists) */
>> -static int ovl_sync_fs(struct super_block *sb, int wait)
>> -{
>> -       struct ovl_fs *ofs = sb->s_fs_info;
>> -       struct super_block *upper_sb;
>> -       int ret;
>> -
>> -       if (!ofs->upper_mnt)
>> -               return 0;
>> -
>> -       /*
>> -        * If this is a sync(2) call or an emergency sync, all the super blocks
>> -        * will be iterated, including upper_sb, so no need to do anything.
>> -        *
>> -        * If this is a syncfs(2) call, then we do need to call
>> -        * sync_filesystem() on upper_sb, but enough if we do it when being
>> -        * called with wait == 1.
>> -        */
>> -       if (!wait)
>> -               return 0;
>> -
>> -       upper_sb = ofs->upper_mnt->mnt_sb;
>> -
>> -       down_read(&upper_sb->s_umount);
>> -       ret = sync_filesystem(upper_sb);
>> -       up_read(&upper_sb->s_umount);
>> -
>> -       return ret;
>> -}
>> -
>> /**
>>  * ovl_statfs
>>  * @sb: The overlayfs super block
>> @@ -354,10 +325,18 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>>        return 0;
>> }
>> 
>> +static int ovl_drop_inode(struct inode *inode)
>> +{
>> +       if (ovl_inode_upper(inode))
>> +               ovl_set_flag(OVL_EVICT_PENDING, inode);
>> +       return 1;
>> +}
>> +
>> static const struct super_operations ovl_super_operations = {
>>        .alloc_inode    = ovl_alloc_inode,
>>        .destroy_inode  = ovl_destroy_inode,
>> -       .drop_inode     = generic_delete_inode,
>> +       .drop_inode     = ovl_drop_inode,
>> +       .evict_inode    = ovl_evict_inode,
>>        .put_super      = ovl_put_super,
>>        .sync_fs        = ovl_sync_fs,
>>        .statfs         = ovl_statfs,
>> @@ -1202,6 +1181,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>>        if (!ofs)
>>                goto out;
>> 
>> +       spin_lock_init(&ofs->upper_inodes_lock);
>> +       INIT_LIST_HEAD(&ofs->upper_inodes);
>> +
>>        ofs->creator_cred = cred = prepare_creds();
>>        if (!cred)
>>                goto out_err;
>> diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
>> new file mode 100644
>> index 0000000..e239b0a
>> --- /dev/null
>> +++ b/fs/overlayfs/sync.c
>> @@ -0,0 +1,256 @@
>> +/*
>> + * Copyright (C) 2018 Meili Inc.
>> + * Author Chengguang Xu <cgxu519@xxxxxxxxxx>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/fs.h>
>> +#include <linux/xattr.h>
>> +#include <linux/mount.h>
>> +#include <linux/writeback.h>
>> +#include <linux/blkdev.h>
>> +#include "overlayfs.h"
>> +
>> +void ovl_evict_inode(struct inode *inode)
>> +{
>> +       if (ovl_inode_upper(inode)) {
>> +               write_inode_now(ovl_inode_upper(inode), 1);
>> +               ovl_detach_upper_inodes_list(inode);
>> +
>> +               /*
>> +                * ovl_sync_inodes() may wait until
>> +                * flag OVL_EVICT_PENDING to be cleared.
>> +                */
>> +               spin_lock(&inode->i_lock);
>> +               ovl_clear_flag(OVL_EVICT_PENDING, inode);
>> +               /* See also atomic_bitops.txt */
>> +               smp_mb__after_atomic();
>> +               wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
>> +               spin_unlock(&inode->i_lock);
>> +               clear_inode(inode);
>> +       }
>> +}
>> +
>> +/**
>> + * ovl_sync_inodes
>> + * @sb: The overlayfs super block
>> + *
>> + * upper_inodes list is used for organizing ovl inode which has upper inode,
>> + * by iterating the list to looking for and syncing dirty upper inode.
>> + *
>> + * When starting syncing inode, we add the inode to wait list explicitly,
>> + * in order to save memory we reuse inode->i_wb_list and sb->s_inodes_wb,
>> + * so those fields have slightly differnt meanings in overlayfs.
>> + */
>> +static void ovl_sync_inodes(struct super_block *sb)
>> +{
>> +       struct ovl_fs *ofs = sb->s_fs_info;
>> +       struct ovl_inode *oi, *oi_prev;
>> +       struct inode *inode, *iput_inode = NULL;
>> +       struct inode *upper_inode;
>> +       struct blk_plug plug;
>> +
>> +       struct writeback_control wbc_for_sync = {
>> +               .sync_mode              = WB_SYNC_ALL,
>> +               .for_sync               = 1,
>> +               .range_start            = 0,
>> +               .range_end              = LLONG_MAX,
>> +               .nr_to_write            = LONG_MAX,
>> +       };
>> +
>> +       blk_start_plug(&plug);
>> +       spin_lock(&ofs->upper_inodes_lock);
>> +       list_for_each_entry(oi, &ofs->upper_inodes, upper_inodes_list) {
>> +               inode = &oi->vfs_inode;
>> +               spin_lock(&inode->i_lock);
>> +
>> +               if (inode->i_state & I_NEW) {
>> +                       spin_unlock(&inode->i_lock);
>> +                       continue;
>> +               }
>> +
>> +               /*
>> +                * If ovl indoe state is I_WILL_FREE or I_FREEING,
>> +                * sync operation will be done in the ovl_evict_inode(),
>> +                * so wait here until OVL_EVICT_PENDING flag to be cleared.
>> +                */
>> +               if (inode->i_state & (I_WILL_FREE | I_FREEING)) {
>> +                       oi_prev = list_entry(oi->upper_inodes_list.prev,
>> +                                       struct ovl_inode, upper_inodes_list);
>> +
>> +                       if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
>> +                               DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
>> +                               wait_queue_head_t *wqh;
>> +                               bool sleep;
>> +
>> +                               wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
>> +                               prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
>> +                               sleep = ovl_test_flag(OVL_EVICT_PENDING, inode);
>> +                               spin_unlock(&inode->i_lock);
>> +                               spin_unlock(&ofs->upper_inodes_lock);
>> +                               if (sleep)
>> +                                       schedule();
>> +                               finish_wait(wqh, &wait.wq_entry);
> 
> Please move all this to helper ovl_wait_evict_pending()
> 
> 
> 
>> +                               /* We need to do this as 'inode' can be freed by now */
> 
> Could oi_prev have been freed or removed from upper_inodes list?
> I don't see why not.
> You could try to continue iteration from iput_inode if it is not NULL,
> but what if it is NULL?

When iterating the list we either get reference or wait until the item gets deleted,
so I think there is no chance let previous inode becomes NULL.


> Maybe instead of iput_inode = NULL when adding to inodes_wb list
> get another reference to inode and always store iput_inode = inode
> to use as prev in this case?

Maybe it’s better than current, but I think it’s more complex to understand
the behavior, so please let me wait for more comments for the approach.


> 
> It would be good if you wrote the "life cycle" of ovl_inode in a
> header comment to
> this file because it has become non-trivial to follow.
> An explicit mention of locking order is also very good to have in
> header comment.
> 
> 

Let me add in next version.

Talking about life cycle of ovl_inode, currently I only mark OVL_EVICT_EXPENDING
for ovl_inode which has upper inode, is it acceptable? 

Thanks,
Chengguang.

--
To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Filesystems Devel]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux