Re: [RESEND PATCH 4/5] kernfs: Replace per-fs rwsem with hashed rwsems.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello Tejun

On 14/11/2022 4:30 am, Imran Khan wrote:

[...]
> 
> Below is reasoning and data for my experiments with other approaches.
> 
> Since hashed kernfs_rwsem approach has been encountering problems in addressing
> some corner cases, I am thinking if some alternative approach can be taken here
> to keep kernfs_rwsem global, but replace its usage at some places with
> alternative global/hashed rwsems.
> 
> For example from the current kernfs code we can see that most of the contention
> towards kernfs_rwsem is observed in down_read/up_read emanating from
> kernfs_iop_permission and kernfs_dop_revalidate:
> 
> 	-   39.16%    38.98%  showgids     [kernel.kallsyms]      [k] down_read
>              38.98% __libc_start_main
>                 __open_nocancel
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 sys_open
>                 do_sys_open
>                 do_filp_open
>               - path_openat
>                  - 36.54% link_path_walk
>                     - 20.23% inode_permission
>                        - __inode_permission
>                           - 20.22% kernfs_iop_permission
>                                down_read
>                     - 15.06% walk_component
>                          lookup_fast
>                          d_revalidate.part.24
>                          kernfs_dop_revalidate
>                          down_read
>                     - 1.25% kernfs_iop_get_link
>                          down_read
>                  - 1.25% may_open
>                       inode_permission
>                       __inode_permission
>                       kernfs_iop_permission
>                       down_read
>                  - 1.20% lookup_fast
>                       d_revalidate.part.24
>                       kernfs_dop_revalidate
>                       down_read
> 	-   28.96%    28.83%  showgids     [kernel.kallsyms]       [k] up_read
>              28.83% __libc_start_main
>                 __open_nocancel
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 sys_open
>                 do_sys_open
>                 do_filp_open
>               - path_openat
>                  - 28.42% link_path_walk
>                     - 18.09% inode_permission
>                        - __inode_permission
>                           - 18.07% kernfs_iop_permission
>                                up_read
>                     - 9.08% walk_component
>                          lookup_fast
>                        - d_revalidate.part.24
>                           - 9.08% kernfs_dop_revalidate
>                                up_read
>                     - 1.25% kernfs_iop_get_link
> 
> In the above snippet down_read/up_read of kernfs_rwsem is taking ~68% of CPU. We
> also know that cache line bouncing for kernfs_rwsem is major contributor towards
> this overhead because as per [2], changing kernfs_rwsem to a per-cpu
> kernfs_rwsem reduced this to a large extent.
> 
> Now kernfs_iop_permission is taking kernfs_rwsem to access inode attributes
> which are not accessed in kernfs_dop_revalidate (the other path contending for
> kernfs_rwsem). So if we use a separate rwsem for protecting inode attributes we
> can see that contention towards kernfs_rwsem greatly reduces. For example using
> a global rwsem (kernfs_iattr_rwsem) to protect inode attributes as per following
> patch:
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 6acd9c3d4cff..f185427131f9 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -757,11 +757,13 @@ int kernfs_add_one(struct kernfs_node *kn)
>                 goto out_unlock;
> 
>         /* Update timestamps on the parent */
> +       down_write(&root->kernfs_iattr_rwsem);
>         ps_iattr = parent->iattr;
>         if (ps_iattr) {
>                 ktime_get_real_ts64(&ps_iattr->ia_ctime);
>                 ps_iattr->ia_mtime = ps_iattr->ia_ctime;
>         }
> +       up_write(&root->kernfs_iattr_rwsem);
> 
>         up_write(&root->kernfs_rwsem);
> 
> @@ -1442,10 +1444,12 @@ static void __kernfs_remove(struct kernfs_node *kn)
>                                 pos->parent ? pos->parent->iattr : NULL;
> 
>                         /* update timestamps on the parent */
> +                       down_write(&root->kernfs_iattr_rwsem);
>                         if (ps_iattr) {
>                                 ktime_get_real_ts64(&ps_iattr->ia_ctime);
>                                 ps_iattr->ia_mtime = ps_iattr->ia_ctime;
>                         }
> +                       up_write(&root->kernfs_iattr_rwsem);
> 
>                         kernfs_put(pos);
>                 }
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index 74f3453f4639..1b8bffc6d2d3 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -101,9 +101,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct
> iattr *iattr)
>         int ret;
>         struct kernfs_root *root = kernfs_root(kn);
> 
> -       down_write(&root->kernfs_rwsem);
> +       down_write(&root->kernfs_iattr_rwsem);
>         ret = __kernfs_setattr(kn, iattr);
> -       up_write(&root->kernfs_rwsem);
> +       up_write(&root->kernfs_iattr_rwsem);
>         return ret;
>  }
> 
> @@ -119,7 +119,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
>                 return -EINVAL;
> 
>         root = kernfs_root(kn);
> -       down_write(&root->kernfs_rwsem);
> +       down_write(&root->kernfs_iattr_rwsem);
>         error = setattr_prepare(&init_user_ns, dentry, iattr);
>         if (error)
>                 goto out;
> @@ -132,7 +132,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
>         setattr_copy(&init_user_ns, inode, iattr);
> 
>  out:
> -       up_write(&root->kernfs_rwsem);
> +       up_write(&root->kernfs_iattr_rwsem);
>         return error;
>  }
> @@ -189,10 +189,10 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
>         struct kernfs_node *kn = inode->i_private;
>         struct kernfs_root *root = kernfs_root(kn);
> 
> -       down_read(&root->kernfs_rwsem);
> +       down_read(&root->kernfs_iattr_rwsem);
>         kernfs_refresh_inode(kn, inode);
>         generic_fillattr(&init_user_ns, inode, stat);
> -       up_read(&root->kernfs_rwsem);
> +       up_read(&root->kernfs_iattr_rwsem);
> 
>         return 0;
>  }
> 
> @@ -285,10 +285,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
>         kn = inode->i_private;
>         root = kernfs_root(kn);
> 
> -       down_read(&root->kernfs_rwsem);
> +       down_read(&root->kernfs_iattr_rwsem);
>         kernfs_refresh_inode(kn, inode);
>         ret = generic_permission(&init_user_ns, inode, mask);
> -       up_read(&root->kernfs_rwsem);
> +       up_read(&root->kernfs_iattr_rwsem);
> 
>         return ret;
>  }
> 
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index fc5821effd97..4620b74f44b0 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -47,6 +47,7 @@ struct kernfs_root {
> 
>         wait_queue_head_t       deactivate_waitq;
>         struct rw_semaphore     kernfs_rwsem;
> +       struct rw_semaphore     kernfs_iattr_rwsem;
>  };
> 
> 
> 
> greatly reduces the CPU usage seen earlier in down_read/up_read:
> 
> 
>  -   13.08%    13.02%  showgids       [kernel.kallsyms]       [k] down_read
>              13.02% __libc_start_main
>                 __open_nocancel
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 sys_open
>                 do_sys_open
>                 do_filp_open
>               - path_openat
>                  - 12.18% link_path_walk
>                     - 9.44% inode_permission
>                        - __inode_permission
>                           - 9.43% kernfs_iop_permission
>                                down_read
>                     - 2.53% walk_component
>                          lookup_fast
>                          d_revalidate.part.24
>                          kernfs_dop_revalidate
>                          down_read
>                  + 0.62% may_open
>         -   13.03%    12.97%  showgids       [kernel.kallsyms]      [k] up_read
>              12.97% __libc_start_main
>                 __open_nocancel
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 sys_open
>                 do_sys_open
>                 do_filp_open
>               - path_openat
>                  - 12.86% link_path_walk
>                     - 11.55% inode_permission
>                        - __inode_permission
>                           - 11.54% kernfs_iop_permission
>                                up_read
>                     - 1.06% walk_component
>                          lookup_fast
>                        - d_revalidate.part.24
>                           - 1.06% kernfs_dop_revalidate
> 
> As can be seen down_read/up_read CPU usage is ~26% (compared to ~68% of default
> case).
> 
> Further using a hashed rwsem for protecting inode attributes as per following patch:
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 6acd9c3d4cff..dfc0d2167d86 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -734,6 +734,7 @@ int kernfs_add_one(struct kernfs_node *kn)
>         struct kernfs_iattrs *ps_iattr;
>         bool has_ns;
>         int ret;
> +       struct rw_semaphore *rwsem;
> 
>         down_write(&root->kernfs_rwsem);
> 
> @@ -757,11 +758,13 @@ int kernfs_add_one(struct kernfs_node *kn)
>                 goto out_unlock;
> 
>         /* Update timestamps on the parent */
> +       rwsem = kernfs_iattr_down_write(kn);
>         ps_iattr = parent->iattr;
>         if (ps_iattr) {
>                 ktime_get_real_ts64(&ps_iattr->ia_ctime);
>                 ps_iattr->ia_mtime = ps_iattr->ia_ctime;
>         }
> +       kernfs_iattr_up_write(rwsem, kn);
> 
>         up_write(&root->kernfs_rwsem);
> 
> @@ -1443,8 +1446,10 @@ static void __kernfs_remove(struct kernfs_node *kn)
> 
>                         /* update timestamps on the parent */
>                         if (ps_iattr) {
> +                               struct rw_semaphore *rwsem =
> kernfs_iattr_down_write(pos->parent);
>                                 ktime_get_real_ts64(&ps_iattr->ia_ctime);
>                                 ps_iattr->ia_mtime = ps_iattr->ia_ctime;
> +                               kernfs_iattr_up_write(rwsem, kn);
>                         }
> 
>                         kernfs_put(pos);
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index 74f3453f4639..2b3cd5a9464f 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -99,11 +99,12 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct
> iattr *iattr)
>  int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
>  {
>         int ret;
> +       struct rw_semaphore *rwsem;
>         struct kernfs_root *root = kernfs_root(kn);
> 
> -       down_write(&root->kernfs_rwsem);
> +       rwsem = kernfs_iattr_down_write(kn);
>         ret = __kernfs_setattr(kn, iattr);
> -       up_write(&root->kernfs_rwsem);
> +       kernfs_iattr_up_write(rwsem, kn);
>         return ret;
>  }
> 
> @@ -114,12 +115,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
>         struct kernfs_node *kn = inode->i_private;
>         struct kernfs_root *root;
>         int error;
> +       struct rw_semaphore *rwsem;
> 
>         if (!kn)
>                 return -EINVAL;
> 
>         root = kernfs_root(kn);
> -       down_write(&root->kernfs_rwsem);
> +       rwsem = kernfs_iattr_down_write(kn);
>         error = setattr_prepare(&init_user_ns, dentry, iattr);
>         if (error)
>                 goto out;
> @@ -132,7 +134,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
>         setattr_copy(&init_user_ns, inode, iattr);
> 
>  out:
> -       up_write(&root->kernfs_rwsem);
> +       kernfs_iattr_up_write(rwsem, kn);
>         return error;
>  }
> 
> @@ -188,11 +190,12 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
>         struct inode *inode = d_inode(path->dentry);
>         struct kernfs_node *kn = inode->i_private;
>         struct kernfs_root *root = kernfs_root(kn);
> +       struct rw_semaphore *rwsem;
> 
> -       down_read(&root->kernfs_rwsem);
> +       rwsem = kernfs_iattr_down_read(kn);
>         kernfs_refresh_inode(kn, inode);
>         generic_fillattr(&init_user_ns, inode, stat);
> -       up_read(&root->kernfs_rwsem);
> +       kernfs_iattr_up_read(rwsem, kn);
> 
>         return 0;
>  }
> @@ -278,6 +281,7 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
>         struct kernfs_node *kn;
>         struct kernfs_root *root;
>         int ret;
> +       struct rw_semaphore *rwsem;
> 
>         if (mask & MAY_NOT_BLOCK)
>                 return -ECHILD;
> @@ -285,10 +289,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
>         kn = inode->i_private;
>         root = kernfs_root(kn);
> 
> -       down_read(&root->kernfs_rwsem);
> +       rwsem = kernfs_iattr_down_read(kn);
>         kernfs_refresh_inode(kn, inode);
>         ret = generic_permission(&init_user_ns, inode, mask);
> -       up_read(&root->kernfs_rwsem);
> +       kernfs_iattr_up_read(rwsem, kn);
> 
>         return ret;
>  }
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index fc5821effd97..bd1ecd126395 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -169,4 +169,53 @@ extern const struct inode_operations kernfs_symlink_iops;
>   * kernfs locks
>   */
>  extern struct kernfs_global_locks *kernfs_locks;
> +
> +static inline struct rw_semaphore *kernfs_iattr_rwsem_ptr(struct kernfs_node *kn)
> +{
> +       int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +
> +       return &kernfs_locks->iattr_rwsem[idx];
> +}
> +
> +static inline struct rw_semaphore *kernfs_iattr_down_write(struct kernfs_node *kn)
> +{
> +       struct rw_semaphore *rwsem;
> +
> +       kernfs_get(kn);
> +
> +       rwsem = kernfs_iattr_rwsem_ptr(kn);
> +
> +       down_write(rwsem);
> +
> +       return rwsem;
> +}
> +
> +static inline void kernfs_iattr_up_write(struct rw_semaphore *rwsem,
> +                                        struct kernfs_node *kn)
> +{
> +       up_write(rwsem);
> +
> +       kernfs_put(kn);
> +}
> +
> +
> +static inline struct rw_semaphore *kernfs_iattr_down_read(struct kernfs_node *kn)
> +{
> +       struct rw_semaphore *rwsem;
> +
> +       kernfs_get(kn);
> +
> +       rwsem = kernfs_iattr_rwsem_ptr(kn);
> +
> +       down_read(rwsem);
> +
> +       return rwsem;
> +}
> +
> +static inline void kernfs_iattr_up_read(struct rw_semaphore *rwsem,
> +                                       struct kernfs_node *kn)
> +{
> +       up_read(rwsem);
> +
> +       kernfs_put(kn);
> +}
>  #endif /* __KERNFS_INTERNAL_H */
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index d0859f72d2d6..f282e5d65d04 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -392,8 +392,10 @@ static void __init kernfs_mutex_init(void)
>  {
>         int count;
> 
> -       for (count = 0; count < NR_KERNFS_LOCKS; count++)
> +       for (count = 0; count < NR_KERNFS_LOCKS; count++) {
>                 mutex_init(&kernfs_locks->open_file_mutex[count]);
> +               init_rwsem(&kernfs_locks->iattr_rwsem[count]);
> +       }
>  }
> 
>  static void __init kernfs_lock_init(void)
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 73f5c120def8..fcbf5e7c921c 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -89,6 +89,7 @@ struct kernfs_iattrs;
>   */
>  struct kernfs_global_locks {
>         struct mutex open_file_mutex[NR_KERNFS_LOCKS];
> +       struct rw_semaphore iattr_rwsem[NR_KERNFS_LOCKS];
>  };
> 
>  enum kernfs_node_type {
> 
> 
> gives further improvement in CPU usage:
> 
> -    8.26%     8.22%  showgids         [kernel.kallsyms]       [k] down_read
>              8.19% __libc_start_main
>                 __open_nocancel
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 sys_open
>                 do_sys_open
>                 do_filp_open
>               - path_openat
>                  - 7.59% link_path_walk
>                     - 6.66% walk_component
>                          lookup_fast
>                        - d_revalidate.part.24
>                           - 6.66% kernfs_dop_revalidate
>                                down_read
>                     - 0.71% kernfs_iop_get_link
>                          down_read
>                  - 0.58% lookup_fast
>                       d_revalidate.part.24
>                       kernfs_dop_revalidate
>                       down_read
>         -    7.44%     7.41%  showgids         [kernel.kallsyms]       [k] up_read
>              7.39% __libc_start_main
>                 __open_nocancel
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 sys_open
>                 do_sys_open
>                 do_filp_open
>               - path_openat
>                  - 7.36% link_path_walk
>                     - 6.45% walk_component
>                          lookup_fast
>                          d_revalidate.part.24
>                          kernfs_dop_revalidate
>                          up_read
> 
> In above snippet CPU usage in down_read/up_read has gone down to ~16%
> 
> So do you think that rather than replacing global kernfs_rwsem with a hashed one
> , any of the above mentioned 2 patches (1. Use a global rwsem for protecting
> inode attributes or 2. Use a hashed rwsem for protecting inode attributes)
> can be used. These patches are not breaking code paths involving multiple nodes
> that currently use global kernfs_rwsem.
> With hashed kernfs_rwsem I guess there will always be a risk of some corner case
> getting missed.
> 
> If any of these approaches are acceptable, I can send the patch for review along
> with other changes of this series
> 
Could you please share your feedback about the above mentioned 2 approaches to
reduce contention around global kernfs_rwsem ? Also the first 2 patches of this
series [1] are not dealing specifically with with kernfs_rwsem, so could you
please share your feedback about those 2 as well.

Thanks,
 -- Imran

[1]: https://lore.kernel.org/lkml/20220810111017.2267160-1-imran.f.khan@xxxxxxxxxx/



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux