Hello Tejun On 14/11/2022 4:30 am, Imran Khan wrote: [...] > > Below is reasoning and data for my experiments with other approaches. > > Since hashed kernfs_rwsem approach has been encountering problems in addressing > some corner cases, I am thinking if some alternative approach can be taken here > to keep kernfs_rwsem global, but replace its usage at some places with > alternative global/hashed rwsems. > > For example from the current kernfs code we can see that most of the contention > towards kernfs_rwsem is observed in down_read/up_read emanating from > kernfs_iop_permission and kernfs_dop_revalidate: > > - 39.16% 38.98% showgids [kernel.kallsyms] [k] down_read > 38.98% __libc_start_main > __open_nocancel > entry_SYSCALL_64_after_hwframe > do_syscall_64 > sys_open > do_sys_open > do_filp_open > - path_openat > - 36.54% link_path_walk > - 20.23% inode_permission > - __inode_permission > - 20.22% kernfs_iop_permission > down_read > - 15.06% walk_component > lookup_fast > d_revalidate.part.24 > kernfs_dop_revalidate > down_read > - 1.25% kernfs_iop_get_link > down_read > - 1.25% may_open > inode_permission > __inode_permission > kernfs_iop_permission > down_read > - 1.20% lookup_fast > d_revalidate.part.24 > kernfs_dop_revalidate > down_read > - 28.96% 28.83% showgids [kernel.kallsyms] [k] up_read > 28.83% __libc_start_main > __open_nocancel > entry_SYSCALL_64_after_hwframe > do_syscall_64 > sys_open > do_sys_open > do_filp_open > - path_openat > - 28.42% link_path_walk > - 18.09% inode_permission > - __inode_permission > - 18.07% kernfs_iop_permission > up_read > - 9.08% walk_component > lookup_fast > - d_revalidate.part.24 > - 9.08% kernfs_dop_revalidate > up_read > - 1.25% kernfs_iop_get_link > > In the above snippet down_read/up_read of kernfs_rwsem is taking ~68% of CPU. We > also know that cache line bouncing for kernfs_rwsem is major contributor towards > this overhead because as per [2], changing kernfs_rwsem to a per-cpu > kernfs_rwsem reduced this to a large extent. > > Now kernfs_iop_permission is taking kernfs_rwsem to access inode attributes > which are not accessed in kernfs_dop_revalidate (the other path contending for > kernfs_rwsem). So if we use a separate rwsem for protecting inode attributes we > can see that contention towards kernfs_rwsem greatly reduces. For example using > a global rwsem (kernfs_iattr_rwsem) to protect inode attributes as per following > patch: > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > index 6acd9c3d4cff..f185427131f9 100644 > --- a/fs/kernfs/dir.c > +++ b/fs/kernfs/dir.c > @@ -757,11 +757,13 @@ int kernfs_add_one(struct kernfs_node *kn) > goto out_unlock; > > /* Update timestamps on the parent */ > + down_write(&root->kernfs_iattr_rwsem); > ps_iattr = parent->iattr; > if (ps_iattr) { > ktime_get_real_ts64(&ps_iattr->ia_ctime); > ps_iattr->ia_mtime = ps_iattr->ia_ctime; > } > + up_write(&root->kernfs_iattr_rwsem); > > up_write(&root->kernfs_rwsem); > > @@ -1442,10 +1444,12 @@ static void __kernfs_remove(struct kernfs_node *kn) > pos->parent ? pos->parent->iattr : NULL; > > /* update timestamps on the parent */ > + down_write(&root->kernfs_iattr_rwsem); > if (ps_iattr) { > ktime_get_real_ts64(&ps_iattr->ia_ctime); > ps_iattr->ia_mtime = ps_iattr->ia_ctime; > } > + up_write(&root->kernfs_iattr_rwsem); > > kernfs_put(pos); > } > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > index 74f3453f4639..1b8bffc6d2d3 100644 > --- a/fs/kernfs/inode.c > +++ b/fs/kernfs/inode.c > @@ -101,9 +101,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct > iattr *iattr) > int ret; > struct kernfs_root *root = kernfs_root(kn); > > - down_write(&root->kernfs_rwsem); > + down_write(&root->kernfs_iattr_rwsem); > ret = __kernfs_setattr(kn, iattr); > - up_write(&root->kernfs_rwsem); > + up_write(&root->kernfs_iattr_rwsem); > return ret; > } > > @@ -119,7 +119,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, > struct dentry *dentry, > return -EINVAL; > > root = kernfs_root(kn); > - down_write(&root->kernfs_rwsem); > + down_write(&root->kernfs_iattr_rwsem); > error = setattr_prepare(&init_user_ns, dentry, iattr); > if (error) > goto out; > @@ -132,7 +132,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, > struct dentry *dentry, > setattr_copy(&init_user_ns, inode, iattr); > > out: > - up_write(&root->kernfs_rwsem); > + up_write(&root->kernfs_iattr_rwsem); > return error; > } > @@ -189,10 +189,10 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns, > struct kernfs_node *kn = inode->i_private; > struct kernfs_root *root = kernfs_root(kn); > > - down_read(&root->kernfs_rwsem); > + down_read(&root->kernfs_iattr_rwsem); > kernfs_refresh_inode(kn, inode); > generic_fillattr(&init_user_ns, inode, stat); > - up_read(&root->kernfs_rwsem); > + up_read(&root->kernfs_iattr_rwsem); > > return 0; > } > > @@ -285,10 +285,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns, > kn = inode->i_private; > root = kernfs_root(kn); > > - down_read(&root->kernfs_rwsem); > + down_read(&root->kernfs_iattr_rwsem); > kernfs_refresh_inode(kn, inode); > ret = generic_permission(&init_user_ns, inode, mask); > - up_read(&root->kernfs_rwsem); > + up_read(&root->kernfs_iattr_rwsem); > > return ret; > } > > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h > index fc5821effd97..4620b74f44b0 100644 > --- a/fs/kernfs/kernfs-internal.h > +++ b/fs/kernfs/kernfs-internal.h > @@ -47,6 +47,7 @@ struct kernfs_root { > > wait_queue_head_t deactivate_waitq; > struct rw_semaphore kernfs_rwsem; > + struct rw_semaphore kernfs_iattr_rwsem; > }; > > > > greatly reduces the CPU usage seen earlier in down_read/up_read: > > > - 13.08% 13.02% showgids [kernel.kallsyms] [k] down_read > 13.02% __libc_start_main > __open_nocancel > entry_SYSCALL_64_after_hwframe > do_syscall_64 > sys_open > do_sys_open > do_filp_open > - path_openat > - 12.18% link_path_walk > - 9.44% inode_permission > - __inode_permission > - 9.43% kernfs_iop_permission > down_read > - 2.53% walk_component > lookup_fast > d_revalidate.part.24 > kernfs_dop_revalidate > down_read > + 0.62% may_open > - 13.03% 12.97% showgids [kernel.kallsyms] [k] up_read > 12.97% __libc_start_main > __open_nocancel > entry_SYSCALL_64_after_hwframe > do_syscall_64 > sys_open > do_sys_open > do_filp_open > - path_openat > - 12.86% link_path_walk > - 11.55% inode_permission > - __inode_permission > - 11.54% kernfs_iop_permission > up_read > - 1.06% walk_component > lookup_fast > - d_revalidate.part.24 > - 1.06% kernfs_dop_revalidate > > As can be seen down_read/up_read CPU usage is ~26% (compared to ~68% of default > case). > > Further using a hashed rwsem for protecting inode attributes as per following patch: > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c > index 6acd9c3d4cff..dfc0d2167d86 100644 > --- a/fs/kernfs/dir.c > +++ b/fs/kernfs/dir.c > @@ -734,6 +734,7 @@ int kernfs_add_one(struct kernfs_node *kn) > struct kernfs_iattrs *ps_iattr; > bool has_ns; > int ret; > + struct rw_semaphore *rwsem; > > down_write(&root->kernfs_rwsem); > > @@ -757,11 +758,13 @@ int kernfs_add_one(struct kernfs_node *kn) > goto out_unlock; > > /* Update timestamps on the parent */ > + rwsem = kernfs_iattr_down_write(kn); > ps_iattr = parent->iattr; > if (ps_iattr) { > ktime_get_real_ts64(&ps_iattr->ia_ctime); > ps_iattr->ia_mtime = ps_iattr->ia_ctime; > } > + kernfs_iattr_up_write(rwsem, kn); > > up_write(&root->kernfs_rwsem); > > @@ -1443,8 +1446,10 @@ static void __kernfs_remove(struct kernfs_node *kn) > > /* update timestamps on the parent */ > if (ps_iattr) { > + struct rw_semaphore *rwsem = > kernfs_iattr_down_write(pos->parent); > ktime_get_real_ts64(&ps_iattr->ia_ctime); > ps_iattr->ia_mtime = ps_iattr->ia_ctime; > + kernfs_iattr_up_write(rwsem, kn); > } > > kernfs_put(pos); > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > index 74f3453f4639..2b3cd5a9464f 100644 > --- a/fs/kernfs/inode.c > +++ b/fs/kernfs/inode.c > @@ -99,11 +99,12 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct > iattr *iattr) > int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr) > { > int ret; > + struct rw_semaphore *rwsem; > struct kernfs_root *root = kernfs_root(kn); > > - down_write(&root->kernfs_rwsem); > + rwsem = kernfs_iattr_down_write(kn); > ret = __kernfs_setattr(kn, iattr); > - up_write(&root->kernfs_rwsem); > + kernfs_iattr_up_write(rwsem, kn); > return ret; > } > > @@ -114,12 +115,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, > struct dentry *dentry, > struct kernfs_node *kn = inode->i_private; > struct kernfs_root *root; > int error; > + struct rw_semaphore *rwsem; > > if (!kn) > return -EINVAL; > > root = kernfs_root(kn); > - down_write(&root->kernfs_rwsem); > + rwsem = kernfs_iattr_down_write(kn); > error = setattr_prepare(&init_user_ns, dentry, iattr); > if (error) > goto out; > @@ -132,7 +134,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, > struct dentry *dentry, > setattr_copy(&init_user_ns, inode, iattr); > > out: > - up_write(&root->kernfs_rwsem); > + kernfs_iattr_up_write(rwsem, kn); > return error; > } > > @@ -188,11 +190,12 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns, > struct inode *inode = d_inode(path->dentry); > struct kernfs_node *kn = inode->i_private; > struct kernfs_root *root = kernfs_root(kn); > + struct rw_semaphore *rwsem; > > - down_read(&root->kernfs_rwsem); > + rwsem = kernfs_iattr_down_read(kn); > kernfs_refresh_inode(kn, inode); > generic_fillattr(&init_user_ns, inode, stat); > - up_read(&root->kernfs_rwsem); > + kernfs_iattr_up_read(rwsem, kn); > > return 0; > } > @@ -278,6 +281,7 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns, > struct kernfs_node *kn; > struct kernfs_root *root; > int ret; > + struct rw_semaphore *rwsem; > > if (mask & MAY_NOT_BLOCK) > return -ECHILD; > @@ -285,10 +289,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns, > kn = inode->i_private; > root = kernfs_root(kn); > > - down_read(&root->kernfs_rwsem); > + rwsem = kernfs_iattr_down_read(kn); > kernfs_refresh_inode(kn, inode); > ret = generic_permission(&init_user_ns, inode, mask); > - up_read(&root->kernfs_rwsem); > + kernfs_iattr_up_read(rwsem, kn); > > return ret; > } > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h > index fc5821effd97..bd1ecd126395 100644 > --- a/fs/kernfs/kernfs-internal.h > +++ b/fs/kernfs/kernfs-internal.h > @@ -169,4 +169,53 @@ extern const struct inode_operations kernfs_symlink_iops; > * kernfs locks > */ > extern struct kernfs_global_locks *kernfs_locks; > + > +static inline struct rw_semaphore *kernfs_iattr_rwsem_ptr(struct kernfs_node *kn) > +{ > + int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS); > + > + return &kernfs_locks->iattr_rwsem[idx]; > +} > + > +static inline struct rw_semaphore *kernfs_iattr_down_write(struct kernfs_node *kn) > +{ > + struct rw_semaphore *rwsem; > + > + kernfs_get(kn); > + > + rwsem = kernfs_iattr_rwsem_ptr(kn); > + > + down_write(rwsem); > + > + return rwsem; > +} > + > +static inline void kernfs_iattr_up_write(struct rw_semaphore *rwsem, > + struct kernfs_node *kn) > +{ > + up_write(rwsem); > + > + kernfs_put(kn); > +} > + > + > +static inline struct rw_semaphore *kernfs_iattr_down_read(struct kernfs_node *kn) > +{ > + struct rw_semaphore *rwsem; > + > + kernfs_get(kn); > + > + rwsem = kernfs_iattr_rwsem_ptr(kn); > + > + down_read(rwsem); > + > + return rwsem; > +} > + > +static inline void kernfs_iattr_up_read(struct rw_semaphore *rwsem, > + struct kernfs_node *kn) > +{ > + up_read(rwsem); > + > + kernfs_put(kn); > +} > #endif /* __KERNFS_INTERNAL_H */ > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > index d0859f72d2d6..f282e5d65d04 100644 > --- a/fs/kernfs/mount.c > +++ b/fs/kernfs/mount.c > @@ -392,8 +392,10 @@ static void __init kernfs_mutex_init(void) > { > int count; > > - for (count = 0; count < NR_KERNFS_LOCKS; count++) > + for (count = 0; count < NR_KERNFS_LOCKS; count++) { > mutex_init(&kernfs_locks->open_file_mutex[count]); > + init_rwsem(&kernfs_locks->iattr_rwsem[count]); > + } > } > > static void __init kernfs_lock_init(void) > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h > index 73f5c120def8..fcbf5e7c921c 100644 > --- a/include/linux/kernfs.h > +++ b/include/linux/kernfs.h > @@ -89,6 +89,7 @@ struct kernfs_iattrs; > */ > struct kernfs_global_locks { > struct mutex open_file_mutex[NR_KERNFS_LOCKS]; > + struct rw_semaphore iattr_rwsem[NR_KERNFS_LOCKS]; > }; > > enum kernfs_node_type { > > > gives further improvement in CPU usage: > > - 8.26% 8.22% showgids [kernel.kallsyms] [k] down_read > 8.19% __libc_start_main > __open_nocancel > entry_SYSCALL_64_after_hwframe > do_syscall_64 > sys_open > do_sys_open > do_filp_open > - path_openat > - 7.59% link_path_walk > - 6.66% walk_component > lookup_fast > - d_revalidate.part.24 > - 6.66% kernfs_dop_revalidate > down_read > - 0.71% kernfs_iop_get_link > down_read > - 0.58% lookup_fast > d_revalidate.part.24 > kernfs_dop_revalidate > down_read > - 7.44% 7.41% showgids [kernel.kallsyms] [k] up_read > 7.39% __libc_start_main > __open_nocancel > entry_SYSCALL_64_after_hwframe > do_syscall_64 > sys_open > do_sys_open > do_filp_open > - path_openat > - 7.36% link_path_walk > - 6.45% walk_component > lookup_fast > d_revalidate.part.24 > kernfs_dop_revalidate > up_read > > In above snippet CPU usage in down_read/up_read has gone down to ~16% > > So do you think that rather than replacing global kernfs_rwsem with a hashed one > , any of the above mentioned 2 patches (1. Use a global rwsem for protecting > inode attributes or 2. Use a hashed rwsem for protecting inode attributes) > can be used. These patches are not breaking code paths involving multiple nodes > that currently use global kernfs_rwsem. > With hashed kernfs_rwsem I guess there will always be a risk of some corner case > getting missed. > > If any of these approaches are acceptable, I can send the patch for review along > with other changes of this series > Could you please share your feedback about the above mentioned 2 approaches to reduce contention around global kernfs_rwsem ? Also the first 2 patches of this series [1] are not dealing specifically with with kernfs_rwsem, so could you please share your feedback about those 2 as well. Thanks, -- Imran [1]: https://lore.kernel.org/lkml/20220810111017.2267160-1-imran.f.khan@xxxxxxxxxx/