On Wed, Jun 24, 2020 at 04:32:28PM +0800, Shaokun Zhang wrote:
> get_file_rcu_many(), which is called by __fget_files(), now uses
> atomic_try_cmpxchg(), which reduces the number of accesses to the shared
> variable and improves the performance of the atomic operation compared
> with atomic_cmpxchg().
>
> __fget_files() checks @f_mode against a mask and performs atomic
> operations on @f_count, but both members sit on the same cacheline.
> When many CPU cores access files concurrently, this causes heavy
> contention on @f_count. Moving the two members onto different
> cachelines relieves that contention.
>
> We have tested this on arm64 and x86; the results are as follows.
> The unixbench System Call Overhead test was run on a Huawei Kunpeng 920
> with this patch:
>
> 24 x System Call Overhead  1
>
> System Call Overhead                    3160841.4 lps   (10.0 s, 1 samples)
>
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    3160841.4   2107.2
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         2107.2
>
> Without this patch:
>
> 24 x System Call Overhead  1
>
> System Call Overhead                    2222456.0 lps   (10.0 s, 1 samples)
>
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    2222456.0   1481.6
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         1481.6
>
> And on an Intel 6248 platform with this patch:
>
> 40 CPUs in system; running 24 parallel copies of tests
>
> System Call Overhead                    4288509.1 lps   (10.0 s, 1 samples)
>
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    4288509.1   2859.0
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         2859.0
>
> Without this patch:
>
> 40 CPUs in system; running 24 parallel copies of tests
>
> System Call Overhead                    3666313.0 lps   (10.0 s, 1 samples)
>
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    3666313.0   2444.2
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         2444.2
>
> Cc: Will Deacon <will@xxxxxxxxxx>
> Cc: Mark Rutland <mark.rutland@xxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
> Signed-off-by: Yuqi Jin <jinyuqi@xxxxxxxxxx>
> Signed-off-by: Shaokun Zhang <zhangshaokun@xxxxxxxxxxxxx>
> ---
>  include/linux/fs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3f881a892ea7..0faeab5622fb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -955,7 +955,6 @@ struct file {
>  	 */
>  	spinlock_t		f_lock;
>  	enum rw_hint		f_write_hint;
> -	atomic_long_t		f_count;
>  	unsigned int		f_flags;
>  	fmode_t			f_mode;
>  	struct mutex		f_pos_lock;
> @@ -979,6 +978,7 @@ struct file {
>  	struct address_space	*f_mapping;
>  	errseq_t		f_wb_err;
>  	errseq_t		f_sb_err; /* for syncfs */
> +	atomic_long_t		f_count;
>  } __randomize_layout
>    __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

Hmm. So the microbenchmark numbers look lovely, but:

  - What impact does it actually have for real workloads?

  - How do we avoid regressing performance by innocently changing the
    struct again later on?

  - This thing is tagged with __randomize_layout, so it doesn't help
    anybody using that crazy plugin.

  - What about all the other atomics and locks that share cachelines?

Will
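
For reference, the atomic_try_cmpxchg() pattern the quoted commit message
alludes to: on failure, try_cmpxchg writes the value it actually observed
back into the expected-value argument, so a retry loop needs only one
explicit atomic_long_read() up front rather than one per iteration, as a
plain atomic_cmpxchg() loop would. A minimal sketch of an add-unless-zero
refcount grab built that way (an illustration only, not the exact
fs/file.c code; get_file_rcu_many_sketch is a made-up name):

#include <linux/atomic.h>
#include <linux/fs.h>

static inline bool get_file_rcu_many_sketch(struct file *f, long refs)
{
	long old = atomic_long_read(&f->f_count);

	do {
		/* A zero count means the file is already being freed. */
		if (!old)
			return false;
		/* On failure, try_cmpxchg updates @old to the observed value. */
	} while (!atomic_long_try_cmpxchg(&f->f_count, &old, old + refs));

	return true;
}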
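
On the second question above, one hypothetical guard against somebody
innocently moving @f_count back next to @f_mode would be a build-time
assertion on the member offsets. This is only a sketch: it assumes
L1_CACHE_BYTES is the relevant granule, it checks offsets within the
struct rather than the alignment of the allocation itself (which depends
on the filp slab cache), and it is meaningless under __randomize_layout,
where the offsets are shuffled per build anyway:

#include <linux/build_bug.h>
#include <linux/cache.h>
#include <linux/fs.h>
#include <linux/stddef.h>

/*
 * Fail the build if f_count ever shares an L1_CACHE_BYTES-sized line
 * with f_mode again. Assumes a fixed (non-randomized) struct layout.
 */
static_assert(offsetof(struct file, f_count) / L1_CACHE_BYTES !=
	      offsetof(struct file, f_mode) / L1_CACHE_BYTES,
	      "f_count and f_mode share a cacheline");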
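
An alternative that does not depend on declaration order at all would be
to annotate the counter itself with ____cacheline_aligned_in_smp from
<linux/cache.h>, which pads the member onto its own cacheline wherever it
ends up; the attribute travels with the member, so it should also survive
later reshuffling. The cost is up to a cacheline of extra padding per
struct file, which would want measuring before anyone proposed it
seriously. Illustrative only (struct file_like is a stand-in, not the
real struct file):

#include <linux/atomic.h>
#include <linux/cache.h>
#include <linux/spinlock.h>

struct file_like {
	spinlock_t	lock;		/* read-mostly fields ... */
	unsigned int	mode;
	/*
	 * The alignment attribute keeps the hot refcount on its own
	 * cacheline regardless of what precedes it; the struct grows
	 * by the inserted padding.
	 */
	atomic_long_t	count ____cacheline_aligned_in_smp;
};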