Re: [PATCH] fs.h: Optimize file struct to prevent false sharing

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 1 Jun 2023 08:30:48 +1000

On Wed, May 31, 2023 at 10:31:09AM +0000, Chen, Zhiyin wrote:
> As Eric said, CONFIG_RANDSTRUCT_NONE is set in the default config 
> and some production environments, including Ali Cloud. Therefore, it 
> is worthful to optimize the file struct layout.
> 
> Here are the syscall test results of unixbench.

Results look good, but the devil is in the detail....

> Command: numactl -C 3-18 ./Run -c 16 syscall

The test is restricted to a set of adjacent cores within a single
CPU socket, so all the cachelines are typically being shared within
a single socket's CPU caches. IOWs, the fact there are 224 CPUs in
the machine is largely irrelevant for this microbenchmark.

i.e. is this a microbenchmark that is going faster simply because
the working set for the specific benchmark now fits in L2 or L3
cache when it didn't before?

Does this same result occur for different CPU types, cache sizes
and architectures? What about when the cores used by the benchmark
are spread across multiple sockets so the cost of remote cacheline
access is taken into account? If this is actually a real benefit,
then we should see similar or even larger gains between CPU cores
that are further apart because the cost of false cacheline sharing
is higher in those systems....
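
To be concrete about what "false cacheline sharing" means here, the
following is a minimal userspace sketch, not the actual struct file
layout from the patch. The struct names, field names and the 64-byte
cacheline size are all assumptions made for illustration. It shows a
frequently written field sharing a cacheline with read-mostly fields,
and the same fields with the read-mostly group pushed onto its own
cacheline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 64-byte cachelines (common on x86-64, not universal). */
#define CACHELINE_BYTES 64

/*
 * Hypothetical layout: a hot, frequently written counter sits right
 * next to read-mostly fields, so every write by one CPU invalidates
 * the cacheline that other CPUs are only reading from.
 */
struct demo_packed {
	long refcount;		/* written on every get/put */
	const void *ops;	/* read-mostly */
	unsigned int mode;	/* read-mostly */
};

/* Same fields, but the read-mostly group starts on its own cacheline. */
struct demo_split {
	long refcount;
	_Alignas(CACHELINE_BYTES) const void *ops;
	unsigned int mode;
};

/*
 * True if two byte offsets inside a cacheline-aligned object land in
 * the same cacheline.
 */
static inline bool same_cacheline(size_t off_a, size_t off_b)
{
	return off_a / CACHELINE_BYTES == off_b / CACHELINE_BYTES;
}

/* Compile-time check that the split really moved the field. */
static_assert(offsetof(struct demo_split, ops) == CACHELINE_BYTES,
	      "read-mostly group should start on the second cacheline");
```

Calling same_cacheline() with offsetof() values for refcount and ops
returns true for demo_packed and false for demo_split. Note that this
only describes the layout. It says nothing about whether the split
is a win on a given machine, which is why the cross-socket numbers
and profiles matter.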

> Without patch
> ------------------------
> 224 CPUs in system; running 16 parallel copies of tests
> System Call Overhead                        5611223.7 lps   (10.0 s, 7 samples)
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    5611223.7   3740.8
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         3740.8
> 
> With patch
> ------------------------------------------------------------------------
> 224 CPUs in system; running 16 parallel copies of tests
> System Call Overhead                        7567076.6 lps   (10.0 s, 7 samples)
> System Benchmarks Partial Index              BASELINE       RESULT    INDEX
> System Call Overhead                          15000.0    7567076.6   5044.7
>                                                                    ========
> System Benchmarks Index Score (Partial Only)                         5044.7

Where is all this CPU time being saved? Do you have a profile
showing what functions in the kernel are running far more
efficiently now?

Yes, the results look good, but if all this change is doing is
micro-optimising a single code path, it's much less impressive and
far more likely that it has no impact on real-world performance...

More information, please!

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx