Re: [robertosassu:evm-iint-ptr-v1-devel-v3] [evm] e38e699a42: will-it-scale.per_process_ops 160.4% improvement

Mateusz Guzik <mjguzik@xxxxxxxxx> · Mon, 17 Feb 2025 08:58:08 +0100

On Mon, Feb 17, 2025 at 02:45:23PM +0800, kernel test robot wrote:
> kernel test robot noticed a 160.4% improvement of will-it-scale.per_process_ops on:
> 
> 
> commit: e38e699a42b4db5daf7dac453759fdc8ba0dab31 ("evm: Move metadata in the inode security blob to a pointer")
> https://github.com/robertosassu/linux evm-iint-ptr-v1-devel-v3
> 
>      24.57           +72.2       96.76        perf-profile.children.cycles-pp.__fput
>      24.63           +72.4       97.02        perf-profile.children.cycles-pp.__x64_sys_close
>      24.67           +72.5       97.17        perf-profile.children.cycles-pp.__close
>       0.00           +95.2       95.17        perf-profile.children.cycles-pp.osq_lock
>       0.00           +95.5       95.53        perf-profile.children.cycles-pp.__mutex_lock
>       0.00           +95.7       95.71        perf-profile.children.cycles-pp.evm_file_release
>       0.00           +95.9       95.87        perf-profile.children.cycles-pp.security_file_release
>      98.69           -98.7        0.00        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

Contrary to what's indicated in the report, this change is in fact a
significant slowdown (or rather, will be when other problems get fixed).

The open3 microbenchmark issues open + close in a loop on the same file.

On the stock kernel the some of the problem is false-sharing within
struct inode.

The biggest bottleneck is lockref manipulation:
- there is lockref acquire and release happening *twice* instead of just
  once
- the lockref facility is prone to degrading to operation under a
  spinlock and staying there when microbenchmarked. you can see on the
  profile this does happen here

evm also used to pop up, which I patched away in 699ae6241920b0fa ("evm:
stop avoidably reading i_writecount in evm_file_release")

Your patch adds a mutex which adds 2 atomics to the fast path (so slows
down single-threaded operation) and more importantly adds a
serialization point for multithreaded operation.

In this case the resulting contention helps decrease the loss of
performance in lockref and that's how there is an apparent win.

I have a WIP patch to move dentries away from using lockref, which will
in turn avoid the degradation. Should it land, the mutex added here will
be the new bottleneck.

It needs to be avoided by default. Do you *need* to test the condition
in evm_file_release() with the lock held? Perhaps initial test can be
done without and redone after acquiring it?