On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
> I really hate making the traditional unix single-threaded file
> descriptor case take that lock.
>
> Maybe it doesn't matter. Obviously it can't have contention, and your
> patch in that sense is pretty benign.
>
> But locking is just fundamentally expensive in the first place, and it
> annoys me that I never realized that pidfd_getfd() did that thing that
> I knew was broken for /proc.
>

So I got curious what the impact is and checked on quite a modern CPU
(Sapphire Rapids), so nobody can claim it's some old yeller and that
atomics are nowhere near as expensive on modern uarchs.

I used read1_processes from will-it-scale -- it does 4KB reads at a time
over a 1MB file and dodges refing the file, but it does not dodge the
f_pos lock with the patch at hand.

In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).

The kernel was patched with a toggle to force or elide the proposed
mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);

 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }

I got rather unstable single-threaded perf, going up and down several %
between runs; I don't know yet what that's about. But toggling back and
forth while the bench was running consistently gave the aforementioned
~5% difference.

perf top claims:

[snip]
   32.64%  [kernel]          [k] copyout
   10.83%  [kernel]          [k] entry_SYSCALL_64
    6.69%  libc.so.6         [.] read
    6.16%  [kernel]          [k] filemap_get_read_batch
    5.62%  [kernel]          [k] filemap_read
    3.39%  [kernel]          [k] __fget_light
    2.92%  [kernel]          [k] mutex_lock <-- oh
    2.70%  [kernel]          [k] mutex_unlock <-- no
    2.33%  [kernel]          [k] vfs_read
    2.18%  [kernel]          [k] _copy_to_iter
    1.93%  [kernel]          [k] atime_needs_update
    1.74%  [kernel]          [k] __fsnotify_parent
    1.29%  read1_processes   [.] testcase
[/snip]

[note: running perf alongside the bench changes throughput to some extent]

So yes, atomics remain expensive on x86-64 even on a very modern uarch,
and their impact is measurable in a syscall like read. Consequently,
eliding this mutex trip would be most welcome.

Happy to rant,