On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
> I really hate making the traditional unix single-threaded file
> descriptor case take that lock.
>
> Maybe it doesn't matter. Obviously it can't have contention, and your
> patch in that sense is pretty benign.
>
> But locking is just fundamentally expensive in the first place, and it
> annoys me that I never realized that pidfd_getfd() did that thing that
> I knew was broken for /proc.
>

So I got curious what the impact is and checked on quite a modern CPU
(Sapphire Rapids), so nobody can claim it's some old yeller and that
atomics are nowhere near as expensive on modern uarchs.

I used read1_processes from will-it-scale -- it does 4KB reads at a time
over a 1MB file and dodges refing the file, but it does not dodge the
f_pos lock with the patch at hand.

In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).

The kernel was patched with a toggle to force or elide the proposed
mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);

 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }

I got rather unstable single-threaded perf, going up and down several %
between runs; I don't know yet what that's about. But toggling back and
forth while the bench was running consistently gave the aforementioned
~5% difference.

perf top claims:

[snip]
   32.64%  [kernel]          [k] copyout
   10.83%  [kernel]          [k] entry_SYSCALL_64
    6.69%  libc.so.6         [.] read
    6.16%  [kernel]          [k] filemap_get_read_batch
    5.62%  [kernel]          [k] filemap_read
    3.39%  [kernel]          [k] __fget_light
    2.92%  [kernel]          [k] mutex_lock <-- oh
    2.70%  [kernel]          [k] mutex_unlock <-- no
    2.33%  [kernel]          [k] vfs_read
    2.18%  [kernel]          [k] _copy_to_iter
    1.93%  [kernel]          [k] atime_needs_update
    1.74%  [kernel]          [k] __fsnotify_parent
    1.29%  read1_processes   [.] testcase
[/snip]

[note: running perf alongside the bench changes throughput to some extent]

So yes, atomics remain expensive on x86-64 even on a very modern uarch,
and their impact is measurable in a syscall like read. Consequently,
eliding this mutex trip would be most welcome.

Happy to rant,