Re: [PATCH] file: always lock position

On 8/3/23, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
>> On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
>> > I really hate making the traditional unix single-threaded file
>> > descriptor case take that lock.
>> >
>> > Maybe it doesn't matter. Obviously it can't have contention, and your
>> > patch in that sense is pretty benign.
>> >
>> > But locking is just fundamentally expensive in the first place, and it
>> > annoys me that I never realized that pidfd_getfd() did that thing that
>> > I knew was broken for /proc.
>> >
>>
>> So I got curious what the impact is and checked on quite a modern CPU
>> (Sapphire Rapid), so nobody can claim it's some old yeller and atomics
>> are nowhere near as expensive on modern uarchs.
>>
>> I used read1_processes from will-it-scale -- it is doing 4KB reads at a
>> time over a 1MB file and dodges refing the file, but it does not dodge
>> the lock with the patch at hand.
>>
>> In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
>>
>> The kernel was patched with a toggle to force or elide the proposed
>> mandatory locking, like so:
>> @@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
>>         struct file *file = (struct file *)(v & ~3);
>>
>>         if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
>> -               v |= FDPUT_POS_UNLOCK;
>> -               mutex_lock(&file->f_pos_lock);
>> +               if (file_count(file) > 1 || fdget_pos_mutex) {
>> +                       v |= FDPUT_POS_UNLOCK;
>> +                       mutex_lock(&file->f_pos_lock);
>> +               }
>>         }
>>         return v;
>>  }
>>
>> I got rather unstable single-threaded perf, going up and down several %
>> between runs, I don't know yet what's that about. But toggling back and
>
> We've had the whole lkp intel performance testsuite run on this for a
> long time to see whether there any performance regressions that aren't
> in the noise. This includes all of will-it-scale. Also with a focus on
> single-thread performance. So I'm a little skeptical about the
> reliability of manual performance runs in comparison.
>

I did not merely post numbers claiming a difference; I justified it
with profile output, prominently showing the mutex lock/unlock pair. I
genuinely think this puts the ball in your court.
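
For reference, the benchmark loop boils down to something like the
below -- a rough sketch rather than the actual will-it-scale harness
(the file name and iteration count are made up), but it shows why every
op goes through fdget_pos with a file refcount of 1:

/* read1-style loop: 4KB read()s over a small file, wrapping at EOF.
 * Plain read() uses the file position, so every iteration goes through
 * __fdget_pos()/__fdput_pos() with no other reference to the file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	unsigned long i, ops = 0;
	int fd = open("testfile", O_RDONLY); /* assumed: 1MB scratch file */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < 10000000; i++) {
		if (read(fd, buf, sizeof(buf)) <= 0)
			lseek(fd, 0, SEEK_SET); /* wrap around at EOF */
		else
			ops++;
	}
	printf("%lu reads\n", ops);
	close(fd);
	return 0;
}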

One potential factor is simply the CPU(s) -- maybe whatever was used in
your bench is older than the CPU I used and, for example, is more
negatively affected by mitigations like retpoline et al. I
intentionally used something very high end, specifically: Intel(R)
Xeon(R) Platinum 8470N.

Another potential factor is discrepancies in kernel config.

I found one here:
https://lore.kernel.org/oe-lkp/20230802103152.ae3s43z6yjkhnkee@quack3/T/#t
I'm guessing it is the same as the one used in the Intel tests. In
there I see:
CONFIG_RANDOMIZE_KSTACK_OFFSET=y

This is implemented with rdtsc, which comes with a massive premium to
syscall cost. Also note it does not *fix* CPU bugs or anything of the
sort; it is merely a hardening measure that shuffles the stack pointer
(way overpriced btw). Which is a long way of saying that when I
sanity-checked syscall perf on my box with getppid I had to disable
this option (afair it bumped the rate from about 11 mln to 15 mln
ops/s), and that it is fine to do so.
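
The sanity check itself was along these lines -- a sketch, not the
exact program I ran (the iteration count and timing handling are
illustrative):

/* Time a tight loop of a trivial syscall to estimate per-syscall
 * overhead; this is what RANDOMIZE_KSTACK_OFFSET (and mitigations in
 * general) inflate.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec start, end;
	unsigned long i, iters = 50000000;
	double secs;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++)
		getppid(); /* about as cheap as a real syscall gets */
	clock_gettime(CLOCK_MONOTONIC, &end);

	secs = (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%.0f getppid ops/s\n", iters / secs);
	return 0;
}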

I could easily see how more things like this, combined with older CPUs,
would make the mutex trip small enough not to worry about in that
setting. I don't think that is the same as the cost of the change being
negligible in general. I can't stress enough that my config retains
retpoline and other mitigations for actual bugs, so it's not like I
cheated the system.

I'm attaching my config for reference. It is what's present in Debian
+ a bunch of stuff disabled.

> If Linus thinks it matters then I think we should only take the
> f_pos_lock unconditionally on directory fds as this is where it really
> matters. For read/write we're only doing it because of posix anyway and
> using pidfd_getfd() is squarely out of posix anyway. So we can document
> that using pidfd_getfd() on a regular file descriptor outside of a
> seccomp stopped task is not recommended.
>

In the stock kernel the code is:
        if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
                if (file_count(file) > 1) {
                        v |= FDPUT_POS_UNLOCK;
                        mutex_lock(&file->f_pos_lock);
                }
        }

After pidfd_getfd succeeds the count is > 1, so the thing to worry
about is racing with the target thread calling fdget_pos before it has
seen the bumped refcount, or making sure it stopped using the fd -- but
only the first time around. Perhaps there is a way to wait it out (as
in guarantee it is either not using the fd *or* has already seen the
bumped refcount), which would be perfectly acceptable for pidfd
consumers? Note the thread may be stuck in an uninterruptible,
indefinite sleep somewhere while using the fd without the lock. On the
other hand, if such a thing can happen with the f_pos_lock mutex held,
a pidfd consumer running into the same codepath with the lock would be
in the same spot, so perhaps this is not so bad?
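
For context, grabbing the fd from the outside is roughly the below --
a sketch using raw syscall numbers (the pid and fd number are made up,
and it needs ptrace-level permission over the target):

/* Duplicate another process's fd via pidfd_getfd(). Once this
 * succeeds, file_count() on the shared struct file is > 1, which is
 * what the check in __fdget_pos() keys off -- but the target may have
 * already entered fdget_pos() while the count was still 1.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	pid_t target = 1234;	/* made-up target pid */
	int targetfd = 3;	/* made-up fd number in the target */
	int pidfd, fd;

	pidfd = syscall(SYS_pidfd_open, target, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}
	fd = syscall(SYS_pidfd_getfd, pidfd, targetfd, 0);
	if (fd < 0) {
		perror("pidfd_getfd");
		return 1;
	}
	printf("got fd %d referring to the same struct file\n", fd);
	return 0;
}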

Note this is definitely hackable just for fun without pessimizing
fdget_pos much, but off hand I don't have anything reasonable for
real-world use. I'm going to sleep on it.

-- 
Mateusz Guzik <mjguzik gmail.com>


