Re: [PATCH] file: always lock position

Mateusz Guzik <mjguzik@xxxxxxxxx> · Thu, 3 Aug 2023 17:18:49 +0200



of course failed to attach, remedied in this email :)

On 8/3/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On 8/3/23, Christian Brauner <brauner@xxxxxxxxxx> wrote:
>> On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
>>> On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
>>> > I really hate making the traditional unix single-threaded file
>>> > descriptor case take that lock.
>>> >
>>> > Maybe it doesn't matter. Obviously it can't have contention, and your
>>> > patch in that sense is pretty benign.
>>> >
>>> > But locking is just fundamentally expensive in the first place, and it
>>> > annoys me that I never realized that pidfd_getfd() did that thing that
>>> > I knew was broken for /proc.
>>> >
>>>
>>> So I got curious what the impact is and checked on quite a modern CPU
>>> (Sapphire Rapid), so nobody can claim it's some old yeller and atomics
>>> are nowhere near as expensive on modern uarchs.
>>>
>>> I used read1_processes from will-it-scale -- it is doing 4KB reads at a
>>> time over a 1MB file and dodges refing the file, but it does not dodge
>>> the lock with the patch at hand.
>>>
>>> In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
>>>
>>> The kernel was patched with a toggle to force or elide the proposed
>>> mandatory locking, like so:
>>> @@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
>>>         struct file *file = (struct file *)(v & ~3);
>>>
>>>         if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
>>> -               v |= FDPUT_POS_UNLOCK;
>>> -               mutex_lock(&file->f_pos_lock);
>>> +               if (file_count(file) > 1 || fdget_pos_mutex) {
>>> +                       v |= FDPUT_POS_UNLOCK;
>>> +                       mutex_lock(&file->f_pos_lock);
>>> +               }
>>>         }
>>>         return v;
>>>  }
>>>
>>> I got rather unstable single-threaded perf, going up and down several %
>>> between runs, I don't know yet what's that about. But toggling back and
>>
>> We've had the whole lkp intel performance testsuite run on this for a
>> long time to see whether there any performance regressions that aren't
>> in the noise. This includes all of will-it-scale. Also with a focus on
>> single-thread performance. So I'm a little skeptical about the
>> reliability of manual performance runs in comparison.
>>
>
> I did not merely post numbers claiming a difference, I justified it
> with profile output, prominently showing the mutex lock/unlock pair. I
> genuinely think this puts the ball in your court.
>
> One potential factor is mere CPU(s) -- maybe whatever was used in your
> bench is older than the CPU I used and for example is more negatively
> affected by mitigations like retpoline et al. I intentionally used
> something very high end, specifically: Intel(R) Xeon(R) Platinum 8470N
>
> Another potential factor is discrepancies in kernel config.
>
> I found one here:
> https://lore.kernel.org/oe-lkp/20230802103152.ae3s43z6yjkhnkee@quack3/T/#t
> , I'm guessing it is the same as the one used in Intel tests. In there
> I see:
> CONFIG_RANDOMIZE_KSTACK_OFFSET=y
>
> This is implemented with rdstsc, which comes with a massive premium to
> syscall cost. Also note it does not *fix* CPU bugs or whatever, it
> merely is a hardening measure to shuffle the stack pointer (way
> overpriced btw). Which is a long way of saying I sanity-checked
> syscall perf on my box with getppid and had to disable this opt (afair
> it bumped the rate from about 11 mln to 15 mln ops/s) and that it is
> fine to do so.
>
> I could easily see how more things like this combined with older CPUs
> would make the mutex trip small enough to not worry about, in that
> setting. I don't think this is the same as the cost of the change
> being negligible in general. I can't stress enough my config retains
> retpoline and other mitigations for actual bugs, so it's not like I
> cheated the system.
>
> I'm attaching my config for reference. It is what's present in Debian
> + a bunch of stuff disabled.
>
>> If Linus thinks it matters then I think we should only take the
>> f_pos_lock unconditionally on directory fds as this is where it really
>> matters. For read/write we're only doing it because of posix anyway and
>> using pidfd_getfd() is squarely out of posix anyway. So we can document
>> that using pidfd_getfd() on a regular file descriptor outside of a
>> seccomp stopped task is not recommended.
>>
>
> In stock kernel the code is:
>         if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
>                 if (file_count(file) > 1) {
>                         v |= FDPUT_POS_UNLOCK;
>                         mutex_lock(&file->f_pos_lock);
>                 }
>         }
>
> after pidfd_getfd succeeds the count is > 1, so the thing to worry
> about is racing with the thread calling fdget_pos or making sure it
> stopped using the fd, but only the first time around. Perhaps there is
> a way to wait it out (as in guarantee it is not using the fd *or*
> already had seen the bumped refcount) which would be perfectly
> acceptable for pidfd consumers? Note the thread may be stuck in an
> uninterruptible, indefinite sleep somewhere while using it without the
> lock. On the other hand if such a thing can happen with f_pos_lock
> mutex held, the pidfd consumer running into the same codepath with the
> lock would be in the same spot, so perhaps this is not so bad?
>
> Note this is definitely hackable just for fun without pessimizing
> fdget_pos much, but off hand I don't have anything reasonable for
> real-world use. I'm going to sleep on it.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>


-- 
Mateusz Guzik <mjguzik gmail.com>
Attachment:
config

Description: Binary data