On Tue, Aug 6, 2024 at 10:47 PM Andi Kleen <ak@xxxxxxxxxxxxxxx> wrote:
>
> > Before I get to the vfs layer, there is a significant loss in the
> > memory allocator because of memcg -- it takes several irq off/on trips
> > for every alloc (needed to grab struct file *). I have a plan what to
> > do with it (handle stuff with local cmpxchg (note no lock prefix)),
> > which I'm trying to get around to. Apart from that you may note the
> > allocator fast path performs a 16-byte cmpxchg, which is again dog
> > slow and executes twice (once for the file obj, another time for the
> > namei buffer). Someone(tm) should patch it up and I have some vague
> > ideas, but 0 idea when I can take a serious stab.
>
> I just LBR sampled it on my skylake and it doesn't look
> particularly slow. You see the whole massive block including CMPXCHG16
> gets IPC 2.7, which is rather good. If you see lots of cycles on it it's likely
> a missing cache line.
>
> kmem_cache_free:
> ffffffff9944ce20 nop %edi, %edx
> ffffffff9944ce24 nopl %eax, (%rax,%rax,1)
> ffffffff9944ce29 pushq %rbp
> ffffffff9944ce2a mov %rdi, %rdx
> ffffffff9944ce2d mov %rsp, %rbp
> ffffffff9944ce30 pushq %r15
> ffffffff9944ce32 pushq %r14
> ffffffff9944ce34 pushq %r13
> ffffffff9944ce36 pushq %r12
> ffffffff9944ce38 mov $0x80000000, %r12d
> ffffffff9944ce3e pushq %rbx
> ffffffff9944ce3f mov %rsi, %rbx
> ffffffff9944ce42 and $0xfffffffffffffff0, %rsp
> ffffffff9944ce46 sub $0x10, %rsp
> ffffffff9944ce4a movq %gs:0x28, %rax
> ffffffff9944ce53 movq %rax, 0x8(%rsp)
> ffffffff9944ce58 xor %eax, %eax
> ffffffff9944ce5a add %rsi, %r12
> ffffffff9944ce5d jb 0xffffffff9944d1ea
> ffffffff9944ce63 mov $0xffffffff80000000, %rax
> ffffffff9944ce6a xor %r13d, %r13d
> ffffffff9944ce6d subq 0x17b068c(%rip), %rax
> ffffffff9944ce74 add %r12, %rax
> ffffffff9944ce77 shr $0xc, %rax
> ffffffff9944ce7b shl $0x6, %rax
> ffffffff9944ce7f addq 0x17b066a(%rip), %rax
> ffffffff9944ce86 movq 0x8(%rax), %rcx
> ffffffff9944ce8a test $0x1, %cl
> ffffffff9944ce8d jnz 0xffffffff9944d15c
> ffffffff9944ce93 nopl %eax, (%rax,%rax,1)
> ffffffff9944ce98 movq (%rax), %rcx
> ffffffff9944ce9b and $0x8, %ch
> ffffffff9944ce9e jz 0xffffffff9944cfea
> ffffffff9944cea4 test %rax, %rax
> ffffffff9944cea7 jz 0xffffffff9944cfea
> ffffffff9944cead movq 0x8(%rax), %r14
> ffffffff9944ceb1 test %r14, %r14
> ffffffff9944ceb4 jz 0xffffffff9944cfac
> ffffffff9944ceba cmp %r14, %rdx
> ffffffff9944cebd jnz 0xffffffff9944d165
> ffffffff9944cec3 test %r14, %r14
> ffffffff9944cec6 jz 0xffffffff9944cfac
> ffffffff9944cecc movq 0x8(%rbp), %r15
> ffffffff9944ced0 nopl %eax, (%rax,%rax,1)
> ffffffff9944ced5 movq 0x1fe5134(%rip), %rax
> ffffffff9944cedc test %r13, %r13
> ffffffff9944cedf jnz 0xffffffff9944ceef
> ffffffff9944cee1 mov $0xffffffff80000000, %rax
> ffffffff9944cee8 subq 0x17b0611(%rip), %rax
> ffffffff9944ceef add %rax, %r12
> ffffffff9944cef2 shr $0xc, %r12
> ffffffff9944cef6 shl $0x6, %r12
> ffffffff9944cefa addq 0x17b05ef(%rip), %r12
> ffffffff9944cf01 movq 0x8(%r12), %rax
> ffffffff9944cf06 mov %r12, %r13
> ffffffff9944cf09 test $0x1, %al
> ffffffff9944cf0b jnz 0xffffffff9944d1b1
> ffffffff9944cf11 nopl %eax, (%rax,%rax,1)
> ffffffff9944cf16 movq (%r13), %rax
> ffffffff9944cf1a movq %rbx, (%rsp)
> ffffffff9944cf1e test $0x8, %ah
> ffffffff9944cf21 mov $0x0, %eax
> ffffffff9944cf26 cmovz %rax, %r13
> ffffffff9944cf2a data16 nop
> ffffffff9944cf2c movq 0x38(%r13), %r8
> ffffffff9944cf30 cmp $0x3, %r8
> ffffffff9944cf34 jnbe 0xffffffff9944d1ca
> ffffffff9944cf3a nopl %eax, (%rax,%rax,1)
> ffffffff9944cf3f movq 0x23d6f72(%rip), %rax
> ffffffff9944cf46 mov %rbx, %rdx
> ffffffff9944cf49 sub %rax, %rdx
> ffffffff9944cf4c cmp $0x1fffff, %rdx
> ffffffff9944cf53 jbe 0xffffffff9944d03a
> ffffffff9944cf59 movq (%r14), %rax
> ffffffff9944cf5c addq %gs:0x66bccab4(%rip), %rax
> ffffffff9944cf64 movq 0x8(%rax), %rdx
> ffffffff9944cf68 cmpq %r13, 0x10(%rax)
> ffffffff9944cf6c jnz 0xffffffff9944d192
> ffffffff9944cf72 movl 0x28(%r14), %ecx
> ffffffff9944cf76 movq (%rax), %rax
> ffffffff9944cf79 add %rbx, %rcx
> ffffffff9944cf7c cmp %rbx, %rax
> ffffffff9944cf7f jz 0xffffffff9944d1ba
> ffffffff9944cf85 movq 0xb8(%r14), %rsi
> ffffffff9944cf8c mov %rcx, %rdi
> ffffffff9944cf8f bswap %rdi
> ffffffff9944cf92 xor %rax, %rsi
> ffffffff9944cf95 xor %rdi, %rsi
> ffffffff9944cf98 movq %rsi, (%rcx)
> ffffffff9944cf9b leaq 0x2000(%rdx), %rcx
> ffffffff9944cfa2 movq (%r14), %rsi
> ffffffff9944cfa5 cmpxchg16bx %gs:(%rsi)
> ffffffff9944cfaa jnz 0xffffffff9944cf59
> ffffffff9944cfac movq 0x8(%rsp), %rax
> ffffffff9944cfb1 subq %gs:0x28, %rax
> ffffffff9944cfba jnz 0xffffffff9944d1fc
> ffffffff9944cfc0 leaq -0x28(%rbp), %rsp
> ffffffff9944cfc4 popq %rbx
> ffffffff9944cfc5 popq %r12
> ffffffff9944cfc7 popq %r13
> ffffffff9944cfc9 popq %r14
> ffffffff9944cfcb popq %r15
> ffffffff9944cfcd popq %rbp
> ffffffff9944cfce retq # PRED 38 cycles [126] 2.74 IPC <-------------

Sorry for the late reply; my test box was temporarily unavailable and
then I forgot about this e-mail :)

I don't have a good scientific test(tm) and I don't think coming up
with one is warranted at the moment. But to illustrate, I slapped
together a test case for will-it-scale where I spin either an 8-byte
or a 16-byte cmpxchg in a loop (the loops are sketched at the end of
this mail). No lock prefix on either.

On Sapphire Rapids I see well over twice the throughput for the
8-byte variant:

# ./cmpxchg8_processes
warmup
min:481465497 max:481465497 total:481465497
min:464439645 max:464439645 total:464439645
min:461884735 max:461884735 total:461884735
min:460850043 max:460850043 total:460850043
min:461066452 max:461066452 total:461066452
min:463984473 max:463984473 total:463984473
measurement
min:461317703 max:461317703 total:461317703
min:458608942 max:458608942 total:458608942
min:460846336 max:460846336 total:460846336
[snip]

# ./cmpxchg16b_processes
warmup
min:205207128 max:205207128 total:205207128
min:205010535 max:205010535 total:205010535
min:204877781 max:204877781 total:204877781
min:204163814 max:204163814 total:204163814
min:204392000 max:204392000 total:204392000
min:204094222 max:204094222 total:204094222
measurement
min:204243282 max:204243282 total:204243282
min:204136589 max:204136589 total:204136589
min:203504119 max:203504119 total:203504119

So I would say trying it out in a real allocator is worth looking at.

Of course the 16-byte variant is not used just for kicks, so going to
8 bytes is more involved than just replacing the instruction. The
current code follows the standard idea for dealing with the ABA
problem: apart from swapping the pointer, you validate it is the one
you loaded by checking a counter in the same instruction.

I note that in the kernel we can do better, but I don't have all the
kinks worked out yet. The core idea builds on the fact that we can
cheaply detect a pending alloc on the same cpu; should a conflicting
free be executing from an interrupt, it can instead add the returning
buffer to a different list and the ABA problem disappears. Should the
alloc fast path fail to find a free buffer, it can disable interrupts
and take a look at the fallback list.
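For reference, here is roughly what the two microbenchmark loops boil
down to. This is a reconstruction, not the exact testcase, and I'm
quoting the will-it-scale harness interface from memory (it spawns the
processes/threads and samples *iterations once a second); note the
absence of a lock prefix on both instructions:

/*
 * Sketch of a will-it-scale testcase. do_cmpxchg8/do_cmpxchg16 are
 * invented names for illustration.
 */
static unsigned long var8;
static unsigned __int128 var16 __attribute__((aligned(16)));

static inline void do_cmpxchg8(void)
{
	unsigned long old = 0;

	/* cmpxchg r64, m64: compares %rax against memory */
	asm volatile("cmpxchg %2, %1"
		     : "+a" (old), "+m" (var8)
		     : "r" (1UL)
		     : "cc");
}

static inline void do_cmpxchg16(void)
{
	unsigned long lo = 0, hi = 0;

	/* cmpxchg16b m128: compares %rdx:%rax, swaps in %rcx:%rbx */
	asm volatile("cmpxchg16b %0"
		     : "+m" (var16), "+a" (lo), "+d" (hi)
		     : "b" (1UL), "c" (0UL)
		     : "cc");
}

void testcase(unsigned long *iterations, unsigned long nr)
{
	(void)nr;

	while (1) {
		do_cmpxchg8();	/* do_cmpxchg16() in the 16b binary */
		(*iterations)++;
	}
}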
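To spell out the counter trick the current fast path relies on, a
userspace-flavored sketch in the spirit of SLUB's freelist+tid pair --
names and layout are made up, and __sync_bool_compare_and_swap (with
-mcx16) emits a lock-prefixed cmpxchg16b where the kernel's per-cpu
variant gets away without the lock:

/*
 * Pop with the pointer+counter ABA defense. A plain 8-byte cmpxchg on
 * ->head alone is unsafe: between loading the head and the cmpxchg,
 * an interrupt can pop that object and push it back with a different
 * ->next, leaving the head equal but the list corrupted. Widening the
 * compare to also cover a transaction counter catches this.
 */
#include <stdbool.h>
#include <string.h>

struct freelist {
	void *head;		/* first free object; its first word is ->next */
	unsigned long tid;	/* bumped on every successful transaction */
} __attribute__((aligned(16)));

static bool freelist_pop(struct freelist *fl, void **objp)
{
	for (;;) {
		struct freelist old = *fl;	/* racy read, validated below */
		struct freelist new;
		unsigned __int128 oldv, newv;

		if (!old.head)
			return false;

		new.head = *(void **)old.head;	/* next object in the chain */
		new.tid = old.tid + 1;

		memcpy(&oldv, &old, sizeof(oldv));
		memcpy(&newv, &new, sizeof(newv));
		if (__sync_bool_compare_and_swap((unsigned __int128 *)fl,
						 oldv, newv)) {
			*objp = old.head;
			return true;
		}
	}
}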
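Finally, a very rough sketch of the control flow I mean by the above.
Nothing like this exists anywhere, all names are invented, and it
assumes per-cpu lists with interrupts as the only source of
concurrency; I also generalized the pending-alloc flag to cover frees,
since a free can equally be interrupted by another free. Allocations
coming from interrupt context are among the kinks it ignores:

/*
 * The interrupt-diversion scheme: a plain flag marks a list operation
 * in flight on this cpu. A free arriving from an interrupt sees the
 * flag and diverts its object to a side list, so the head the
 * interrupted operation loaded can never be recycled under it -- no
 * ABA, meaning an 8-byte cmpxchg (or, in this simplified single-cpu
 * model, even a plain load/store pair) suffices. When the main list
 * is empty, the alloc masks interrupts and drains the side list.
 */
#include <stdbool.h>

#define barrier() asm volatile("" ::: "memory")

/* stand-ins for the real interrupt masking helpers */
static void local_irq_disable(void) { }
static void local_irq_enable(void) { }

struct obj {
	struct obj *next;
};

/* per-cpu in the real thing; plain statics for the sketch */
static struct obj *freelist;
static struct obj *fallback;	/* only drained with interrupts off */
static bool list_busy;

static void free_obj(struct obj *o)	/* may run from an interrupt */
{
	if (list_busy) {
		/* don't disturb the list the interrupted op is using */
		o->next = fallback;
		fallback = o;
		return;
	}
	/* an interrupting free either completes fully before the	*/
	/* store below retires or sees the flag and diverts		*/
	list_busy = true;
	barrier();
	o->next = freelist;
	freelist = o;
	barrier();
	list_busy = false;
}

static struct obj *alloc_obj(void)
{
	struct obj *o;

	list_busy = true;	/* cheap local store, no atomics at all */
	barrier();		/* order the flag against the list access */
	o = freelist;
	if (o)
		freelist = o->next;
	barrier();
	list_busy = false;

	if (!o) {
		local_irq_disable();
		o = fallback;	/* slow path: drain the side list */
		if (o)
			fallback = o->next;
		local_irq_enable();
	}
	return o;
}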
--
Mateusz Guzik <mjguzik gmail.com>