Re: [PATCH v2] mm: memory: check userfaultfd_wp() in vmf_orig_pte_uffd_wp()

Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> · Sat, 20 Apr 2024 12:05:04 +0800

On 2024/4/19 23:17, Peter Xu wrote:
On Fri, Apr 19, 2024 at 11:00:46AM +0800, Kefeng Wang wrote:

On 2024/4/19 0:32, Peter Xu wrote:
Hi, Kefeng,

On Thu, Apr 18, 2024 at 08:06:41PM +0800, Kefeng Wang wrote:
Add userfaultfd_wp() check in vmf_orig_pte_uffd_wp() to avoid the
unnecessary pte_marker_entry_uffd_wp() in most pagefault, difference
as shows below from perf data of lat_pagefault, note, the function
vmf_orig_pte_uffd_wp() is not inlined in the two kernel versions.

    perf report -i perf.data.before | grep vmf
       0.17%     0.13%  lat_pagefault  [kernel.kallsyms]      [k] vmf_orig_pte_uffd_wp.part.0.isra.0
    perf report -i perf.data.after  | grep vmf

Any real number to share too besides the perf greps?  I meant, even if perf
report will not report such function anymore, it doesn't mean it'll be
faster, and how much it improves?

dd if=/dev/zero of=/tmp/XXX bs=512M count=1
./lat_pagefault -W 5 -N 5 /tmp/XXX

	before		after	
1	0.2623		0.2605	
2	0.2622		0.2598	
3	0.2621		0.2595	
4	0.2622		0.2600	
5	0.2651		0.2598	
6	0.2624		0.2594	
7	0.2624		0.2605	
8	0.2627		0.2608	
average	0.262675	0.2600375	-0.0026375

The lat_pagefault does show some improvement(also I reboot and retest,
the results are same).

Thanks. Could you replace the perf report with these real data?  Or just
append to it.

Sure, will append it.

I had a look at the asm and indeed the current code will generate two
jumps when without this patch, and I don't know why..

    0x0000000000006ac4 <+52>:    test   $0x8,%ah                 <---- check FAULT_FLAG_ORIG_PTE_VALID
    0x0000000000006ac7 <+55>:    jne    0x6bcf <set_pte_range+319>
    0x0000000000006acd <+61>:    mov    0x18(%rbp),%rsi

    ...

    0x0000000000006bcf <+319>:   mov    0x40(%rdi),%rdi
    0x0000000000006bd3 <+323>:   test   $0xffffffffffffff9f,%rdi <---- pte_none() check
    0x0000000000006bda <+330>:   je     0x6acd <set_pte_range+61>
    0x0000000000006be0 <+336>:   test   $0x101,%edi              <---- pte_present() check
    0x0000000000006be6 <+342>:   jne    0x6acd <set_pte_range+61>
    0x0000000000006bec <+348>:   call   0x1c50 <pte_to_swp_entry>
    0x0000000000006bf1 <+353>:   mov    0x0(%rip),%rdx        # 0x6bf8 <set_pte_range+360>
    0x0000000000006bf8 <+360>:   mov    %rax,%r15
    0x0000000000006bfb <+363>:   shr    $0x3a,%rax
    0x0000000000006bff <+367>:   cmp    $0x1f,%rax
    0x0000000000006c03 <+371>:   mov    $0x0,%eax
    0x0000000000006c08 <+376>:   cmovne %rax,%r15
    0x0000000000006c0c <+380>:   mov    0x28(%rbx),%eax
    0x0000000000006c0f <+383>:   and    $0x1,%r15d
    0x0000000000006c13 <+387>:   jmp    0x6acd <set_pte_range+61>

I also don't know why the compiler cannot already merge the none+present
check into one shot, I thought it could.  Also surprised me that
pte_to_swp_entry() is a function call.. but not involved in this context.

So I think I was right it should bypass this when seeing it pte_none,
however that includes two jumps.

And with your patch applied the two jumps are not there:

    0x0000000000006b0c <+124>:   testb  $0x8,0x29(%r14)           <--- FAULT_FLAG_ORIG_PTE_VALID
    0x0000000000006b11 <+129>:   je     0x6b6a <set_pte_range+218>
    0x0000000000006b13 <+131>:   mov    (%r14),%rax
    0x0000000000006b16 <+134>:   testb  $0x10,0x21(%rax)          <--- userfaultfd_wp(vmf->vma) check
    0x0000000000006b1a <+138>:   je     0x6b6a <set_pte_range+218>

Maybe that's what contributes to that 0.x% extra time of a fault.

So if we do care about this 0.x% and we're doing this anyway, perhaps move

The latency of lat_pagefault increased a lot than the old kernel(vs 
5.10), except mm counter updating, the another obvious difference
shown from perf graph is the new vmf_orig_pte_uffd_wp().

the vma check upper?  Because afaict FAULT_FLAG_ORIG_PTE_VALID should
always hit in set_pte_range(), so we can avoid two more insts in the common
paths.

Moving it upper is better, and maybe add __always_inline to 
vmf_orig_pte_uffd_wp() to make set_pte_range() only check VM_UFFD_WP 
from vm_flags?

I'll leave that to you too if you want to mention some details in above and
add that into the commit log.

Will update the changelog, thanks.

Thanks,