Re: [PATCH bpf-next v5 1/4] bpf: add bpf_get_cpu_cycles kfunc

On 13/11/2024 00:09, Alexei Starovoitov wrote:
On Tue, Nov 12, 2024 at 3:08 PM Vadim Fedorenko
<vadim.fedorenko@xxxxxxxxx> wrote:

On 12/11/2024 22:27, Alexei Starovoitov wrote:
On Tue, Nov 12, 2024 at 2:20 PM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:

On Tue, 2024-11-12 at 13:53 -0800, Eduard Zingerman wrote:
On Tue, 2024-11-12 at 21:39 +0000, Vadim Fedorenko wrote:

[...]

+                       if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+                           imm32 == BPF_CALL_IMM(bpf_get_cpu_cycles)) {
+                               /* Save RDX because RDTSC will use EDX:EAX to return u64 */
+                               emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);
+                               if (boot_cpu_has(X86_FEATURE_LFENCE_RDTSC))
+                                       EMIT_LFENCE();
+                               EMIT2(0x0F, 0x31);
+
+                               /* shl RDX, 32 */
+                               maybe_emit_1mod(&prog, BPF_REG_3, true);
+                               EMIT3(0xC1, add_1reg(0xE0, BPF_REG_3), 32);
+                               /* or RAX, RDX */
+                               maybe_emit_mod(&prog, BPF_REG_0, BPF_REG_3, true);
+                               EMIT2(0x09, add_2reg(0xC0, BPF_REG_0, BPF_REG_3));
+                               /* restore RDX from R11 */
+                               emit_mov_reg(&prog, true, BPF_REG_3, AUX_REG);

Note: The default implementation of this kfunc uses __arch_get_hw_counter(),
         which is implemented as `(u64)rdtsc_ordered() & S64_MAX`.
         Here we don't do `& S64_MAX`.
         The masking in __arch_get_hw_counter() was added by this commit:
         77750f78b0b3 ("x86/vdso: Fix gettimeofday masking").

I think we already discussed it with Alexei in v1: we don't really need
any masking here for the BPF case and can use the values provided by the
CPU directly. Within one BPF program we will never have both the inlined
and the non-inlined implementation of this helper, so the values being
compared will always come from the same source.
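
For reference, the x86 path of __arch_get_hw_counter() under discussion
boils down to roughly the following (a simplified sketch based on
arch/x86/include/asm/vdso/gettimeofday.h, with the non-TSC clock paths
elided); the `& S64_MAX` is exactly the masking that the JITed sequence
above does not emit:

static inline u64 __arch_get_hw_counter(s32 clock_mode,
                                        const struct vdso_data *vd)
{
        if (likely(clock_mode == VDSO_CLOCKMODE_TSC))
                return (u64)rdtsc_ordered() & S64_MAX;
        /* pvclock/hvclock fallbacks elided for brevity */
        return U64_MAX;
}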

         Also, the default implementation does not issue `lfence`.
         Not sure if this makes any real-world difference.

Well, it actually does. rdtsc_ordered() is translated into `lfence; rdtsc`
or `rdtscp` (which is rdtsc + lfence + a u32 cookie) depending on CPU
features.
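
For the concrete form: rdtsc_ordered() is roughly the following
(simplified from arch/x86/include/asm/msr.h), where ALTERNATIVE_2()
selects between plain rdtsc, "lfence; rdtsc" and rdtscp at boot based
on the CPU feature flags mentioned above:

static __always_inline u64 rdtsc_ordered(void)
{
        DECLARE_ARGS(val, low, high);

        asm volatile(ALTERNATIVE_2("rdtsc",
                                   "lfence; rdtsc", X86_FEATURE_LFENCE_RDTSC,
                                   "rdtscp", X86_FEATURE_RDTSCP)
                        : EAX_EDX_RET(val, low, high)
                        /* RDTSCP clobbers ECX with MSR_TSC_AUX. */
                        :: "ecx");

        return EAX_EDX_VAL(val, low, high);
}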

I see the following disassembly:

0000000000008980 <bpf_get_cpu_cycles>:
; {
      8980: f3 0f 1e fa                   endbr64
      8984: e8 00 00 00 00                callq   0x8989 <bpf_get_cpu_cycles+0x9>
                  0000000000008985:  R_X86_64_PLT32       __fentry__-0x4
;       asm volatile(ALTERNATIVE_2("rdtsc",
      8989: 0f 31                         rdtsc
      898b: 90                            nop
      898c: 90                            nop
      898d: 90                            nop
;       return EAX_EDX_VAL(val, low, high);
      898e: 48 c1 e2 20                   shlq    $0x20, %rdx
      8992: 48 09 d0                      orq     %rdx, %rax
      8995: 48 b9 ff ff ff ff ff ff ff 7f movabsq $0x7fffffffffffffff, %rcx # imm = 0x7FFFFFFFFFFFFFFF
;               return (u64)rdtsc_ordered() & S64_MAX;
      899f: 48 21 c8                      andq    %rcx, %rax
;       return __arch_get_hw_counter(1, NULL);
      89a2: 2e e9 00 00 00 00             jmp     0x89a8 <bpf_get_cpu_cycles+0x28>

Is this patched when the kernel is loaded, to replace the nops with lfence?
By real-world difference I meant the difference between the default
implementation and the inlined assembly.

Talked with Vadim off-list; he explained that 'rdtsc nop nop nop' is
indeed patched at kernel load. Regarding the S64_MAX masking, we just hope
it should never be an issue for the BPF use-case.
So, no more questions from my side.

Since the s64 question came up twice, it should be a comment.

sure, will do it.


nop nop as well.

do you mean why there are nop;nop instructions in the kernel's assembly?

An explanation of why the JITed code matches __arch_get_hw_counter.

Got it, will do
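
As a sketch of what the requested comment in the JIT inlining path could
cover (illustrative wording only, not actual kernel text), something
along these lines would capture both points raised above:

/*
 * The default (non-inlined) implementation of this kfunc is
 * __arch_get_hw_counter(), i.e. (u64)rdtsc_ordered() & S64_MAX.
 * rdtsc_ordered() is patched at boot into either "lfence; rdtsc" or
 * "rdtscp" depending on CPU features; the LFENCE_RDTSC check here
 * mirrors that.  The & S64_MAX masking is intentionally not emitted:
 * a given BPF program only ever sees one implementation (inlined or
 * not), so the values it compares always come from the same source.
 */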



