Hi, On 12/21/2023 7:37 PM, Hou Tao wrote: > Hi, > > On 12/21/2023 2:46 AM, Daniel Borkmann wrote: >> On 12/20/23 7:25 PM, Alexei Starovoitov wrote: >>> On Wed, Dec 20, 2023 at 6:58 AM Daniel Borkmann >>> <daniel@xxxxxxxxxxxxx> wrote: >>>> On 12/19/23 2:56 PM, Hou Tao wrote: >>>>> From: Hou Tao <houtao1@xxxxxxxxxx> >>>>> >>>>> According to the implementation of atomic_xchg() under x86-64, the >>>>> lock >>>>> prefix is not necessary for BPF_XCHG atomic operation, so just remove >>>>> it. >>>> It's probably a good idea for the commit message to explicitly quote >>>> the >>>> Intel docs in here, so it's easier to find on why the lock prefix would >>>> not be needed for the xchg op. >>> It's a surprise to me as well. >>> Definitely more info would be good. >>> >>> Also if xchg insn without lock somehow implies lock in the HW >>> what is the harm of adding it explicitly? >>> If it's a lock in HW than performance with and without lock prefix >>> should be the same, right? >> e.g. 7.3.1.2 Exchange Instructions says: >> >> The XCHG (exchange) instruction swaps the contents of two operands. >> This >> instruction takes the place of three MOV instructions and does not >> require >> a temporary location to save the contents of one operand location while >> the other is being loaded. When a memory operand is used with the XCHG >> instruction, the processor’s LOCK signal is automatically asserted. >> >> Also curious if there is any harm adding it explicitly. >> >> . > I could use the bpf ma benchmark to test it, but I doubt it will make > any visible difference. > > . The following is the performance comparison between inlined-kptr-xchg() without and with lock prefix. It seems for helper without-lock-prefix the peak free performance is indeed larger than the helper with lock prefix, but it is so tiny that it may be just fluctuations. So I will keep the JIT of BPF_XCHG() as-is. (1) inlined bpf_kptr_xchg() without lock prefix: $ for i in 1 2 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a; done | grep Summary Summary: per-prod alloc 11.50 ± 0.25M/s free 37.52 ± 0.59M/s, total memory usage 0.00 ± 0.00MiB Summary: per-prod alloc 8.52 ± 0.14M/s free 34.73 ± 0.46M/s, total memory usage 0.01 ± 0.00MiB Summary: per-prod alloc 7.40 ± 0.09M/s free 36.09 ± 0.65M/s, total memory usage 0.02 ± 0.00MiB Summary: per-prod alloc 6.62 ± 0.16M/s free 36.48 ± 2.06M/s, total memory usage 0.07 ± 0.00MiB $ for j in $(seq 3); do ./bench -w3 -d10 bpf_ma -p1 -a; done | grep Summary Summary: per-prod alloc 10.81 ± 0.12M/s free 36.78 ± 0.35M/s, total memory usage 0.00 ± 0.00MiB Summary: per-prod alloc 10.77 ± 0.07M/s free 35.76 ± 0.37M/s, total memory usage 0.01 ± 0.00MiB Summary: per-prod alloc 10.61 ± 0.09M/s free 33.92 ± 0.41M/s, total memory usage 0.00 ± 0.00MiB (2) inlined bpf_kptr_xchg() with lock prefix: $ for i in 1 2 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a; done | grep Summary Summary: per-prod alloc 11.10 ± 0.19M/s free 36.07 ± 0.40M/s, total memory usage 0.00 ± 0.00MiB Summary: per-prod alloc 8.56 ± 0.13M/s free 35.74 ± 0.55M/s, total memory usage 0.01 ± 0.00MiB Summary: per-prod alloc 7.37 ± 0.05M/s free 35.78 ± 0.64M/s, total memory usage 0.02 ± 0.00MiB Summary: per-prod alloc 6.57 ± 0.10M/s free 33.72 ± 0.57M/s, total memory usage 0.07 ± 0.00MiB $ for j in $(seq 3); do ./bench -w3 -d10 bpf_ma -p1 -a; done | grep Summary Summary: per-prod alloc 11.51 ± 0.20M/s free 35.42 ± 0.34M/s, total memory usage 0.00 ± 0.00MiB Summary: per-prod alloc 11.30 ± 0.08M/s free 34.81 ± 0.35M/s, total memory usage 0.00 ± 0.00MiB Summary: per-prod alloc 11.43 ± 0.11M/s free 35.43 ± 0.33M/s, total memory usage 0.00 ± 0.00MiB