Re: [PATCH bpf-next 2/3] bpf, x86: Don't generate lock prefix for BPF_XCHG

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Sat, 23 Dec 2023 12:01:04 +0800

Hi,

On 12/21/2023 7:37 PM, Hou Tao wrote:
> Hi,
>
> On 12/21/2023 2:46 AM, Daniel Borkmann wrote:
>> On 12/20/23 7:25 PM, Alexei Starovoitov wrote:
>>> On Wed, Dec 20, 2023 at 6:58 AM Daniel Borkmann
>>> <daniel@xxxxxxxxxxxxx> wrote:
>>>> On 12/19/23 2:56 PM, Hou Tao wrote:
>>>>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>>>>
>>>>> According to the implementation of atomic_xchg() under x86-64, the
>>>>> lock
>>>>> prefix is not necessary for BPF_XCHG atomic operation, so just remove
>>>>> it.
>>>> It's probably a good idea for the commit message to explicitly quote
>>>> the
>>>> Intel docs in here, so it's easier to find on why the lock prefix would
>>>> not be needed for the xchg op.
>>> It's a surprise to me as well.
>>> Definitely more info would be good.
>>>
>>> Also if xchg insn without lock somehow implies lock in the HW
>>> what is the harm of adding it explicitly?
>>> If it's a lock in HW than performance with and without lock prefix
>>> should be the same, right?
>> e.g. 7.3.1.2 Exchange Instructions says:
>>
>>   The XCHG (exchange) instruction swaps the contents of two operands.
>> This
>>   instruction takes the place of three MOV instructions and does not
>> require
>>   a temporary location to save the contents of one operand location while
>>   the other is being loaded. When a memory operand is used with the XCHG
>>   instruction, the processor’s LOCK signal is automatically asserted.
>>
>> Also curious if there is any harm adding it explicitly.
>>
>> .
> I could use the bpf ma benchmark to test it, but I doubt it will make
> any visible difference.
>
> .

The following is the performance comparison between inlined-kptr-xchg()
without and with lock prefix. It seems for helper without-lock-prefix
the peak free performance is indeed larger than the helper with lock
prefix, but it is so tiny that it may be just fluctuations. So I will
keep the JIT of BPF_XCHG() as-is.

(1) inlined bpf_kptr_xchg() without lock prefix:

$ for i in 1 2 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a; done | grep
Summary
Summary: per-prod alloc   11.50 ± 0.25M/s free   37.52 ± 0.59M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc    8.52 ± 0.14M/s free   34.73 ± 0.46M/s, total
memory usage    0.01 ± 0.00MiB
Summary: per-prod alloc    7.40 ± 0.09M/s free   36.09 ± 0.65M/s, total
memory usage    0.02 ± 0.00MiB
Summary: per-prod alloc    6.62 ± 0.16M/s free   36.48 ± 2.06M/s, total
memory usage    0.07 ± 0.00MiB

$ for j in $(seq 3); do ./bench -w3 -d10 bpf_ma -p1 -a; done | grep Summary
Summary: per-prod alloc   10.81 ± 0.12M/s free   36.78 ± 0.35M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc   10.77 ± 0.07M/s free   35.76 ± 0.37M/s, total
memory usage    0.01 ± 0.00MiB
Summary: per-prod alloc   10.61 ± 0.09M/s free   33.92 ± 0.41M/s, total
memory usage    0.00 ± 0.00MiB

(2) inlined bpf_kptr_xchg() with lock prefix:

$ for i in 1 2 4 8; do ./bench -w3 -d10 bpf_ma -p${i} -a; done | grep
Summary
Summary: per-prod alloc   11.10 ± 0.19M/s free   36.07 ± 0.40M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc    8.56 ± 0.13M/s free   35.74 ± 0.55M/s, total
memory usage    0.01 ± 0.00MiB
Summary: per-prod alloc    7.37 ± 0.05M/s free   35.78 ± 0.64M/s, total
memory usage    0.02 ± 0.00MiB
Summary: per-prod alloc    6.57 ± 0.10M/s free   33.72 ± 0.57M/s, total
memory usage    0.07 ± 0.00MiB

$ for j in $(seq 3); do ./bench -w3 -d10 bpf_ma -p1 -a; done | grep Summary
Summary: per-prod alloc   11.51 ± 0.20M/s free   35.42 ± 0.34M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc   11.30 ± 0.08M/s free   34.81 ± 0.35M/s, total
memory usage    0.00 ± 0.00MiB
Summary: per-prod alloc   11.43 ± 0.11M/s free   35.43 ± 0.33M/s, total
memory usage    0.00 ± 0.00MiB