Hi,

On 5/17/2024 9:52 PM, Chase Hiltz wrote:
> Hi,
>
> Thanks for the replies.
>
>> Joe also gave a talk about LRU maps at LPC a couple of years ago which
>> might give some insight:
> Thanks, this was very helpful in understanding how LRU eviction works!
> I definitely think it's related to high levels of contention on
> individual machines causing LRU eviction to fail, given that I'm only
> seeing it occur for those which consistently process the most packets.
>
>> There have been several updates to the LRU map code since 5.15, so it is
>> definitely possible that it will behave differently on a 6.x kernel.
> I've compared the implementation between 5.15 and 6.5 (what I would
> consider as a potential upgrade) and observed no more than a few
> refactoring changes, but of course it's possible that I missed
> something.
>
>> In order to reduce the possibility of ENOMEM errors, the right
>> way is to increase the value of max_entries instead of decreasing it.
> Yes, I now see the error of my ways in thinking that reducing it would
> help at all when it actually hurts. For the time being, I'm going to
> do this as a temporary remediation.

Is there a special reason why BPF_F_NO_COMMON_LRU is used? If the flag
is kept, there is a sketch for sizing per-CPU LRU maps at the end of
this mail.

>> Does the specific CPU always fail afterwards, or does it fail
>> periodically? Is the machine running the bpf program an arm64 host or
>> an x86-64 host (namely uname -a)? I suspect that the problem may be due
>> to htab_lock_bucket(), which may fail on an arm64 host in v5.15.
> It always fails afterwards. I'm doing RSS, and we notice this problem
> occurring back-to-back for specific source-destination pairs (because
> they always land on the same queue). This is a 64-bit system:
> ```
> $ uname -a
> 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64
> x86_64 x86_64 GNU/Linux
> ```

It is an x86-64 host, so my previous guess is wrong.

>> Could you please check and account the ratio of times when
>> htab_lru_map_delete_node() returns 0? If the ratio is high, it probably
>> means that there are too many overwrites of entries between different
>> CPUs (e.g., CPU 0 updates key=X, then CPU 1 updates the same key again).
> I'm not aware of any way to get that information, if you have any
> pointers I'd be happy to check this.

Please install bpftrace on the host first, then run the following
one-line script on the host once bpf_map_update_elem() starts to return
-ENOMEM:

# sudo bpftrace -e 'kr:htab_lru_map_delete_node { if (retval == 0) { @lock[cpu] = count(); } else { @del[retval & 0xff, cpu] = count(); } } i:s:10 { exit(); }'

The script accounts the return value of htab_lru_map_delete_node():

(1) if htab_lock_bucket() fails, retval will be 0, so the case is
accounted in the @lock map
(2) if the target node is found in the hash list, the lowest byte of
retval will be 1, otherwise it will be 0. These returns are accounted
in the @del map.

The snippet 'i:s:10 { exit(); }' is used to terminate the script after
10 seconds. You could shorten the interval if there is too much
accounting output. A small BPF-side counter sketch for cross-checking
these numbers is also at the end of this mail.

The following is the output from my local developer environment:

# bpftrace -e 'kr:htab_lru_map_delete_node { if (retval == 0) { @lock[cpu] = count(); } else { @del[retval & 0xff, cpu] = count(); } } i:s:10 { exit(); }'
Attaching 2 probes...
@del[0, 3]: 4822
@del[0, 6]: 5656
@del[0, 2]: 5995
@del[0, 4]: 8652
@del[0, 1]: 24722
@del[0, 5]: 25146
@del[0, 0]: 36137
@del[0, 7]: 38254
@del[1, 3]: 162054
@del[1, 4]: 208696
@del[1, 6]: 245960
@del[1, 2]: 267437
@del[1, 5]: 533654
@del[1, 1]: 548974
@del[1, 7]: 618810
@del[1, 0]: 619459

> On Thu, 16 May 2024 at 07:29, Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> +cc bpf list
>>
>> On 5/6/2024 11:19 PM, Chase Hiltz wrote:
>>> Hi,
>>>
>>> I'm writing regarding a rather bizarre scenario that I'm hoping
>>> someone could provide insight on. I have a map defined as follows:
>>> ```
>>> struct {
>>>     __uint(type, BPF_MAP_TYPE_LRU_HASH);
>>>     __uint(max_entries, 1000000);
>>>     __type(key, struct my_map_key);
>>>     __type(value, struct my_map_val);
>>>     __uint(map_flags, BPF_F_NO_COMMON_LRU);
>>>     __uint(pinning, LIBBPF_PIN_BY_NAME);
>>> } my_map SEC(".maps");
>>> ```
>>> I have several fentry/fexit programs that need to perform updates in
>>> this map. After a certain number of map entries has been reached,
>>> calls to bpf_map_update_elem start returning `-ENOMEM`. As one
>>> example, I'm observing a program deployment where we have 816032
>>> entries on a 64 CPU machine, and a certain portion of updates are
>>> failing. I'm puzzled as to why this is occurring given that:
>>> - The 1M entries should be preallocated upon map creation (since I'm
>>> not using `BPF_F_NO_PREALLOC`)
>>> - The host machine has over 120G of unused memory available at any given time
>>>
>>> I've previously reduced max_entries by 25% under the assumption that
>>> this would prevent the problem from occurring, but this only caused
>> For an LRU map with BPF_F_NO_COMMON_LRU, the number of entries is
>> distributed evenly between all CPUs. For your case, each CPU will have
>> 1M/64 = 15625 entries. In order to reduce the possibility of ENOMEM
>> errors, the right way is to increase the value of max_entries instead
>> of decreasing it.
>>> map updates to start failing at a lower threshold. I believe that this
>>> is a problem with maps using the `BPF_F_NO_COMMON_LRU` flag, my
>>> reasoning being that when map updates fail, it occurs consistently for
>>> specific CPUs.
>> Does the specific CPU always fail afterwards, or does it fail
>> periodically? Is the machine running the bpf program an arm64 host or
>> an x86-64 host (namely uname -a)? I suspect that the problem may be due
>> to htab_lock_bucket(), which may fail on an arm64 host in v5.15.
>>
>> Could you please check and account the ratio of times when
>> htab_lru_map_delete_node() returns 0? If the ratio is high, it probably
>> means that there are too many overwrites of entries between different
>> CPUs (e.g., CPU 0 updates key=X, then CPU 1 updates the same key again).
>>> At this time, all machines experiencing the problem are running kernel
>>> version 5.15; however, I'm not currently able to try out any newer
>>> kernels to confirm whether or not the same problem occurs there. Any
>>> ideas on what could be responsible for this would be greatly
>>> appreciated!
>>>
>>> Thanks,
>>> Chase Hiltz
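
On the sizing question mentioned above: with BPF_F_NO_COMMON_LRU the
free entries are split evenly across the per-CPU LRU lists, so one way
to keep the intended per-CPU capacity is to scale max_entries by the
number of possible CPUs before the map is created. The following is
only a minimal sketch, assuming a libbpf-based loader; the function
name and the entries_per_cpu parameter are made up, and only the map
name "my_map" comes from the definition quoted above:

```
/*
 * Sketch only: scale a BPF_F_NO_COMMON_LRU map so that every per-CPU
 * LRU list gets the intended number of entries.
 */
#include <errno.h>
#include <bpf/libbpf.h>

int size_per_cpu_lru(struct bpf_object *obj, unsigned int entries_per_cpu)
{
	struct bpf_map *map;
	int nr_cpus = libbpf_num_possible_cpus();

	if (nr_cpus < 0)
		return nr_cpus;

	map = bpf_object__find_map_by_name(obj, "my_map");
	if (!map)
		return -ENOENT;

	/* must be called before bpf_object__load() creates the map */
	return bpf_map__set_max_entries(map, entries_per_cpu * nr_cpus);
}
```

Note that bpf_map__set_max_entries() has to run before
bpf_object__load(), and because the map is pinned by name, the old
pinned map has to be removed for the new size to take effect.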
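
To cross-check the bpftrace numbers against the failures seen by the
programs themselves, a per-CPU error counter on the BPF side may also
help. This is a hypothetical sketch, not part of your existing
programs; the wrapper and the map name lru_update_errors are made up:

```
/*
 * Hypothetical helper for the fentry/fexit programs: count per CPU how
 * often updating the LRU map fails, so the failure rate can be lined up
 * with the @lock/@del counts from the bpftrace script.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} lru_update_errors SEC(".maps");

static __always_inline long update_lru(void *lru_map, void *key, void *val)
{
	__u32 zero = 0;
	__u64 *cnt;
	long err;

	err = bpf_map_update_elem(lru_map, key, val, BPF_ANY);
	if (err) {
		cnt = bpf_map_lookup_elem(&lru_update_errors, &zero);
		if (cnt)
			__sync_fetch_and_add(cnt, 1);
	}
	return err;
}
```

Dumping the counter with "bpftool map dump" then gives a per-CPU
failure count that can be compared with the per-CPU @del/@lock output
above.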