Hi,

On 5/17/2024 9:52 PM, Chase Hiltz wrote:
> Hi,
>
> Thanks for the replies.
>
>> Joe also gave a talk about LRU maps at LPC a couple of years ago which
>> might give some insight:
> Thanks, this was very helpful in understanding how LRU eviction works!
> I definitely think it's related to high levels of contention on
> individual machines causing LRU eviction to fail, given that I'm only
> seeing it occur for those which consistently process the most packets.
>
>> There have been several updates to the LRU map code since 5.15, so it is
>> definitely possible that it will behave differently on a 6.x kernel.
> I've compared the implementation between 5.15 and 6.5 (what I would
> consider as a potential upgrade) and observed no more than a few
> refactoring changes, but of course it's possible that I missed
> something.
>
>> In order to reduce the possibility of ENOMEM errors, the right
>> way is to increase the value of max_entries instead of decreasing it.
> Yes, I now see the error of my ways in thinking that reducing it would
> help at all when it actually hurts. For the time being, I'm going to
> do this as a temporary remediation.

Is there a special reason why BPF_F_NO_COMMON_LRU is used? If the flag
is kept, there is a sketch for sizing per-CPU LRU maps at the end of
this mail.

>> Does the specific CPU always fail afterwards, or does it fail
>> periodically? Is the machine running the bpf program an arm64 host or
>> an x86-64 host (namely uname -a)? I suspect that the problem may be due
>> to htab_lock_bucket(), which may fail on an arm64 host in v5.15.
> It always fails afterwards. I'm doing RSS, and we notice this problem
> occurring back-to-back for specific source-destination pairs (because
> they always land on the same queue). This is a 64-bit system:
> ```
> $ uname -a
> 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64
> x86_64 x86_64 GNU/Linux
> ```

It is an x86-64 host, so my previous guess is wrong.

>> Could you please check and account the ratio of times when
>> htab_lru_map_delete_node() returns 0? If the ratio is high, it probably
>> means that there are too many overwrites of entries between different
>> CPUs (e.g., CPU 0 updates key=X, then CPU 1 updates the same key again).
> I'm not aware of any way to get that information, if you have any
> pointers I'd be happy to check this.

Please install bpftrace on the host first, then run the following
one-line script on the host once bpf_map_update_elem() starts to return
-ENOMEM:

# sudo bpftrace -e 'kr:htab_lru_map_delete_node { if (retval == 0) { @lock[cpu] = count(); } else { @del[retval & 0xff, cpu] = count(); } } i:s:10 { exit(); }'

The script accounts the return value of htab_lru_map_delete_node():

(1) if htab_lock_bucket() fails, retval will be 0, so the case is
accounted in the @lock map
(2) if the target node is found in the hash list, the lowest byte of
retval will be 1, otherwise it will be 0. These returns are accounted
in the @del map.

The snippet 'i:s:10 { exit(); }' is used to terminate the script after
10 seconds. You could shorten the interval if there is too much
accounting output. A small BPF-side counter sketch for cross-checking
these numbers is also at the end of this mail.

The following is the output from my local developer environment:

# bpftrace -e 'kr:htab_lru_map_delete_node { if (retval == 0) { @lock[cpu] = count(); } else { @del[retval & 0xff, cpu] = count(); } } i:s:10 { exit(); }'
Attaching 2 probes...
@del[0, 3]: 4822
@del[0, 6]: 5656
@del[0, 2]: 5995
@del[0, 4]: 8652
@del[0, 1]: 24722
@del[0, 5]: 25146
@del[0, 0]: 36137
@del[0, 7]: 38254
@del[1, 3]: 162054
@del[1, 4]: 208696
@del[1, 6]: 245960
@del[1, 2]: 267437
@del[1, 5]: 533654
@del[1, 1]: 548974
@del[1, 7]: 618810
@del[1, 0]: 619459

> On Thu, 16 May 2024 at 07:29, Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> +cc bpf list
>>
>> On 5/6/2024 11:19 PM, Chase Hiltz wrote:
>>> Hi,
>>>
>>> I'm writing regarding a rather bizarre scenario that I'm hoping
>>> someone could provide insight on. I have a map defined as follows:
>>> ```
>>> struct {
>>>     __uint(type, BPF_MAP_TYPE_LRU_HASH);
>>>     __uint(max_entries, 1000000);
>>>     __type(key, struct my_map_key);
>>>     __type(value, struct my_map_val);
>>>     __uint(map_flags, BPF_F_NO_COMMON_LRU);
>>>     __uint(pinning, LIBBPF_PIN_BY_NAME);
>>> } my_map SEC(".maps");
>>> ```
>>> I have several fentry/fexit programs that need to perform updates in
>>> this map. After a certain number of map entries has been reached,
>>> calls to bpf_map_update_elem start returning `-ENOMEM`. As one
>>> example, I'm observing a program deployment where we have 816032
>>> entries on a 64 CPU machine, and a certain portion of updates are
>>> failing. I'm puzzled as to why this is occurring given that:
>>> - The 1M entries should be preallocated upon map creation (since I'm
>>> not using `BPF_F_NO_PREALLOC`)
>>> - The host machine has over 120G of unused memory available at any given time
>>>
>>> I've previously reduced max_entries by 25% under the assumption that
>>> this would prevent the problem from occurring, but this only caused
>> For an LRU map with BPF_F_NO_COMMON_LRU, the number of entries is
>> distributed evenly between all CPUs. For your case, each CPU will have
>> 1M/64 = 15625 entries. In order to reduce the possibility of ENOMEM
>> errors, the right way is to increase the value of max_entries instead
>> of decreasing it.
>>> map updates to start failing at a lower threshold. I believe that this
>>> is a problem with maps using the `BPF_F_NO_COMMON_LRU` flag, my
>>> reasoning being that when map updates fail, it occurs consistently for
>>> specific CPUs.
>> Does the specific CPU always fail afterwards, or does it fail
>> periodically? Is the machine running the bpf program an arm64 host or
>> an x86-64 host (namely uname -a)? I suspect that the problem may be due
>> to htab_lock_bucket(), which may fail on an arm64 host in v5.15.
>>
>> Could you please check and account the ratio of times when
>> htab_lru_map_delete_node() returns 0? If the ratio is high, it probably
>> means that there are too many overwrites of entries between different
>> CPUs (e.g., CPU 0 updates key=X, then CPU 1 updates the same key again).
>>> At this time, all machines experiencing the problem are running kernel
>>> version 5.15; however, I'm not currently able to try out any newer
>>> kernels to confirm whether or not the same problem occurs there. Any
>>> ideas on what could be responsible for this would be greatly
>>> appreciated!
>>>
>>> Thanks,
>>> Chase Hiltz
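
On the sizing question mentioned above: with BPF_F_NO_COMMON_LRU the
free entries are split evenly across the per-CPU LRU lists, so one way
to keep the intended per-CPU capacity is to scale max_entries by the
number of possible CPUs before the map is created. The following is
only a minimal sketch, assuming a libbpf-based loader; the function
name and the entries_per_cpu parameter are made up, and only the map
name "my_map" comes from the definition quoted above:

```
/*
 * Sketch only: scale a BPF_F_NO_COMMON_LRU map so that every per-CPU
 * LRU list gets the intended number of entries.
 */
#include <errno.h>
#include <bpf/libbpf.h>

int size_per_cpu_lru(struct bpf_object *obj, unsigned int entries_per_cpu)
{
	struct bpf_map *map;
	int nr_cpus = libbpf_num_possible_cpus();

	if (nr_cpus < 0)
		return nr_cpus;

	map = bpf_object__find_map_by_name(obj, "my_map");
	if (!map)
		return -ENOENT;

	/* must be called before bpf_object__load() creates the map */
	return bpf_map__set_max_entries(map, entries_per_cpu * nr_cpus);
}
```

Note that bpf_map__set_max_entries() has to run before
bpf_object__load(), and because the map is pinned by name, the old
pinned map has to be removed for the new size to take effect.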
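
To cross-check the bpftrace numbers against the failures seen by the
programs themselves, a per-CPU error counter on the BPF side may also
help. This is a hypothetical sketch, not part of your existing
programs; the wrapper and the map name lru_update_errors are made up:

```
/*
 * Hypothetical helper for the fentry/fexit programs: count per CPU how
 * often updating the LRU map fails, so the failure rate can be lined up
 * with the @lock/@del counts from the bpftrace script.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} lru_update_errors SEC(".maps");

static __always_inline long update_lru(void *lru_map, void *key, void *val)
{
	__u32 zero = 0;
	__u64 *cnt;
	long err;

	err = bpf_map_update_elem(lru_map, key, val, BPF_ANY);
	if (err) {
		cnt = bpf_map_lookup_elem(&lru_update_errors, &zero);
		if (cnt)
			__sync_fetch_and_add(cnt, 1);
	}
	return err;
}
```

Dumping the counter with "bpftool map dump" then gives a per-CPU
failure count that can be compared with the per-CPU @del/@lock output
above.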