Chase Hiltz <chase@xxxxxxxx> writes:

> Hi,
>
> I'm writing regarding a rather bizarre scenario that I'm hoping
> someone could provide insight on. I have a map defined as follows:
> ```
> struct {
>     __uint(type, BPF_MAP_TYPE_LRU_HASH);
>     __uint(max_entries, 1000000);
>     __type(key, struct my_map_key);
>     __type(value, struct my_map_val);
>     __uint(map_flags, BPF_F_NO_COMMON_LRU);
>     __uint(pinning, LIBBPF_PIN_BY_NAME);
> } my_map SEC(".maps");
> ```
> I have several fentry/fexit programs that need to perform updates in
> this map. After a certain number of map entries has been reached,
> calls to bpf_map_update_elem start returning `-ENOMEM`. As one
> example, I'm observing a program deployment where we have 816032
> entries on a 64 CPU machine, and a certain portion of updates are
> failing. I'm puzzled as to why this is occurring given that:
> - The 1M entries should be preallocated upon map creation (since I'm
>   not using `BPF_F_NO_PREALLOC`)
> - The host machine has over 120G of unused memory available at any
>   given time

I hoped that I might be able to help here, given that I wrote the
documentation for BPF_MAP_TYPE_LRU_HASH. Unfortunately the details of
LRU eviction are complex, especially when using BPF_F_NO_COMMON_LRU for
per-cpu LRU lists.

The LRU documentation was updated by Joe Stringer, including a flowchart
which you might find helpful:

https://docs.kernel.org/bpf/map_hash.html

Joe also gave a talk about LRU maps at LPC a couple of years ago which
might give some insight:

https://lpc.events/event/16/contributions/1368/

> I've previously reduced max_entries by 25% under the assumption that
> this would prevent the problem from occurring, but this only caused
> map updates to start failing at a lower threshold. I believe that this
> is a problem with maps using the `BPF_F_NO_COMMON_LRU` flag, my
> reasoning being that when map updates fail, it occurs consistently for
> specific CPUs.
>
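
As a rough illustration of why the threshold might move with
max_entries, here is a small userspace sketch of the per-CPU arithmetic.
It rests on an assumption that should be checked against the kernel
source for the version in use: that with BPF_F_NO_COMMON_LRU the
preallocated elements are split evenly across the possible CPUs and do
not move between the per-CPU lists. The helper names per_cpu_share() and
print_share() are hypothetical, not kernel or libbpf APIs, and the CPU
count of 64 is taken from the report above.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not a kernel or libbpf API.
 * ASSUMPTION: with BPF_F_NO_COMMON_LRU the preallocated elements are
 * divided evenly across the possible CPUs and stay on their own list.
 * Returns 0 when nr_cpus is 0 to avoid dividing by zero.
 */
static unsigned int per_cpu_share(unsigned int max_entries,
                                  unsigned int nr_cpus)
{
    return nr_cpus ? max_entries / nr_cpus : 0;
}

/* Print the estimated per-CPU share for one map configuration. */
static void print_share(unsigned int max_entries, unsigned int nr_cpus)
{
    printf("max_entries=%u cpus=%u -> about %u elements per CPU\n",
           max_entries, nr_cpus, per_cpu_share(max_entries, nr_cpus));
}
```

Under that assumption, print_share(1000000, 64) reports 15625 elements
per CPU and print_share(750000, 64) reports 11718, so shrinking the map
also shrinks each CPU's share. This does not by itself explain the
-ENOMEM, but it may be a useful number to compare against the CPUs where
the updates fail.
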
> At this time, all machines experiencing the problem are running kernel
> version 5.15, however I'm not currently able to try out any newer
> kernels to confirm whether or not the same problem occurs there. Any
> ideas on what could be responsible for this would be greatly
> appreciated!

There have been several updates to the LRU map code since 5.15 so it is
definitely possible that it will behave differently on a 6.x kernel.

> Thanks,
> Chase Hiltz