Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure

Hi Alexey, hi Martin,

On Thu, Jun 01, 2023 at 11:24:20AM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@xxxxxxxxxxxxx> wrote:
> > >
> > > LRU logic doesn't kick in until the map is full.
> >
> > In fact, it can: a reproducible example is in the self-test from this patch
> > series. In the test N threads try to insert random values for keys 1..3000
> > simultaneously. As a result, the map may contain any number of elements,
> > typically 100 to 1000 (never the full 3000, which is also less than the map
> > size). So a user can't really even closely estimate the number of elements
> > in the LRU map based on the number of updates (with unique keys). A per-cpu
> > counter inc/dec'ed from the kernel side would solve this.
> 
> That's odd and unexpected.
> Definitely something to investigate and fix in the LRU map.
> 
> Pls cc Martin in the future.

I've looked into this a bit more and the problem is as follows.

LRU maps preallocate MAX_ENTRIES elements and put them on a global free list.
Each CPU then refills its own local free list from the global one in chunks of
128 elements.
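
For reference, this is roughly how the preallocated elements end up
distributed (a simplified picture from my reading of the LRU code, not the
actual data structures; the 128 is LOCAL_FREE_TARGET there):

/* Simplified view of where the preallocated elements live:
 *
 *   global free list:    all max_entries elements at map creation time
 *   per-CPU local lists: each CPU pulls up to LOCAL_FREE_TARGET (128)
 *                        elements at a time for its own use, so up to
 *                        NCPUS * LOCAL_FREE_TARGET elements can sit
 *                        parked there, invisible to the other CPUs
 */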

The user expectation is that evictions start when the map is full; in
practice, however, we start evicting elements once the map holds about
(MAX_ENTRIES - NCPUS*128) elements. This happens because when one CPU has
exhausted its local free list, it falls back to the global one. Even though up
to (NCPUS-1)*128 free elements may still sit in the local free lists of other
CPUs, our CPU finds the global free list empty and starts evicting elements
from the active/inactive lists (a 128-element chunk at a time). The same can
then happen on another active CPU, and so on.
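
The allocation path, as I understand it, looks roughly like this (paraphrased
from kernel/bpf/bpf_lru_list.c with made-up helper names, so don't take the
exact calls literally):

/* What a CPU does when it needs a free element for an update: */
node = local_free_list_pop(this_cpu);           /* 1) own local list */
if (!node) {
        /*
         * 2) refill the local list from the global free list in a
         *    128-element chunk; if the global list cannot provide a
         *    full chunk, evict nodes from the global active/inactive
         *    lists to make up the difference -- even though other CPUs
         *    may still hold up to (NCPUS - 1) * 128 unused elements in
         *    their local free lists
         */
        refill_local_from_global_or_evict(this_cpu);
        node = local_free_list_pop(this_cpu);
}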

This is not much of a problem for big maps, where NCPUS*128 is a small
percentage of the total map capacity. For smaller maps it may be unexpected (I
first noticed it on a 4K map where, after updating 4K keys, the map contained
only about 200-300 elements).
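
To put rough numbers on it (illustrative figures, assuming for example a
32-CPU machine, which would also explain the 4K case above):

  NCPUS * 128 = 32 * 128 = 4096 elements can be parked in local free lists

  4K-entry map:  the local lists alone can absorb the whole map, so only
                 a few hundred elements stay live
  1M-entry map:  4096 / 1048576 is ~0.4%, barely noticeable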

My first attempt to fix this was simply to increase the number of entries
allocated for the map by NCPUS*128, which makes evictions start at
MAX_ENTRIES. But I soon realized that this way users can end up with more than
MAX_ENTRIES elements in the map, which is unexpected as well (say, when
dumping the map into a buffer sized for MAX_ENTRIES entries, or when syncing
entries with another map of MAX_ENTRIES capacity).
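
Roughly, that first attempt was the following (a sketch of the idea, not the
actual patch):

/*
 * Over-provision the preallocated pool so that elements parked in the
 * per-CPU local free lists no longer eat into the user-visible
 * capacity.  Downside: the hash table can now hold more than
 * max_entries live elements at once.
 */
num_entries = htab->map.max_entries +
              num_possible_cpus() * LOCAL_FREE_TARGET;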

I also briefly looked into allowing prealloc_lru_pop() to be called under a
bucket lock, by passing the currently locked bucket to it so that the pointer
is carried all the way down to htab_lru_map_delete_node(), which can then skip
taking the bucket lock if it is the same bucket. This seems to work, but I
didn't have time to understand whether it breaks the LRU architecture badly or
not.
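
Very roughly, the idea is the following (an untested sketch; the extra
parameter and the way it would be plumbed through the LRU callbacks are
assumptions, not an actual patch):

/*
 * htab_lru_map_delete_node() learns which bucket the caller already
 * holds and only takes the bucket lock if it is a different one.
 */
static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node,
                                     struct bucket *locked_b /* new */)
{
        struct bpf_htab *htab = arg;
        struct htab_elem *l = container_of(node, struct htab_elem, lru_node);
        struct bucket *b = __select_bucket(htab, l->hash);
        unsigned long flags = 0;
        bool same_bucket = (b == locked_b);

        if (!same_bucket && htab_lock_bucket(htab, b, l->hash, &flags))
                return false;

        /* ... find the element in the bucket and unlink it, as today ... */

        if (!same_bucket)
                htab_unlock_bucket(htab, b, l->hash, flags);

        /* simplified: the real function reports whether the element was
         * actually found and unlinked
         */
        return true;
}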



