Re: [PATCH bpf-next 0/3] Introduce local_storage exclusive caching

On Tue, Apr 19, 2022 at 05:21:40PM -0700, Dave Marchevsky wrote:
> Currently, each local_storage type (sk, inode, task) has a 16-entry
> cache for local_storage data associated with a particular map. A
> local_storage map is assigned a fixed cache_idx when it is allocated.
> When looking in a local_storage for data associated with a map the cache
> entry at cache_idx is the only place the map can appear in cache. If the
> map's data is not in cache it is placed there after a search through the
> cache hlist. When there are >16 local_storage maps allocated for a
> local_storage type, multiple maps have the same cache_idx and thus may
> knock each other out of cache.
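
The lookup described above is roughly the following (a simplified
sketch of the idea, not the exact bpf_local_storage.c code):

	/* fast path: one cached pointer per cache_idx */
	sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
	if (sdata && sdata->smap == smap)
		return sdata;

	/* slow path: walk the single list of all elems attached to this
	 * object, then re-populate cache[smap->cache_idx], evicting
	 * whichever map was cached there before
	 */
	hlist_for_each_entry_rcu(selem, &local_storage->list, snode)
		if (SDATA(selem)->smap == smap)
			return SDATA(selem);

	return NULL;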
> 
> BPF programs that use local_storage may require fast and consistent
> local storage access. For example, a BPF program using task local
> storage to make scheduling decisions would not be able to tolerate a
> long hlist search for its local_storage as this would negatively affect
> cycles available to applications. Providing a mechanism for such a
> program to ensure that its local_storage_data will always be in cache
> would ensure fast access.
> 
> This series introduces a BPF_LOCAL_STORAGE_FORCE_CACHE flag that can be
> set on sk, inode, and task local_storage maps via map_extras. When a map
> with the FORCE_CACHE flag set is allocated it is assigned an 'exclusive'
> cache slot that it can't be evicted from until the map is freed.
> 
> If there are no slots available to exclusively claim, the allocation
> fails. BPF programs are expected to use BPF_LOCAL_STORAGE_FORCE_CACHE
> only if their data _must_ be in cache.
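
If I read the series right, a map would opt in roughly like this
(the flag name is from the cover letter; putting it in map_extra is my
reading of "via map_extras", and sched_data is just a placeholder
value type):

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct sched_data);
	/* ask for an exclusive cache slot for this map */
	__uint(map_extra, BPF_LOCAL_STORAGE_FORCE_CACHE);
} sched_storage SEC(".maps");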

I'm afraid the new uapi flag doesn't solve this problem.
Sooner or later every bpf program will deem itself "important" and
performance critical. All of them will be using the FORCE_CACHE flag
and we will be back to the same situation.

Also please share performance data that shows more than 16 programs
using local storage at the same time, where the existing simple cache
replacement logic is not enough.
For any link list walking to become an issue there have to be at
least 17 progs, and two progs have to pick up the same cache_idx and
run interleaved to evict each other.
That feels like an unlikely scenario, so real data would be good to see.
If it really is an issue we might need different caching logic.
For example, instead of a single link list per local storage we could
have 16 link lists, with cache_idx pointing to a slot.
If an element is not 1st in its slot it will be 2nd in a much shorter
link list. With 16 slots each link list will have at most 2 elements
until more than 32 bpf progs are using local storage (see the sketch
below).
We could also get rid of the cache entirely and replace it with a mini
hash table of N elements where map_id would be an index into the hash
table. All sorts of other algorithms are possible.
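
Roughly (a sketch of the idea; the lists[] field is made up):

	/* 16 short lists instead of cache[16] + one long list;
	 * the map's cache_idx picks its bucket
	 */
	hlist_for_each_entry_rcu(selem,
				 &local_storage->lists[smap->cache_idx],
				 snode)
		if (SDATA(selem)->smap == smap)
			return SDATA(selem);

	return NULL;

The mini hash table variant would be the same walk with the bucket
picked by map_id % N instead of the assigned cache_idx.
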
In any case the bpf user shouldn't be telling the kernel about the
"importance" of its program. If a program is indeed executing a lot,
the kernel should be caching/accelerating it where it can.


