On 4/21/22 9:40 PM, Alexei Starovoitov wrote:
> On Tue, Apr 19, 2022 at 05:21:40PM -0700, Dave Marchevsky wrote:
>> Currently, each local_storage type (sk, inode, task) has a 16-entry
>> cache for local_storage data associated with a particular map. A
>> local_storage map is assigned a fixed cache_idx when it is allocated.
>> When looking in a local_storage for data associated with a map the cache
>> entry at cache_idx is the only place the map can appear in cache. If the
>> map's data is not in cache it is placed there after a search through the
>> cache hlist. When there are >16 local_storage maps allocated for a
>> local_storage type multiple maps have same cache_idx and thus may knock
>> each other out of cache.
>>
>> BPF programs that use local_storage may require fast and consistent
>> local storage access. For example, a BPF program using task local
>> storage to make scheduling decisions would not be able to tolerate a
>> long hlist search for its local_storage as this would negatively affect
>> cycles available to applications. Providing a mechanism for such a
>> program to ensure that its local_storage_data will always be in cache
>> would ensure fast access.
>>
>> This series introduces a BPF_LOCAL_STORAGE_FORCE_CACHE flag that can be
>> set on sk, inode, and task local_storage maps via map_extras. When a map
>> with the FORCE_CACHE flag set is allocated it is assigned an 'exclusive'
>> cache slot that it can't be evicted from until the map is free'd.
>>
>> If there are no slots available to exclusively claim, the allocation
>> fails. BPF programs are expected to use BPF_LOCAL_STORAGE_FORCE_CACHE
>> only if their data _must_ be in cache.
>
> I'm afraid new uapi flag doesn't solve this problem.
> Sooner or later every bpf program would deem itself "important" and
> performance critical. All of them will be using FORCE_CACHE flag
> and we will back to the same situation.

In this scenario, if 16 maps had been loaded with the FORCE_CACHE flag and
a 17th tried to load, it would fail, so programs depending on that map
would fail to load. Patch 2 adds a selftest,
'local_storage_excl_cache_fail', demonstrating this.

> Also please share the performance data that shows more than 16 programs
> that use local storage at the same time and existing simple cache
> replacing logic is not enough.
> For any kind link list walking to become an issue there gotta be at
> least 17 progs. Two progs should pick up the same cache_idx and
> run interleaved to evict each other.
> It feels like unlikely scenario, so real data would be good to see.
> If it really an issue we might need a different caching logic.
> Like instead of single link list per local storage we might
> have 16 link lists. cache_idx can point to a slot.
> If it's not 1st it will be a 2nd in much shorter link list.
> With 16 slots the link lists will have 2 elements until 32 bpf progs
> are using local storage.
> We can get rid of cache too and replace with mini hash table of N
> elements where map_id would be an index into a hash table.
> All sorts of other algorithms are possible.
> In any case the bpf user shouldn't be telling the kernel about
> "importance" of its program. If program is indeed executing a lot
> the kernel should be caching/accelerating it where it can.

It's worth noting that this is a map-level setting, not a prog-level one.
Telling the kernel about the importance of its data feels more palatable
to me - sort of like mmap's MAP_LOCKED, but for the local_storage cache.
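For concreteness, the program-side usage I have in mind is roughly the
sketch below. The map and value type names are placeholders, and it
assumes the flag can be passed through libbpf's existing map_extra
support for BTF-defined maps:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Placeholder value type for the scheduling-decision example */
struct sched_hints {
        __u64 vruntime_hint;
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        /* Claim an exclusive cache slot; map creation fails if none left */
        __uint(map_extra, BPF_LOCAL_STORAGE_FORCE_CACHE);
        __type(key, int);
        __type(value, struct sched_hints);
} sched_hints_map SEC(".maps");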
Going back to the motivating example - using data in task local_storage
to make scheduling decisions - the desire is for the task local_storage
access to feel like "accessing a task_struct member" rather than "doing
a search for the right data to access (with some caching to try to avoid
the search)".

Re: performance data, would adding a benchmark in selftests/bpf/benchs
work?
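If so, I'm picturing something along these lines for the BPF side:
declare enough task_storage maps to exceed the 16 cache slots, hammer
bpf_task_storage_get() from a hot attach point, and have the userspace
side report lookups/sec with and without an exclusive slot. The map
name and attach point below are just illustrative:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* One of N task_storage maps; the real benchmark would declare more
 * than 16 of these so cache_idx collisions actually happen. */
struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, long);
} bench_storage SEC(".maps");

long hits = 0;

SEC("tp/syscalls/sys_enter_getpgid")
int bench_local_storage_get(void *ctx)
{
        struct task_struct *task = bpf_get_current_task_btf();
        long *val;

        /* The lookup being measured: cache hit vs. hlist search */
        val = bpf_task_storage_get(&bench_storage, task, NULL,
                                   BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (val)
                __sync_fetch_and_add(&hits, 1);
        return 0;
}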