On 4/21/22 9:40 PM, Alexei Starovoitov wrote:
> On Tue, Apr 19, 2022 at 05:21:40PM -0700, Dave Marchevsky wrote:
>> Currently, each local_storage type (sk, inode, task) has a 16-entry
>> cache for local_storage data associated with a particular map. A
>> local_storage map is assigned a fixed cache_idx when it is allocated.
>> When looking in a local_storage for data associated with a map the cache
>> entry at cache_idx is the only place the map can appear in cache. If the
>> map's data is not in cache it is placed there after a search through the
>> cache hlist. When there are >16 local_storage maps allocated for a
>> local_storage type multiple maps have same cache_idx and thus may knock
>> each other out of cache.
>>
>> BPF programs that use local_storage may require fast and consistent
>> local storage access. For example, a BPF program using task local
>> storage to make scheduling decisions would not be able to tolerate a
>> long hlist search for its local_storage as this would negatively affect
>> cycles available to applications. Providing a mechanism for such a
>> program to ensure that its local_storage_data will always be in cache
>> would ensure fast access.
>>
>> This series introduces a BPF_LOCAL_STORAGE_FORCE_CACHE flag that can be
>> set on sk, inode, and task local_storage maps via map_extras. When a map
>> with the FORCE_CACHE flag set is allocated it is assigned an 'exclusive'
>> cache slot that it can't be evicted from until the map is free'd.
>>
>> If there are no slots available to exclusively claim, the allocation
>> fails. BPF programs are expected to use BPF_LOCAL_STORAGE_FORCE_CACHE
>> only if their data _must_ be in cache.
>
> I'm afraid new uapi flag doesn't solve this problem.
> Sooner or later every bpf program would deem itself "important" and
> performance critical. All of them will be using FORCE_CACHE flag
> and we will back to the same situation.

In this scenario, if 16 maps had been loaded with the FORCE_CACHE flag and
a 17th tried to load, it would fail, so programs depending on that map
would fail to load. Patch 2 adds a selftest,
'local_storage_excl_cache_fail', demonstrating this.

> Also please share the performance data that shows more than 16 programs
> that use local storage at the same time and existing simple cache
> replacing logic is not enough.
> For any kind link list walking to become an issue there gotta be at
> least 17 progs. Two progs should pick up the same cache_idx and
> run interleaved to evict each other.
> It feels like unlikely scenario, so real data would be good to see.
> If it really an issue we might need a different caching logic.
> Like instead of single link list per local storage we might
> have 16 link lists. cache_idx can point to a slot.
> If it's not 1st it will be a 2nd in much shorter link list.
> With 16 slots the link lists will have 2 elements until 32 bpf progs
> are using local storage.
> We can get rid of cache too and replace with mini hash table of N
> elements where map_id would be an index into a hash table.
> All sorts of other algorithms are possible.
> In any case the bpf user shouldn't be telling the kernel about
> "importance" of its program. If program is indeed executing a lot
> the kernel should be caching/accelerating it where it can.

It's worth noting that this is a map-level setting, not a prog-level one.
Telling the kernel about the importance of its data feels more palatable
to me - sort of like mmap's MAP_LOCKED, but for the local_storage cache.
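For concreteness, the program-side usage I have in mind is roughly the
sketch below. The map and value type names are placeholders, and it
assumes the flag can be passed through libbpf's existing map_extra
support for BTF-defined maps:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Placeholder value type for the scheduling-decision example */
struct sched_hints {
        __u64 vruntime_hint;
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        /* Claim an exclusive cache slot; map creation fails if none left */
        __uint(map_extra, BPF_LOCAL_STORAGE_FORCE_CACHE);
        __type(key, int);
        __type(value, struct sched_hints);
} sched_hints_map SEC(".maps");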
Going back to the motivating example - using data in task local_storage
to make scheduling decisions - the desire is for the task local_storage
access to feel like "accessing a task_struct member" rather than "doing
a search for the right data to access (with some caching to try to avoid
the search)".

Re: performance data, would adding a benchmark in selftests/bpf/benchs
work?
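If so, I'm picturing something along these lines for the BPF side:
declare enough task_storage maps to exceed the 16 cache slots, hammer
bpf_task_storage_get() from a hot attach point, and have the userspace
side report lookups/sec with and without an exclusive slot. The map
name and attach point below are just illustrative:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* One of N task_storage maps; the real benchmark would declare more
 * than 16 of these so cache_idx collisions actually happen. */
struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, long);
} bench_storage SEC(".maps");

long hits = 0;

SEC("tp/syscalls/sys_enter_getpgid")
int bench_local_storage_get(void *ctx)
{
        struct task_struct *task = bpf_get_current_task_btf();
        long *val;

        /* The lookup being measured: cache hit vs. hlist search */
        val = bpf_task_storage_get(&bench_storage, task, NULL,
                                   BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (val)
                __sync_fetch_and_add(&hits, 1);
        return 0;
}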