On Wed, Jun 22, 2022 at 04:00:04PM -0700, Yosry Ahmed wrote:
> Thanks for adding these benchmarks!
>
> On Sat, Jun 4, 2022 at 3:20 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
> >
> > Add benchmarks to demonstrate the performance cliff for local_storage
> > get as the number of local_storage maps increases beyond the current
> > local_storage implementation's cache size.
> >
> > "sequential get" and "interleaved get" benchmarks are added, both of
> > which do many bpf_task_storage_get calls on sets of task local_storage
> > maps of various counts, while considering a single specific map to be
> > 'important' and counting task_storage_gets to the important map
> > separately in addition to the normal 'hits' count of all gets. The
> > goal here is to mimic a scenario where a particular program using one
> > map - the important one - is running on a system where many other
> > local_storage maps exist and are accessed often.
> >
> > While the "sequential get" benchmark does bpf_task_storage_get for
> > maps 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
> > interleaves 4 bpf_task_storage_gets for the important map for every
> > 10 map gets. This is meant to highlight performance differences when
> > the important map is accessed far more frequently than non-important
> > maps.
> >
> > A "hashmap control" benchmark is also included for easy comparison of
> > standard bpf hashmap lookup vs local_storage get. The benchmark is
> > similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> > instead of local storage. Only one inner map is created - a hashmap
> > meant to hold a tid -> data mapping for all tasks. The size of the
> > hashmap is hardcoded to my system's PID_MAX_LIMIT (4,194,304). The
> > number of these keys which are actually fetched as part of the
> > benchmark is configurable.
> >
> > The addition of this benchmark is inspired by a conversation with
> > Alexei in a previous patchset's thread [0], which highlighted the
> > need for such a benchmark to motivate and validate improvements to
> > the local_storage implementation. My approach in that series focused
> > on improving performance for explicitly-marked 'important' maps and
> > was rejected with feedback to make more generally-applicable
> > improvements while avoiding explicitly marking maps as important.
> > Thus the benchmark reports both general and important-map-focused
> > metrics, so the effect of future work on both is clear.
>
> The current implementation falls back to a list traversal of
> bpf_local_storage->list when there is a cache miss. I wonder if this
> is a place with room for optimization? Maybe a hash table or a tree
> would be a more performant alternative?

With a benchmark to ensure it won't regress the common use cases and
gauge the potential improvement, it is probably something Dave is
planning to explore next, if I read the last thread correctly.

How many task storages are in your production use cases?
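
For readers who have not seen the selftest, a minimal sketch of the
"sequential get" access pattern described in the commit message might
look like the following. This is not the code from the patch: the map
names, the hit counters, and the x86-64-specific getpgid attach point
are illustrative assumptions; only the overall shape (one
bpf_task_storage_get per map, with the 'important' map counted
separately) comes from the description above.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* One of N task local_storage maps; the benchmark scales N to
 * 10/100/1000. Declarations for further maps are identical. */
struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} storage0 SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} storage1 SEC(".maps");

long hits;           /* successful gets across all maps */
long important_hits; /* gets on the single 'important' map */

static __always_inline void do_get(void *map, bool important)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *val;

	val = bpf_task_storage_get(map, task, 0,
				   BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (val) {
		__sync_fetch_and_add(&hits, 1);
		if (important)
			__sync_fetch_and_add(&important_hits, 1);
	}
}

SEC("fentry/__x64_sys_getpgid")
int BPF_PROG(seq_get)
{
	/* "sequential get": touch map 0 (the important one), then the
	 * rest in order. "interleaved get" would additionally revisit
	 * storage0 4 times for every 10 of these gets. */
	do_get(&storage0, true);
	do_get(&storage1, false);
	return 0;
}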
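The "hashmap control" variant replaces the per-task storages with a
single hashmap. A hedged sketch of its shape (the map name and value
type are assumptions; the tid keying and PID_MAX_LIMIT sizing come from
the commit message):

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4194304); /* PID_MAX_LIMIT on the author's system */
	__type(key, int);             /* tid */
	__type(value, long);
} hashmap_control SEC(".maps");

/* The control benchmark then does bpf_map_lookup_elem(&hashmap_control,
 * &tid) where the local_storage variants do bpf_task_storage_get(). */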
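For context on the cache-miss path Yosry is asking about, the lookup in
kernel/bpf/bpf_local_storage.c was, roughly, the following. This is a
simplified paraphrase from memory, not verbatim kernel code:

/* Sketch of bpf_local_storage_lookup(): a small fixed-size per-task
 * cache (16 slots) is consulted first; on a miss, the lookup walks
 * every element on bpf_local_storage->list, which is the O(n) walk the
 * benchmarks above stress as the map count grows. */
sdata = rcu_dereference(local_storage->cache[smap->cache_idx]);
if (sdata && rcu_access_pointer(sdata->smap) == smap)
	return sdata; /* cache hit: O(1) */

/* cache miss: linear in the number of storages attached to the task */
hlist_for_each_entry_rcu(selem, &local_storage->list, snode)
	if (rcu_access_pointer(SDATA(selem)->smap) == smap)
		break;

A hash table or tree keyed by the map, as suggested, would target
exactly this walk.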