Re: [PATCH v6 bpf-next] selftests/bpf: Add benchmark for local_storage get

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 22 Jun 2022 19:18:15 -0700

On Wed, Jun 22, 2022 at 6:26 PM John Fastabend <john.fastabend@xxxxxxxxx> wrote:
>
> Martin KaFai Lau wrote:
> > On Tue, Jun 21, 2022 at 10:49:46PM -0700, John Fastabend wrote:
> > > Martin KaFai Lau wrote:
> > > > On Tue, Jun 21, 2022 at 12:17:54PM -0700, John Fastabend wrote:
> > > > > > Hashmap Control
> > > > > > ===============
> > > > > >         num keys: 10
> > > > > > hashmap (control) sequential    get:  hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
> > > > > >
> > > > > >         num keys: 1000
> > > > > > hashmap (control) sequential    get:  hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
> > > > > >
> > > > > >         num keys: 10000
> > > > > > hashmap (control) sequential    get:  hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
> > > > > >
> > > > > >         num keys: 100000
> > > > > > hashmap (control) sequential    get:  hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
> > > > > >
> > > > > >         num keys: 4194304
> > > > > > hashmap (control) sequential    get:  hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s
> > > > > >
> > > > >
> > > > > Why is the hashmap lookup not constant with the number of keys? It looks
> > > > > like it's prepopulated without collisions, so I wouldn't expect any
> > > > > extra ops on the lookup side after looking at the code quickly.
> > > > It may be due to the cpu-cache misses as the map grows.
> > >
> > > Maybe, but values are just ints, so even 1k * 4B = 4kB should fit in
> > > cache on an otherwise unused server-class system. It would be more
> > > believable (to me at least) if the drop-off happened at 100k or
> > > more.
> > It is not only the value (and key) size.  There is overhead.
> > htab_elem alone is 48 bytes.  The key and value also need to be 8-byte aligned.
> >
>
> Right, late-night math didn't add up. Now I'm wondering if we can make
> hashmap behave much better; that drop-off is looking really ugly.
>
> > From a random machine:
> > lscpu -C
> > NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL  SETS PHY-LINE COHERENCY-SIZE
> > L1d       32K     576K    8 Data            1    64        1             64
> > L1i       32K     576K    8 Instruction     1    64        1             64
> > L2         1M      18M   16 Unified         2  1024        1             64
> > L3      24.8M    24.8M   11 Unified         3 36864        1             64
>
> Could you do a couple more data points then, num keys=100,200,400? I would
> expect those to fit in the cache and be the same as 10 by the cache theory. I
> could try as well, but it's looking like Friday before I have a spare moment.

I think the benchmark achieved its goal :)
It generated plenty of interesting data.
Pulling random out of the hot loop and any other improvements
can be done as follow-ups.
Pushed it to bpf-next.
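
For reference, the cache-footprint argument quoted above can be turned into
a rough back-of-the-envelope estimate. This is only a sketch under stated
assumptions: the 48-byte htab_elem size and the 8-byte alignment come from
the thread, the cache sizes are the ONE-SIZE column of the quoted lscpu
output, and the 4-byte key size is assumed (the thread only says the values
are ints). It ignores the bucket array, allocator overhead, and everything
else competing for cache, so it shows a lower bound on the working set, not
a prediction of throughput.

```python
# Rough working-set estimate for the hashmap control benchmark.
# All sizes are in bytes. Names here are illustrative, not kernel APIs.

HTAB_ELEM_BYTES = 48   # per-element header size quoted in the thread
KEY_BYTES = 4          # assumption: int key (not stated in the thread)
VALUE_BYTES = 4        # "values are just ints"

# Cache sizes taken from the quoted lscpu output (ONE-SIZE column).
CACHES = [
    ("L1d", 32 * 1024),
    ("L2", 1024 * 1024),
    ("L3", int(24.8 * 1024 * 1024)),
]


def round_up(n, align=8):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def elem_bytes(key_bytes=KEY_BYTES, value_bytes=VALUE_BYTES):
    """Estimated size of one element: header plus 8-byte-aligned key and value."""
    return HTAB_ELEM_BYTES + round_up(key_bytes) + round_up(value_bytes)


def smallest_fitting_cache(num_keys):
    """Return (cache level name, footprint) for the smallest cache that holds all elements."""
    footprint = num_keys * elem_bytes()
    for name, size in CACHES:
        if footprint <= size:
            return name, footprint
    return "DRAM", footprint


if __name__ == "__main__":
    for n in (10, 100, 200, 400, 1000, 10000, 100000, 4194304):
        level, footprint = smallest_fitting_cache(n)
        print(f"{n:>8} keys: {footprint:>11} bytes -> {level}")
```

Under these assumptions each element takes 64 bytes, so 1000 keys already
exceed the 32K L1d, while the requested 100, 200, and 400 key runs would
still fit in it.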