Re: [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 14 Aug 2019 13:18:27 -0700

On Wed, Aug 14, 2019 at 10:12 AM Quentin Monnet
<quentin.monnet@xxxxxxxxxxxxx> wrote:
>
> 2019-08-14 09:58 UTC-0700 ~ Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx>
> > On Wed, Aug 14, 2019 at 9:45 AM Edward Cree <ecree@xxxxxxxxxxxxxx> wrote:
> >>
> >> On 14/08/2019 10:42, Quentin Monnet wrote:
> >>> 2019-08-13 18:51 UTC-0700 ~ Alexei Starovoitov
> >>> <alexei.starovoitov@xxxxxxxxx>
> >>>> The same can be achieved by 'bpftool map dump|grep key|wc -l', no?
> >>> To some extent (with subtleties for some other map types); and we use a
> >>> similar command line as a workaround for now. But because of the rate of
> >>> inserts/deletes in the map, the process often reports a number higher
> >>> than the max number of entries (we observed up to ~750k when max_entries
> >>> is 500k), even if the map is only half-full on average during the count.
> >>> In the worst case (though not frequent), an entry is deleted just before
> >>> we get the next key from it, and iteration starts all over again. This
> >>> is not reliable to determine how much space is left in the map.
> >>>
> >>> I cannot see a solution that would provide a more accurate count from
> >>> user space, when the map is under pressure?
> >> This might be a really dumb suggestion, but: you're wanting to collect a
> >>  summary statistic over an in-kernel data structure in a single syscall,
> >>  because making a series of syscalls to examine every entry is slow and
> >>  racy.  Isn't that exactly a job for an in-kernel virtual machine, and
> >>  could you not supply an eBPF program which the kernel runs on each entry
> >>  in the map, thus supporting people who want to calculate something else
> >>  (mean, min and max, whatever) instead of count?
> >
> > Pretty much my suggestion as well :)

I also support the suggestion to count from the BPF side. It's a
flexible and powerful approach, and it doesn't require adding more and
more nuanced sub-APIs to the kernel to support a subset of bulk
operations on maps. (A subset, because we'd expose count, but what
about, e.g., p50? There will always be something more that someone
wants, and it just doesn't scale.)
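
To make the flexibility argument concrete, here is a rough userspace C
model of the idea. It is not kernel or BPF code: a callback is run once
per live entry and computes count, min and max in a single pass, and
swapping the callback gives a different statistic. All names
(for_each_entry, collect_stats, the structs) are made up for
illustration, and no such per-entry hook is proposed as an existing API
in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one map slot; not a kernel structure. */
struct entry {
    uint32_t key;
    uint64_t value;
    int in_use;
};

/* Summary statistics accumulated over all live entries. */
struct stats {
    uint64_t count;
    uint64_t min;
    uint64_t max;
};

/* Hypothetical callback type: invoked once per live entry. */
typedef void (*visit_fn)(const struct entry *e, void *ctx);

/* Walk the table once and call fn on every slot that is in use. */
static void for_each_entry(const struct entry *tbl, size_t n,
                           visit_fn fn, void *ctx)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].in_use)
            fn(&tbl[i], ctx);
}

/* One possible callback: count entries and track min/max of the values. */
static void collect_stats(const struct entry *e, void *ctx)
{
    struct stats *s = ctx;

    if (s->count == 0 || e->value < s->min)
        s->min = e->value;
    if (s->count == 0 || e->value > s->max)
        s->max = e->value;
    s->count++;
}
```

A caller would zero-initialize a struct stats and pass it as ctx, e.g.
for_each_entry(tbl, n, collect_stats, &s). Adding another statistic
means writing another callback, not another sub-API.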

> >
> > It seems the better fix for your nat threshold is to keep count of
> > elements in the map in a separate global variable that
> > bpf program manually increments and decrements.
> > bpftool will dump it just as regular map of single element.
> > (I believe it doesn't recognize global variables properly yet)
> > and BTF will be there to pick exactly that 'count' variable.
> >
>
> It would be with an offloaded map, but yes, I suppose we could keep
> track of the numbers in a separate map. We'll have a look into this.

See if you can use a global variable. That way you completely
eliminate any overhead on the BPF side, except for the atomic
increment.
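
Roughly what I have in mind, as a plain userspace C model and not
actual BPF code: the insert/delete paths update a separate counter
atomically, and only when the operation really changed the number of
entries. The table, MAX_ENTRIES, and the names map_insert, map_delete
and map_count are all assumptions for illustration. In a real BPF
program the table would be the map itself and the update/delete would
go through the map helpers. Only the counter is atomic here; the toy
table is not safe for concurrent use.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 8 /* tiny stand-in for the map's max_entries */

/* Hypothetical stand-in for one map slot. */
struct slot {
    uint32_t key;
    uint64_t value;
    int in_use;
};

static struct slot table[MAX_ENTRIES]; /* stand-in for the hash map */
static int64_t entry_count;            /* the separate global counter */

/* Linear lookup; returns NULL if the key is not present. */
static struct slot *find(uint32_t key)
{
    for (size_t i = 0; i < MAX_ENTRIES; i++)
        if (table[i].in_use && table[i].key == key)
            return &table[i];
    return NULL;
}

/* Insert a new key; bump the counter only if an entry was really added. */
static int map_insert(uint32_t key, uint64_t value)
{
    if (find(key))
        return -1; /* already present: the count must not change */

    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (!table[i].in_use) {
            table[i] = (struct slot){ .key = key, .value = value, .in_use = 1 };
            __sync_fetch_and_add(&entry_count, 1); /* atomic increment */
            return 0;
        }
    }
    return -1; /* table full */
}

/* Delete a key; drop the counter only if an entry was really removed. */
static int map_delete(uint32_t key)
{
    struct slot *s = find(key);

    if (!s)
        return -1; /* nothing deleted: the count must not change */
    s->in_use = 0;
    __sync_fetch_and_sub(&entry_count, 1); /* atomic decrement */
    return 0;
}

/* Atomic read of the counter, so no iteration over the table is needed. */
static int64_t map_count(void)
{
    return __sync_fetch_and_add(&entry_count, 0);
}
```

The point is that reading the count is a single load, so it can never
exceed max_entries the way a racy key-by-key walk can.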

>
> Thanks to both of you for the suggestions.
> Quentin