Re: [RFC Patch bpf-next] bpf: introduce bpf timer

Hi folks, apparently I never clicked 'send' on this email, but if you
want to continue the discussion, I had some questions and thoughts.

This is also an interesting enough topic that it may be worth
considering submitting it for the upcoming LPC Networking & BPF track
(submission deadline is this Friday, August 13; conference dates are
September 20-24).

On Thu, May 13, 2021 at 7:53 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
>
> On Thu, May 13, 2021 at 11:46 AM Jamal Hadi Salim <jhs@xxxxxxxxxxxx> wrote:
> >
> > On 2021-05-12 6:43 p.m., Jamal Hadi Salim wrote:
> >
> > >
> > > Will run some tests tomorrow to see the effect of batching vs. no
> > > batching, and capture the cost of syscalls and CPU.
> > >
> >
> > So here are some numbers:
> > Processor: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
> > This machine is very similar to where a real deployment
> > would happen.
> >
> > Hyperthreading was turned off so we could dedicate the core to the
> > dumping process, and performance mode was on, so no frequency
> > scaling meddling.
> > Tests were run about 3 times each. Results were eye-balled to make
> > sure the deviation was reasonable.
> > 100% of the one core was used just for dumping during each run.
>
> I checked with Cilium users here at Bytedance; they actually observed
> 100% CPU usage too.

Thanks for the feedback. Can you provide further details? For instance,

* Which version of Cilium?
* How long do you observe this 100% CPU usage?
* What size CT map is in use?
* How frequently do you intend for CT GC to run? (Do you use the
default settings, or are they mismatched with your requirements for
some reason? If so, can we learn more about the requirements and why?)
* Do you have a threshold in mind that would be sufficient?

If necessary we can take this discussion off-list if the details are
sensitive, but I'd prefer to continue it here so we have some public
examples to refer to and to motivate future discussions. Alternatively,
we could move it to a Cilium GitHub issue if the tradeoffs are more
about the userspace implementation than the kernel specifics, though I
suspect some of the folks here would also like to follow along, so I
don't want to exclude the list.

FWIW I'm not inherently against a timer; in fact, I've wondered for a
while what kinds of interesting things we could build with such
support. At the same time, connection tracking entry management is a
nuanced topic, and it's easy to fix an issue in one area only to
introduce a problem in another.

> >
> > bpftool does linear retrieval whereas our tool does batch dumping.
> > bpftool does print the dumped results; for our tool we just count
> > the number of entries retrieved (the cost would have been higher if
> > we actually printed). In any case, in the real setup there is a
> > processing cost which is much higher.
> >
> > Summary is: the dumping is problematic cost-wise as the number of
> > entries increases. While batching does improve things, it doesn't
> > solve our problem (like I said, we have up to 16M entries and most
> > of the time we are dumping useless things)
>
> Thank you for sharing these numbers! Hopefully they can convince
> people here to accept the bpf timer. I will include your use case and
> performance numbers in my next update.

Yes, thanks Jamal for the numbers. It's very interesting; clearly
batch dumping is far more efficient, and we should enhance bpftool to
take advantage of it where applicable.
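
To make the comparison concrete, here's a rough sketch of what batch
dumping looks like from userspace via libbpf's bpf_map_lookup_batch()
(the key/value types, batch size and error handling are illustrative
assumptions on my part, not Jamal's actual tool):

#include <errno.h>
#include <bpf/bpf.h>	/* bpf_map_lookup_batch() */

/* Walk a map 256 entries per syscall instead of issuing one
 * BPF_MAP_GET_NEXT_KEY + BPF_MAP_LOOKUP_ELEM pair per entry.
 * Assumes 4-byte keys and 8-byte values purely for illustration.
 */
static long count_entries_batched(int map_fd)
{
	__u32 keys[256];
	__u64 vals[256];
	__u32 batch = 0, count;
	long total = 0;
	int err;

	do {
		count = 256;
		/* NULL in_batch starts the walk; subsequent calls resume
		 * from the opaque token returned in 'batch'. */
		err = bpf_map_lookup_batch(map_fd, total ? &batch : NULL,
					   &batch, keys, vals, &count, NULL);
		if (err && errno != ENOENT)
			return -errno;
		total += count;		/* count only, no printing */
	} while (!err);			/* ENOENT marks the end of the map */

	return total;
}

Amortizing the syscall cost over 256 entries at a time is presumably
where most of the gap in the numbers above comes from.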

> Like I said, we have up to 16M entries and most
> of the time we are dumping useless things)

I'm curious whether there's a more intelligent way to address this
'dumping useless things' aspect. I can see how timers would entirely
eliminate the cycles spent on the syscall side of this (in favor of the
timer-handling logic, which I'd guess is cheaper), but at some point,
if you're running certain logic on every entry in a map, it will of
course scale linearly.
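
For reference, my mental model of the per-entry timer approach is
roughly the following. This is only a sketch against the bpf_timer
helper interface (bpf_timer_init/bpf_timer_set_callback/
bpf_timer_start) rather than the exact API in this RFC, and the
map/value layout is made up for illustration:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#ifndef CLOCK_MONOTONIC
#define CLOCK_MONOTONIC 1
#endif

struct ct_key {
	__u32 saddr, daddr;
	__u16 sport, dport;
	__u8  proto;
};

struct ct_val {
	__u64 last_seen_ns;
	struct bpf_timer timer;		/* one timer embedded per entry */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1 << 24);	/* ballpark of the 16M entry case */
	__type(key, struct ct_key);
	__type(value, struct ct_val);
} ct_map SEC(".maps");

/* Timer callback: the entry reaps itself, no userspace dump needed. */
static int ct_expire(void *map, struct ct_key *key, struct ct_val *val)
{
	bpf_map_delete_elem(map, key);
	return 0;
}

/* Arm the expiry timer when a new entry is inserted. */
static __always_inline void ct_arm(struct ct_val *val, __u64 timeout_ns)
{
	if (bpf_timer_init(&val->timer, &ct_map, CLOCK_MONOTONIC))
		return;
	if (bpf_timer_set_callback(&val->timer, ct_expire))
		return;
	bpf_timer_start(&val->timer, timeout_ns, 0);
}

char LICENSE[] SEC("license") = "GPL";

This removes the dump syscalls entirely, but the callback still fires
once per expiring entry, which is the "scales linearly" part I'm
referring to above.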

The use case is different from the CT problem we discussed above, but
looking at the same question for the CT case, this is why I find LRU
useful: rather than firing off a number of timers linear in the size of
the map, the eviction work is bounded by the map insert rate, which
itself can be governed and rate-limited by logic running in eBPF. The
scan of the map then becomes less critical, so it can be run less
frequently, alleviating the CPU usage concern that way.
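
A minimal sketch of that LRU shape, with illustrative sizes and field
names rather than Cilium's actual layout:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct ct_tuple {
	__u32 saddr, daddr;
	__u16 sport, dport;
	__u8  proto;
};

struct ct_state {
	__u64 last_seen_ns;
	__u32 flags;
};

struct {
	__uint(type, BPF_MAP_TYPE_LRU_HASH);
	__uint(max_entries, 1 << 20);	/* eviction only kicks in past this */
	__type(key, struct ct_tuple);
	__type(value, struct ct_state);
} ct_lru SEC(".maps");

/* When the map is full, an insert evicts the least recently used entry
 * rather than failing, so the "GC" cost is bounded by the insert rate
 * (which the datapath can itself rate-limit) rather than the map size.
 */
static __always_inline int ct_create(struct ct_tuple *tuple)
{
	struct ct_state st = {
		.last_seen_ns = bpf_ktime_get_ns(),
	};

	return bpf_map_update_elem(&ct_lru, tuple, &st, BPF_ANY);
}

char LICENSE[] SEC("license") = "GPL";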


