Re: [RFC Patch bpf-next] bpf: introduce bpf timer

On 2021-05-11 1:05 a.m., Joe Stringer wrote:
Hi Cong,


and let me quote the original report here:

"The current implementation (as of v1.2) for managing the contents of
the datapath connection tracking map leaves something to be desired:
Once per minute, the userspace cilium-agent makes a series of calls to
the bpf() syscall to fetch all of the entries in the map to determine
whether they should be deleted. For each entry in the map, 2-3 calls
must be made: One to fetch the next key, one to fetch the value, and
perhaps one to delete the entry. The maximum size of the map is 1
million entries, and if the current count approaches this size then
the garbage collection goroutine may spend a significant number of CPU
cycles iterating and deleting elements from the conntrack map."

I'm also curious to hear more details as I haven't seen any recent
discussion in the common Cilium community channels (GitHub / Slack)
around deficiencies in the conntrack garbage collection since we
addressed the LRU issues upstream and switched back to LRU maps.

For our use case we can't use LRU. We need to account for every entry,
i.e. we don't want an entry to be GC'd without our consent; we want to
control the GC. Your PR was pointing to LRU evicting some flow entries
for TCP connections that were just idling, for example.
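
To put the pattern from the quoted report in concrete terms, the
per-entry walk from userspace looks roughly like this (a sketch only;
the key/value layouts and expiry policy below are placeholders, not
Cilium's or our actual code):

#include <stdbool.h>
#include <bpf/bpf.h>

/* Placeholder layouts and expiry policy -- stand-ins only. */
struct ct_key   { __u32 data[4]; };
struct ct_value { __u64 last_seen_ns; };

static bool should_delete(const struct ct_value *val)
{
        return false;   /* real policy: timeouts, TCP state, etc. */
}

/* One GC pass: 2-3 bpf() syscalls per entry
 * (get_next_key, lookup, maybe delete).
 */
static void gc_pass(int map_fd)
{
        struct ct_key key, next_key;
        struct ct_value val;
        void *prev = NULL;

        while (!bpf_map_get_next_key(map_fd, prev, &next_key)) {
                if (!bpf_map_lookup_elem(map_fd, &next_key, &val) &&
                    should_delete(&val))
                        bpf_map_delete_elem(map_fd, &next_key);
                key = next_key;
                prev = &key;
        }
}

(Deleting the key you are iterating from can make hash map iteration
restart from the beginning; the point here is only the 2-3 syscalls
per entry.)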


There's an update to the report quoted from the same link above:

"In recent releases, we've moved back to LRU for management of the CT
maps so the core problem is not as bad; furthermore we have
implemented a backoff for GC depending on the size and number of
entries in the conntrack table, so that in active environments the
userspace GC is frequent enough to prevent issues but in relatively
passive environments the userspace GC is only rarely run (to minimize
CPU impact)."

By "core problem is not as bad", I would have been referring to the
way that failing to garbage collect hashtables in a timely manner can
lead to rejecting new connections due to lack of available map space.
Switching back to LRU mitigated this concern. With a reduced frequency
of running the garbage collection logic, the CPU impact is lower as
well. I don't think we've explored batched map ops for this use case
yet either, which would already serve to improve the CPU usage
situation without extending the kernel.


Will run some tests tomorrow to see the effect of batching vs no
batching, and capture the cost of syscalls and CPU.
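
For reference, the batched path I plan to compare would look roughly
like the sketch below (bpf_map_lookup_batch()/bpf_map_delete_batch()
from libbpf, which IIRC need a 5.6+ kernel; same placeholder layouts
and should_delete() policy as in the earlier sketch):

#include <errno.h>
#include <bpf/bpf.h>

#define BATCH_SZ 4096

static void gc_pass_batched(int map_fd)
{
        struct ct_key keys[BATCH_SZ], del_keys[BATCH_SZ];
        struct ct_value vals[BATCH_SZ];
        __u32 batch, count, n_del, i;
        void *in = NULL;
        int err;

        do {
                count = BATCH_SZ;
                /* One syscall fetches up to BATCH_SZ keys+values. */
                err = bpf_map_lookup_batch(map_fd, in, &batch,
                                           keys, vals, &count, NULL);
                if (err && errno != ENOENT)
                        break;

                n_del = 0;
                for (i = 0; i < count; i++)
                        if (should_delete(&vals[i]))
                                del_keys[n_del++] = keys[i];

                /* One more syscall deletes the expired subset. */
                if (n_del)
                        bpf_map_delete_batch(map_fd, del_keys, &n_del, NULL);

                in = &batch;    /* resume where the last batch stopped */
        } while (!err);
}

That collapses the per-entry syscalls into a couple of syscalls per
BATCH_SZ entries, which is what I expect to show up in the syscall and
CPU numbers.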

Note: even then, it is not a good general solution. Our entries can
go as high as 16M.
Our workflow is: 1) every 1-5 seconds you dump, 2) you process for
what needs to be deleted etc., then 3) you do the updates (another 1-3
seconds worth of time). There is a point, depending on the number of
entries, where your time cost of processing exceeds your polling
period. The likelihood of entry state loss is high for even a 1/2
second loss of sync.
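For a rough sense of scale: at 16M entries, even an optimistic ~1us of
end-to-end cost per entry (dump + decide + update, which is an
assumption on my part, not a measurement) works out to ~16 seconds per
cycle, i.e. already well past a 1-5 second polling period before you
add any syscall overhead.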

The main outstanding issue I'm aware of is that we will often have a
1:1 mapping of entries in the CT map and the NAT map, and ideally we'd
like them to have tied fates but currently we have no mechanism to do
this with LRU. When LRU eviction occurs, the entries can get out of
sync until the next GC.

Yes, this ties in with our use case as well (not NAT for us, but a
semantically similar challenge). It goes the other way too: if
userspace decides to adjust your NAT table, you need to purge the
related entries from the cache.



I could imagine timers helping with this if we ...

Yes, timers would solve this.

I am not even arguing that we need timers to solve these issues. I am
just saying that timers seem to be fundamental infra that is needed
even beyond the scope of this discussion.
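
To make the CT case concrete, what I have in mind is a per-entry timer
armed from the datapath, along the lines of the sketch below. The
helper names and signatures (bpf_timer_init()/bpf_timer_set_callback()/
bpf_timer_start() on a timer embedded in the map value) are how I would
expect such an API to look; treat them as illustrative rather than as
this patch's exact interface:

#include <linux/bpf.h>
#include <time.h>
#include <bpf/bpf_helpers.h>

struct ct_value {
        __u64 last_seen_ns;
        struct bpf_timer timer;         /* per-entry timer in the value */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16 * 1024 * 1024);
        __type(key, __u32);             /* placeholder flow key */
        __type(value, struct ct_value);
} ct_map SEC(".maps");

static int ct_expire(void *map, __u32 *key, struct ct_value *val)
{
        /* Expiry policy runs here; this is also where the related NAT
         * entry could be purged so the two maps keep tied fates.
         */
        bpf_map_delete_elem(map, key);
        return 0;
}

SEC("tc")
int handle_pkt(struct __sk_buff *skb)
{
        __u32 key = 0;                  /* placeholder flow key lookup */
        struct ct_value *val;

        val = bpf_map_lookup_elem(&ct_map, &key);
        if (val) {
                /* init fails harmlessly if the timer already exists;
                 * start re-arms it, giving an idle timeout that is
                 * refreshed on every packet.
                 */
                bpf_timer_init(&val->timer, &ct_map, CLOCK_MONOTONIC);
                bpf_timer_set_callback(&val->timer, ct_expire);
                bpf_timer_start(&val->timer, 30ULL * 1000000000ULL, 0);
        }
        return 0;
}

char _license[] SEC("license") = "GPL";

With something like that, neither userspace GC walks nor LRU eviction
decide when an entry dies; the datapath does.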

cheers,
jamal


