On 09/03, Alexei Starovoitov wrote:
> On Fri, Aug 30, 2019 at 02:18:09PM -0700, Stanislav Fomichev wrote:
> > > >
> > > > I personally like Jakub's/Quentin's proposal more. So if I get to choose
> > > > between this series and Jakub's filter+dump in BPF, I'd pick filter+dump
> > > > (pending per-cpu issue which we actually care about).
> > > >
> > > > But if we can have both, I don't have any objections; this patch
> I think we need to have both.
> imo Jakub's and Yonghong's approaches are solving slightly different cases.
>
> filter+dump via program is better suited for LRU map walks where the filter
> prog would do some non-trivial logic.
> Whereas a plain 'delete all' or 'dump all' is much simpler to use without
> loading yet another prog just to dump it.
> bpf infra today isn't quite ready for these very short-lived auxiliary progs.
> At prog load pages get a read-only mapping, tlbs are flushed across cpus,
> kallsyms get populated, FDs are allocated, etc.
> Loading a prog is a heavy operation. There was some chatter before about
> having built-in progs. This filter+dump could benefit from builtin
> 'allow all' or 'delete all' progs, but imo that complicates the design and
> raises even more questions than it answers. Should these builtin progs show
> up in 'bpftool prog show'? When do they load/unload? Same safety
> requirements as normal progs? etc.
> imo it's fine to have a little bit of overlap between apis.
> So I think we should proceed with both batching apis.
We don't need to load filter+dump every time we need a dump, right? We,
internally, want to have this 'batch dump' only for long-running daemons
(I think the same applies to bcc): we can load the filter+dump prog once
and then have a sys_bpf() command to trigger it.

Also, related: even if we add this batch dump, it doesn't mean that
everything should switch to it. For example, I feel like we are perfectly
fine if bpftool keeps using get_next_key+lookup, since we use it only for
debugging.

> Having said that, I think both are suffering from the important issue
> pointed out by Brian: when the kernel deletes an element, a get_next_key
> iterator over a hash/lru map will produce duplicates.
> The amount of duplicates can be huge. When the batched iterator is slow
> and a bpf prog is doing a lot of updates/deletes, there could be 10x worth
> of duplicates, since the walk will resume from the beginning.
> User space cannot be tasked to deal with it.
> I think this issue has to be solved in the kernel first, and it may
> require a different batching api.
>
> One idea is to use the bucket spin_lock and batch-process the map one
> bucket at a time. From the api pov, user space will tell the kernel:
> - here is a buffer for N elements; start the dump from the beginning.
> - the kernel will return <= N elements and an iterator.
> - user space will pass this opaque iterator back to get another batch.
> For a well-behaved hash/lru map there will be zero or one element per
> bucket. When there are 2+, the batching logic can process them together.
> If 'lookup' is requested, the kernel can check whether user space provided
> enough space for these 2 elements. If not, abort the batch early.
> get_next_key won't be used. Instead, some sort of opaque iterator will be
> returned to user space, so the next batch lookup can start from it.
> This iterator could be the index of the last dumped bucket.
> This idea won't work for pathological hash tables, though.
> A lot of elements in a single bucket may be more than there is room for
> in a single batch. In such a case the iterator will get stuck, since
> num_of_elements_in_bucket > batch_buf_size.
> Maybe a special error code can be used to solve that?
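Just to make sure I'm reading the proposal right, here is roughly how I'd
expect the userspace side of this to look. A minimal sketch only: the
wrapper name, the in_batch/out_batch encoding and the error conventions
are all made up, nothing like this exists today.

#include <errno.h>
#include <linux/types.h>

/* Hypothetical wrapper around a new batch-lookup sys_bpf() command.
 * 'in_batch'/'out_batch' carry the opaque iterator (e.g. the index of
 * the last dumped bucket); in_batch == NULL means "start from the
 * beginning". On input *count is the buffer size in elements, on
 * output it is the number of elements actually returned. Assume
 * -1/ENOENT once the whole map has been walked.
 */
int bpf_map_lookup_batch(int map_fd,
                         const __u64 *in_batch, __u64 *out_batch,
                         void *keys, void *values, __u32 *count);

void dump_map(int map_fd, void *keys, void *values, __u32 buf_entries)
{
        const __u64 *in = NULL; /* NULL == start from the beginning */
        __u64 pos, out;
        __u32 count;
        int err;

        for (;;) {
                count = buf_entries;
                err = bpf_map_lookup_batch(map_fd, in, &out,
                                           keys, values, &count);
                if (err && errno != ENOENT)
                        break;  /* real error, give up */
                /* ... consume 'count' key/value pairs here ... */
                if (err)
                        break;  /* ENOENT: walked the whole map */
                pos = out;      /* resume from the opaque iterator */
                in = &pos;
        }
}

The special error code you mention could then let the caller tell "this
bucket has more elements than your buffer, retry with a bigger one" apart
from the plain end-of-map ENOENT.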
This all requires new per-map implementations unfortunately :-(
We were trying to see if we can somehow improve the existing bpf_map_ops
to be more friendly towards batching; there is a strawman of what we had
in mind at the end of this email.

You also bring up a valid point with regard to well-behaved hash/lru
maps: we might be optimizing for the wrong case :-)

> I hope we can come up with other ideas to have a stable iterator over a
> hash table.
> Let's use email to describe the ideas and the upcoming LPC conference to
> sort out the details and finalize the one to use.
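Fwiw, to make the bpf_map_ops point above more concrete, here is the
rough shape we were playing with. Pure strawman: none of these ops exist
today, the names and signatures are made up.

#include <linux/types.h>

struct bpf_map;

/* Hypothetical per-map batching hooks. */
struct bpf_map_batch_ops {
        /* Copy out up to '*count' key/value pairs, starting from the
         * opaque position 'in_batch' (NULL == from the beginning), and
         * write the resume position into 'out_batch'. For htab this
         * could walk one bucket at a time under the bucket spin_lock,
         * with the bucket index as the iterator.
         */
        int (*map_lookup_batch)(struct bpf_map *map,
                                const void *in_batch, void *out_batch,
                                void *keys, void *values, __u32 *count);

        /* Same walk, but also delete the elements that were returned. */
        int (*map_lookup_and_delete_batch)(struct bpf_map *map,
                                           const void *in_batch,
                                           void *out_batch, void *keys,
                                           void *values, __u32 *count);
};

Maps that don't implement these could keep falling back to the existing
get_next_key+lookup loop in the syscall code, with all the duplicate
caveats you describe above.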