On 09/03, Alexei Starovoitov wrote:
> On Fri, Aug 30, 2019 at 02:18:09PM -0700, Stanislav Fomichev wrote:
> > > >
> > > > I personally like Jakub's/Quentin's proposal more. So if I get to choose
> > > > between this series and Jakub's filter+dump in BPF, I'd pick filter+dump
> > > > (pending per-cpu issue which we actually care about).
> > > >
> > > > But if we can have both, I don't have any objections; this patch
> I think we need to have both.
> imo Jakub's and Yonghong's approaches are solving slightly different cases.
>
> filter+dump via program is better suited for LRU map walks where the filter
> prog would do some non-trivial logic.
> Whereas a plain 'delete all' or 'dump all' is much simpler to use without
> loading yet another prog just to dump it.
> bpf infra today isn't quite ready for these very short-lived auxiliary progs.
> At prog load pages get a read-only mapping, tlbs are flushed across cpus,
> kallsyms get populated, FDs are allocated, etc.
> Loading a prog is a heavy operation. There was some chatter before about
> having built-in progs. This filter+dump could benefit from builtin
> 'allow all' or 'delete all' progs, but imo that complicates the design and
> raises even more questions than it answers. Should these builtin progs show
> up in 'bpftool prog show'? When do they load/unload? Same safety
> requirements as normal progs? etc.
> imo it's fine to have a little bit of overlap between apis.
> So I think we should proceed with both batching apis.
We don't need to load filter+dump every time we need a dump, right? We,
internally, want to have this 'batch dump' only for long-running daemons
(I think the same applies to bcc): we can load the filter+dump prog once
and then have a sys_bpf() command to trigger it.

Also, related: even if we add this batch dump, it doesn't mean that
everything should switch to it. For example, I feel like we are perfectly
fine if bpftool keeps using get_next_key+lookup, since we use it only for
debugging.

> Having said that, I think both are suffering from the important issue
> pointed out by Brian: when the kernel deletes an element, a get_next_key
> iterator over a hash/lru map will produce duplicates.
> The amount of duplicates can be huge. When the batched iterator is slow
> and a bpf prog is doing a lot of updates/deletes, there could be 10x worth
> of duplicates, since the walk will resume from the beginning.
> User space cannot be tasked to deal with it.
> I think this issue has to be solved in the kernel first, and it may
> require a different batching api.
>
> One idea is to use the bucket spin_lock and batch-process the map one
> bucket at a time. From the api pov, user space will tell the kernel:
> - here is a buffer for N elements; start the dump from the beginning.
> - the kernel will return <= N elements and an iterator.
> - user space will pass this opaque iterator back to get another batch.
> For a well-behaved hash/lru map there will be zero or one element per
> bucket. When there are 2+, the batching logic can process them together.
> If 'lookup' is requested, the kernel can check whether user space provided
> enough space for these 2 elements. If not, abort the batch early.
> get_next_key won't be used. Instead, some sort of opaque iterator will be
> returned to user space, so the next batch lookup can start from it.
> This iterator could be the index of the last dumped bucket.
> This idea won't work for pathological hash tables, though.
> A lot of elements in a single bucket may be more than there is room for
> in a single batch. In such a case the iterator will get stuck, since
> num_of_elements_in_bucket > batch_buf_size.
> Maybe a special error code can be used to solve that?
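Just to make sure I'm reading the proposal right, here is roughly how I'd
expect the userspace side of this to look. A minimal sketch only: the
wrapper name, the in_batch/out_batch encoding and the error conventions
are all made up, nothing like this exists today.

#include <errno.h>
#include <linux/types.h>

/* Hypothetical wrapper around a new batch-lookup sys_bpf() command.
 * 'in_batch'/'out_batch' carry the opaque iterator (e.g. the index of
 * the last dumped bucket); in_batch == NULL means "start from the
 * beginning". On input *count is the buffer size in elements, on
 * output it is the number of elements actually returned. Assume
 * -1/ENOENT once the whole map has been walked.
 */
int bpf_map_lookup_batch(int map_fd,
                         const __u64 *in_batch, __u64 *out_batch,
                         void *keys, void *values, __u32 *count);

void dump_map(int map_fd, void *keys, void *values, __u32 buf_entries)
{
        const __u64 *in = NULL; /* NULL == start from the beginning */
        __u64 pos, out;
        __u32 count;
        int err;

        for (;;) {
                count = buf_entries;
                err = bpf_map_lookup_batch(map_fd, in, &out,
                                           keys, values, &count);
                if (err && errno != ENOENT)
                        break;  /* real error, give up */
                /* ... consume 'count' key/value pairs here ... */
                if (err)
                        break;  /* ENOENT: walked the whole map */
                pos = out;      /* resume from the opaque iterator */
                in = &pos;
        }
}

The special error code you mention could then let the caller tell "this
bucket has more elements than your buffer, retry with a bigger one" apart
from the plain end-of-map ENOENT.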
This all requires new per-map implementations unfortunately :-(
We were trying to see if we can somehow improve the existing bpf_map_ops
to be more friendly towards batching; there is a strawman of what we had
in mind at the end of this email.

You also bring up a valid point with regard to well-behaved hash/lru
maps: we might be optimizing for the wrong case :-)

> I hope we can come up with other ideas to have a stable iterator over a
> hash table.
> Let's use email to describe the ideas and the upcoming LPC conference to
> sort out the details and finalize the one to use.
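Fwiw, to make the bpf_map_ops point above more concrete, here is the
rough shape we were playing with. Pure strawman: none of these ops exist
today, the names and signatures are made up.

#include <linux/types.h>

struct bpf_map;

/* Hypothetical per-map batching hooks. */
struct bpf_map_batch_ops {
        /* Copy out up to '*count' key/value pairs, starting from the
         * opaque position 'in_batch' (NULL == from the beginning), and
         * write the resume position into 'out_batch'. For htab this
         * could walk one bucket at a time under the bucket spin_lock,
         * with the bucket index as the iterator.
         */
        int (*map_lookup_batch)(struct bpf_map *map,
                                const void *in_batch, void *out_batch,
                                void *keys, void *values, __u32 *count);

        /* Same walk, but also delete the elements that were returned. */
        int (*map_lookup_and_delete_batch)(struct bpf_map *map,
                                           const void *in_batch,
                                           void *out_batch, void *keys,
                                           void *values, __u32 *count);
};

Maps that don't implement these could keep falling back to the existing
get_next_key+lookup loop in the syscall code, with all the duplicate
caveats you describe above.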