On Fri, 30 Aug 2019 07:25:54 +0000, Yonghong Song wrote:
> On 8/29/19 11:39 AM, Jakub Kicinski wrote:
> > On Wed, 28 Aug 2019 23:45:02 -0700, Yonghong Song wrote:
> >> Brian Vazquez has proposed a BPF_MAP_DUMP command to look up more
> >> than one map entry per syscall.
> >> https://lore.kernel.org/bpf/CABCgpaU3xxX6CMMxD+1knApivtc2jLBHysDXw-0E9bQEL0qC3A@xxxxxxxxxxxxxx/T/#t
> >>
> >> During discussion, we found more use cases can be supported in a
> >> similar map operation batching framework. For example, batched map
> >> lookup and delete, which can be really helpful for bcc.
> >> https://github.com/iovisor/bcc/blob/master/tools/tcptop.py#L233-L243
> >> https://github.com/iovisor/bcc/blob/master/tools/slabratetop.py#L129-L138
> >>
> >> Also, in bcc, we have an API to delete all entries in a map.
> >> https://github.com/iovisor/bcc/blob/master/src/cc/api/BPFTable.h#L257-L264
> >>
> >> For map update, batched operations are also useful, as applications
> >> sometimes need to populate initial maps with more than one entry.
> >> For example, the below example is from
> >> kernel/samples/bpf/xdp_redirect_cpu_user.c:
> >> https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu_user.c#L543-L550
> >>
> >> This patch set addresses all the above use cases. To keep the uapi
> >> stable, it also covers other potential use cases. Four bpf syscall
> >> subcommands are introduced:
> >>     BPF_MAP_LOOKUP_BATCH
> >>     BPF_MAP_LOOKUP_AND_DELETE_BATCH
> >>     BPF_MAP_UPDATE_BATCH
> >>     BPF_MAP_DELETE_BATCH
> >>
> >> In user space, applications can iterate through the whole map one
> >> batch at a time, e.g., with bpf_map_lookup_batch() as below:
> >>
> >>     p_key = NULL;
> >>     p_next_key = &key;
> >>     while (true) {
> >>         err = bpf_map_lookup_batch(fd, p_key, &p_next_key, keys, values,
> >>                                    &batch_size, elem_flags, flags);
> >>         if (err) ...
> >>         if (!p_next_key) break; // done
> >>         if (!p_key) p_key = p_next_key;
> >>     }
> >>
> >> Please look at the individual patches for details of the new syscall
> >> subcommands and examples of user code.
> >>
> >> The testing is also done in a qemu VM environment:
> >>     measure_lookup: max_entries 1000000, batch 10, time 342ms
> >>     measure_lookup: max_entries 1000000, batch 1000, time 295ms
> >>     measure_lookup: max_entries 1000000, batch 1000000, time 270ms
> >>     measure_lookup: max_entries 1000000, no batching, time 1346ms
> >>     measure_lookup_delete: max_entries 1000000, batch 10, time 433ms
> >>     measure_lookup_delete: max_entries 1000000, batch 1000, time 363ms
> >>     measure_lookup_delete: max_entries 1000000, batch 1000000, time 357ms
> >>     measure_lookup_delete: max_entries 1000000, not batch, time 1894ms
> >>     measure_delete: max_entries 1000000, batch, time 220ms
> >>     measure_delete: max_entries 1000000, not batch, time 1289ms
> >>
> >> For a 1M entry hash table, a batch size of 10 can reduce cpu time by
> >> 70%. Please see the patch "tools/bpf: measure map batching perf" for
> >> details of the test code.
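As an illustration of the loop quoted above, a more complete user-space
sketch might look as follows. The bpf_map_lookup_batch() wrapper
signature is inferred from the cover letter's loop; the element types,
BATCH_SIZE, the in/out semantics of batch_size, and the error handling
are assumptions for this sketch, not something the series defines:

    #include <stdbool.h>
    #include <stdio.h>
    #include <linux/types.h>

    #define BATCH_SIZE 1000

    /* Proposed wrapper from the series; the signature here is
     * inferred from the loop in the cover letter, not taken from
     * the patches themselves.
     */
    int bpf_map_lookup_batch(int fd, void *in_key, void **out_key,
                             void *keys, void *values, __u32 *count,
                             __u64 elem_flags, __u64 flags);

    static int dump_whole_map(int fd)
    {
            __u32 key, keys[BATCH_SIZE];
            __u64 values[BATCH_SIZE];
            __u32 batch_size, i;
            void *p_key = NULL, *p_next_key = &key;
            int err;

            while (true) {
                    /* assumed: in = room available, out = entries returned */
                    batch_size = BATCH_SIZE;
                    err = bpf_map_lookup_batch(fd, p_key, &p_next_key,
                                               keys, values, &batch_size,
                                               0 /* elem_flags */,
                                               0 /* flags */);
                    if (err)
                            return err;

                    for (i = 0; i < batch_size; i++)
                            printf("key %u -> %llu\n", keys[i],
                                   (unsigned long long)values[i]);

                    if (!p_next_key)
                            break;                  /* whole map traversed */
                    if (!p_key)
                            p_key = p_next_key;     /* continue from cursor */
            }

            return 0;
    }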
> >
> > Hi Yonghong!
> >
> > Great to see this, we have been looking at implementing some way to
> > speed up map walks as well.
> >
> > The direction we were looking in, after previous discussions [1],
> > however, was to provide a BPF program which can run the logic entirely
> > within the kernel.
> >
> > We have a rough PoC on the FW side (we can offload the program which
> > walks the map, which is pretty neat), but the kernel verifier side
> > hasn't really progressed. It will soon.
> >
> > The rough idea is that the user space provides two programs, "filter"
> > and "dumper":
> >
> >    bpftool map exec id XYZ filter pinned /some/prog \
> >                            dumper pinned /some/other_prog
> >
> > Both programs get this context:
> >
> >    struct map_op_ctx {
> >            u64 key;
> >            u64 value;
> >    }
> >
> > We need a per-map implementation of the exec side, but roughly maps
> > would do:
> >
> >    LIST_HEAD(deleted);
> >
> >    for entry in map {
> >            struct map_op_ctx {
> >                    .key = entry->key,
> >                    .value = entry->value,
> >            };
> >
> >            act = BPF_PROG_RUN(filter, &map_op_ctx);
> >            if (act & ~ACT_BITS)
> >                    return -EINVAL;
> >
> >            if (act & DELETE) {
> >                    map_unlink(entry);
> >                    list_add(entry, &deleted);
> >            }
> >            if (act & STOP)
> >                    break;
> >    }
> >
> >    synchronize_rcu();
> >
> >    for entry in deleted {
> >            struct map_op_ctx {
> >                    .key = entry->key,
> >                    .value = entry->value,
> >            };
> >
> >            BPF_PROG_RUN(dumper, &map_op_ctx);
> >            map_free(entry);
> >    }
> >
> > The filter program can't perform any map operations other than lookup,
> > otherwise we won't be able to guarantee that we'll walk the entire map
> > (if the filter program deletes some entries in an unfortunate order).
>
> Looks like you will provide a new program type and a per-map
> implementation of the above code. My patch set indeed avoided a
> per-map implementation for all of lookup/delete/get-next-key...

Indeed, the simple batched ops are undeniably lower LoC.

> > If user space just wants a pure dump it can simply load a program
> > which dumps the entries into a perf ring.
>
> A percpu perf ring is not really ideal for user space which simply
> wants to get some key/value pairs back. Some kind of generic
> non-per-cpu ring buffer might be better for such cases.

I don't think it has to be per-cpu, but I may be blissfully ignorant
about the perf ring details :) bpf_perf_event_output() takes flags,
which effectively select the "output CPU", no?

> > I'm bringing this up because that mechanism should cover what is
> > achieved with this patch set and much more.
>
> The only case it does not cover is batched update. But that may not
> be super critical.

Right, my other concern (which admittedly is slightly pedantic) is the
potential loss of modifications. The lookup_and_delete() operation does
not guarantee that the result returned to user space is the "final"
value; we'd need to wait for an RCU grace period between the lookup and
the dump. The "map is definitely unused now"/RCU guarantee has come up
in the past.

> Your approach gives each element an action choice through another bpf
> program. This is indeed powerful. My use case is simpler than your
> use case below, hence the implementation.

Agreed, the simplicity is tempting.

> > In particular for networking workloads where old flows have to be
> > pruned from the map periodically it's far more efficient to
> > communicate to user space only the flows which timed out (the delete
> > batching from this set won't help at all).
>
> Maybe an LRU map will help in this case? It is designed for such
> use cases.

An LRU map would be perfect if it dumped the entry somewhere before it
got reused... Perhaps we need to attach a program to the LRU map that'd
get run when entries get reaped. That'd be cool 🤔 We could trivially
reuse the "dumper" prog type for this.

> > With a 2M entry map and this patch set we still won't be able to
> > prune once a second on one core.
> >
> > [1] https://lore.kernel.org/netdev/20190813130921.10704-4-quentin.monnet@xxxxxxxxxxxxx/
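To make the filter/dumper idea above concrete, here is a sketch of a
filter program pruning idle flows, the use case discussed at the end
of the thread. Only the map_op_ctx layout and the DELETE/STOP action
bits come from the pseudocode above; the section name, the action
values, the flow_entry layout, and the assumption that ctx->value
carries a pointer to the element's value are all made up for
illustration, since this program type does not exist yet:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct map_op_ctx {
            __u64 key;
            __u64 value;
    };

    /* Hypothetical action bits matching the walk pseudocode above. */
    enum {
            MAP_OP_OK     = 0,
            MAP_OP_DELETE = 1 << 0, /* unlink entry, hand it to the dumper */
            MAP_OP_STOP   = 1 << 1, /* stop walking the map */
    };

    /* Hypothetical value layout of the flow map being walked. */
    struct flow_entry {
            __u64 last_seen_ns;
            __u64 bytes;
    };

    SEC("map_filter")
    int prune_idle_flows(struct map_op_ctx *ctx)
    {
            /* Assumes ctx->value is a kernel-provided reference to the
             * element's value, usable as a pointer.
             */
            struct flow_entry *val = (struct flow_entry *)(long)ctx->value;
            __u64 now = bpf_ktime_get_ns();

            /* Ask the walker to unlink flows idle for more than one
             * second; the dumper program would then see only those
             * entries and could push just the timed-out keys to user
             * space via a (perf) ring.
             */
            if (now - val->last_seen_ns > 1000000000ULL)
                    return MAP_OP_DELETE;

            return MAP_OP_OK;
    }

    char _license[] SEC("license") = "GPL";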