On Mon, Aug 16, 2021 at 10:04:58PM +1000, Duncan Roe wrote:
> On Sun, Aug 15, 2021 at 04:12:04PM +0200, Pablo Neira Ayuso wrote:
> > On Sun, Aug 15, 2021 at 03:32:30PM +0200, alexandre.ferrieux@xxxxxxxxxx wrote:
> > > On 8/15/21 3:07 PM, Pablo Neira Ayuso wrote:
> > > > On Sun, Aug 15, 2021 at 02:17:08PM +0200, alexandre.ferrieux@xxxxxxxxxx wrote:
> > > > [...]
> > > > > So, the only way forward would be a separate hashtable on ids.
> > > >
> > > > Using the rhashtable implementation is fine for this, it's mostly
> > > > boilerplate code that is needed to use it and there are plenty of
> > > > examples in the kernel tree if you need a reference.
> > >
> > > Thanks, that's indeed pretty simple. I was just worried that people would
> > > object to adding even the slightest overhead (hash_add/hash_del) to the
> > > existing code path, which satisfies 99% of uses (LIFO). What do you think?
> >
> > It should be possible to maintain both the list and the hashtable.
> > AFAICS, the batch callback still needs the queue_list.
> >
> > > > > PS: what is the intended dominant use case for batch verdicts?
> > > >
> > > > Issuing a batch containing several packets helps to amortize the
> > > > cost of the syscall.
> > >
> > > Yes, but that could also be achieved by passing an array of ids.
> >
> > You mean one single sendmsg() with several netlink messages? That
> > would also work to achieve a similar batching effect.
>
> sendmsg() can actually be slower. I gave up on a project to send verdicts using
> sendmsg() (especially with large mangled packets), because benchmarking showed:
>
> 1. On a 3YO laptop, no discernible difference.
>
> 2. On a 1YO Ryzen desktop, sendmsg() significantly slower.
>
> sendmsg() sent 3 or 4 buffers: 1. leading netlink message blocks; 2. the packet;
> 3. padding to 4 bytes (if required); last: trailing netlink message blocks.
>
> sendmsg() saved moving these data blocks into a single buffer, but it introduced
> the overhead of the kernel having to validate 4 userland buffers instead of 1.
>
> A colleague suggested the Ryzen result is because of having 128-bit registers to
> move data. I guess that must be it.
>
> The spreadsheets from the tests are up on GitHub:
> https://github.com/duncan-roe/nfq6/blob/main/laptop_timings.ods
> https://github.com/duncan-roe/nfq6/blob/main/timings.ods

Just a quick test creating 64K entries in the conntrack table, using libmnl.

- With batching:

  # time ./batch
  real    0m0,122s
  user    0m0,010s
  sys     0m0,112s

- Without batching:

  # time ./nobatch
  real    0m0,195s
  user    0m0,049s
  sys     0m0,146s

Just a sample; repeating the tests shows similar numbers.

Submitting a verdict on a packet via nfnetlink_queue is similar to creating
a ct entry through ctnetlink (both use the send syscall). If you only have a
few packets waiting for a verdict in userspace, then batching is probably not
helping much.

BTW, leading and trailing netlink message blocks to the kernel are not
required for nfnetlink_queue.
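
For reference, the rhashtable boilerplate mentioned above amounts to roughly
the following. This is only a sketch, not the actual nfnetlink_queue code: the
entry structure and the nfq_id_* identifiers are made up for illustration.

/* Sketch: keying queued entries by packet id with rhashtable.
 * All identifiers below are illustrative only. */
#include <linux/rhashtable.h>

struct nfq_id_entry {
        u32                     id;     /* hash key: packet id */
        struct rhash_head       node;   /* linkage into the rhashtable */
        /* ... rest of the per-packet state ... */
};

static const struct rhashtable_params nfq_id_params = {
        .key_len        = sizeof(u32),
        .key_offset     = offsetof(struct nfq_id_entry, id),
        .head_offset    = offsetof(struct nfq_id_entry, node),
        .automatic_shrinking = true,
};

static struct rhashtable nfq_id_table;

static int nfq_id_table_init(void)
{
        return rhashtable_init(&nfq_id_table, &nfq_id_params);
}

static int nfq_id_insert(struct nfq_id_entry *e)
{
        return rhashtable_insert_fast(&nfq_id_table, &e->node, nfq_id_params);
}

static struct nfq_id_entry *nfq_id_lookup(u32 id)
{
        return rhashtable_lookup_fast(&nfq_id_table, &id, nfq_id_params);
}

static void nfq_id_remove(struct nfq_id_entry *e)
{
        rhashtable_remove_fast(&nfq_id_table, &e->node, nfq_id_params);
}

The list and the hashtable can then be maintained side by side: insert into
both on enqueue, remove from both on dequeue, so the common LIFO path keeps
using the list while verdicts on arbitrary ids go through the lookup.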
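
The sendmsg() variant Duncan describes, where the kernel gathers several
userland buffers instead of one pre-assembled buffer, looks roughly like this.
Again only a sketch; the parameter names are placeholders for the real netlink
blocks and packet data.

/* Sketch: sending a verdict as several gathered buffers with sendmsg()
 * instead of copying them into one contiguous buffer first. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static ssize_t send_gathered(int nlfd,
                             const void *hdr_blocks, size_t hdr_len,
                             const void *packet, size_t pkt_len,
                             const void *trailer_blocks, size_t trailer_len)
{
        static const char pad[4];               /* zeroed padding bytes */
        size_t pad_len = (4 - (pkt_len & 3)) & 3;
        struct iovec iov[4];
        struct msghdr msg;
        int n = 0;

        iov[n].iov_base = (void *)hdr_blocks;     iov[n++].iov_len = hdr_len;
        iov[n].iov_base = (void *)packet;         iov[n++].iov_len = pkt_len;
        if (pad_len) {
                iov[n].iov_base = (void *)pad;    iov[n++].iov_len = pad_len;
        }
        iov[n].iov_base = (void *)trailer_blocks; iov[n++].iov_len = trailer_len;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = iov;
        msg.msg_iovlen = n;

        return sendmsg(nlfd, &msg, 0);
}

Whether this beats assembling everything into one buffer and calling send()
depends on the machine, as the spreadsheets above suggest: each iovec entry is
a separate userland buffer the kernel has to validate.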
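
And for the batching side, a minimal sketch of pushing several nfnetlink_queue
verdict messages in a single send using libmnl's batch helpers. The
put_verdict()/send_verdicts() helpers are invented for illustration and error
handling is kept to a minimum; the conntrack test above presumably follows the
same pattern, only with ctnetlink IPCTNL_MSG_CT_NEW messages instead of
verdicts.

/* Sketch: one NFQNL_MSG_VERDICT message per packet id, accumulated in a
 * libmnl batch and flushed with a single send per batch. */
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <libmnl/libmnl.h>
#include <linux/netfilter.h>
#include <linux/netfilter/nfnetlink.h>
#include <linux/netfilter/nfnetlink_queue.h>

static void put_verdict(char *buf, uint16_t queue_num, uint32_t id)
{
        struct nlmsghdr *nlh = mnl_nlmsg_put_header(buf);
        struct nfgenmsg *nfg;
        struct nfqnl_msg_verdict_hdr vh;

        nlh->nlmsg_type = (NFNL_SUBSYS_QUEUE << 8) | NFQNL_MSG_VERDICT;
        nlh->nlmsg_flags = NLM_F_REQUEST;

        nfg = mnl_nlmsg_put_extra_header(nlh, sizeof(*nfg));
        nfg->nfgen_family = AF_UNSPEC;
        nfg->version = NFNETLINK_V0;
        nfg->res_id = htons(queue_num);

        vh.verdict = htonl(NF_ACCEPT);
        vh.id = htonl(id);
        mnl_attr_put(nlh, NFQA_VERDICT_HDR, sizeof(vh), &vh);
}

static ssize_t send_verdicts(struct mnl_socket *nl, uint16_t queue_num,
                             const uint32_t *ids, unsigned int nids)
{
        char buf[2 * MNL_SOCKET_BUFFER_SIZE];
        struct mnl_nlmsg_batch *b;
        unsigned int i;
        ssize_t ret = 0;

        b = mnl_nlmsg_batch_start(buf, MNL_SOCKET_BUFFER_SIZE);

        for (i = 0; i < nids; i++) {
                put_verdict(mnl_nlmsg_batch_current(b), queue_num, ids[i]);
                if (mnl_nlmsg_batch_next(b))
                        continue;
                /* soft limit hit: flush what fits, keep the overflow message */
                ret = mnl_socket_sendto(nl, mnl_nlmsg_batch_head(b),
                                        mnl_nlmsg_batch_size(b));
                if (ret < 0)
                        break;
                mnl_nlmsg_batch_reset(b);
        }

        if (ret >= 0 && !mnl_nlmsg_batch_is_empty(b))
                ret = mnl_socket_sendto(nl, mnl_nlmsg_batch_head(b),
                                        mnl_nlmsg_batch_size(b));

        mnl_nlmsg_batch_stop(b);
        return ret;
}

As noted above, no leading or trailing framing messages are required for
nfnetlink_queue, so the buffer holds nothing but the verdict messages
themselves.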