Re: nfnetlink_queue -- why linear lookup ?

On Sun, Aug 15, 2021 at 04:12:04PM +0200, Pablo Neira Ayuso wrote:
> On Sun, Aug 15, 2021 at 03:32:30PM +0200, alexandre.ferrieux@xxxxxxxxxx wrote:
> > On 8/15/21 3:07 PM, Pablo Neira Ayuso wrote:
> > > On Sun, Aug 15, 2021 at 02:17:08PM +0200, alexandre.ferrieux@xxxxxxxxxx wrote:
> > > [...]
> > > > So, the only way forward would be a separate hashtable on ids.
> > >
> > > Using the rhashtable implementation is fine for this, it's mostly
> > > boilerplate code that is needed to use it and there are plenty of
> > > examples in the kernel tree if you need a reference.
> >
> > Thanks, that's indeed pretty simple. I was just worried that people would
> > object to adding even the slightest overhead (hash_add/hash_del) to the
> > existing code path, that satisfies 99% of uses (LIFO). What do you think ?
>
> It should be possible to maintain both the list and the hashtable,
> AFAICS, the batch callback still needs the queue_list.
>
> > > > PS: what is the intended dominant use case for batch verdicts ?
> > >
> > > Issuing a batch containing several packets helps to amortize the
> > > cost of the syscall.
> >
> > Yes, but that could also be achieved by passing an array of ids.
>
> You mean, one single sendmsg() with several netlink messages, that
> would also work to achieve a similar batching effect.

sendmsg() can actually be slower. I abandoned a project to send verdicts via
sendmsg() (especially for large mangled packets), because benchmarking showed:

1. On a 3-year-old laptop, no discernible difference.

2. On a 1-year-old Ryzen desktop, sendmsg() was significantly slower.

sendmsg() sent 3 or 4 buffers: (1) the leading netlink message blocks; (2) the
packet; (3) padding to a 4-byte boundary, if required; (4) the trailing netlink
message blocks.

sendmsg() saved copying these blocks into a single buffer, but it introduced
the overhead of the kernel having to validate 4 userland buffers instead of 1.

A colleague suggested the Ryzen result comes down to its having 128-bit
registers for moving data. I guess that must be it.

The spreadsheets from the tests are up on GitHub:
https://github.com/duncan-roe/nfq6/blob/main/laptop_timings.ods
https://github.com/duncan-roe/nfq6/blob/main/timings.ods

Cheers ... Duncan.


