On Wed, Mar 12, 2025 at 11:11 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> > On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> > >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> >
[snip]
> > >> Also we have the ephemeral PDC connections that come and go as
> > >> needed. There are more such objects coming with more
> > >> state, configuration and lifecycle management. That is why we added a
> > >> separate netlink family to cleanly manage them without trying to fit
> > >> a square peg in a round hole so to speak.
> > >
> > > Yeah, I saw that you are planning to use netlink to manage objects,
> > > which is very questionable. It is slow, unreliable, requires sockets,
> > > needs more parsing logic e.t.c

To chime in on the above re: netlink vs ioctl [this is going to be a long
message - over-caffeinated and stuck on a trip....]

On "slow": netlink is mostly deemed "slow" for the following reasons:
1) locks - which over the last year have been greatly reduced
2) crossing user/kernel - which I believe is fixable with some mmap scheme
   (although past attempts at doing this have been unsuccessful)
3) async netlink vs sync ioctl (more below)

On "unreliable": this is typically the result of a request/response (or a
subscribed-to event) whose execution failed to allocate memory in the kernel
or overran some buffer towards user space; however, any such failure is
signalled to user space and can be recovered from.

ioctl is synchronous, which is what gives it the "reliability" and "speed".
IIRC, if a memory failure were to happen on an ioctl it would block until it
succeeds? Netlink, being async, instead signals user space when data is lost
or cannot be fully delivered. For example, if a user issued a dump of a very
large amount of data from the kernel and that data wasn't fully delivered,
perhaps because of memory pressure, user space is notified via a socket error
and can use that to recover.

Extensibility: ioctls take binary structs, which makes them much harder to
extend but adds to that "speed". Once you pick your struct, you are stuck
with it - as opposed to netlink, which uses formally defined TLVs and is
therefore highly extensible. Yes, extensibility requires more parsing, as
you stated above.
Note: if you have one-offs you could just hardcode an ioctl-like data
structure into a single TLV and use blocking netlink sockets; that should
get you pretty close to ioctl "speed".

To build more on reliability: if you really cared, there are mechanisms which
can be used to build a fully reliable channel to the kernel, since netlink is
in fact a wire protocol (which alas has been broken for a while because you
can't really use it as a wire protocol across machines); see for example:
https://datatracker.ietf.org/doc/html/rfc3549#section-2.3.2.1
And if you don't really care about reliability you can just shoot messages
into the kernel with the ACK flag turned off (and then issue requests
whenever you feel you need to check on the configuration).

Debuggability: extended ACKs (heavily used by networking) give user space
excellent, fine-grained operational information on errors (the famous EINVAL
can tell you exactly what the EINVAL means, for example).
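To make some of the above concrete, here is a rough (untested,
compile-it-yourself) sketch of what the recovery and extended-ACK pieces look
like from user space over a raw netlink socket. I am using rtnetlink's
RTM_GETLINK dump purely as a stand-in; the same pattern applies to any
family:

#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

static int do_dump(int fd)
{
	struct {
		struct nlmsghdr nlh;
		struct ifinfomsg ifm;
	} req = {
		.nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
		.nlh.nlmsg_type  = RTM_GETLINK,
		/* for config requests you would add NLM_F_ACK here to get a
		 * per-request acknowledgement, or leave it off and fire and
		 * forget as described above */
		.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
		.nlh.nlmsg_seq   = 1,
		.ifm.ifi_family  = AF_UNSPEC,
	};
	char buf[32768];

	if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0)
		return -errno;

	for (;;) {
		int len = recv(fd, buf, sizeof(buf), 0);

		if (len < 0)
			/* "data was lost" is signalled, not silent: on
			 * ENOBUFS the caller can simply restart the dump */
			return -errno;

		for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		     NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
			if (nlh->nlmsg_type == NLMSG_DONE)
				return 0;
			if (nlh->nlmsg_type == NLMSG_ERROR) {
				const struct nlmsgerr *e = NLMSG_DATA(nlh);
				/* with NETLINK_EXT_ACK enabled, a
				 * human-readable NLMSGERR_ATTR_MSG TLV may
				 * follow *e (NLM_F_ACK_TLVS is set when it
				 * does) - that is the "what exactly does
				 * this EINVAL mean" part */
				return e->error;
			}
			/* RTM_NEWLINK payload would be parsed here */
		}
	}
}

int main(void)
{
	int one = 1;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return 1;
	/* opt in to extended ACKs for fine-grained error reporting */
	setsockopt(fd, SOL_NETLINK, NETLINK_EXT_ACK, &one, sizeof(one));

	while (do_dump(fd) == -ENOBUFS)
		;	/* part of the dump was lost; just re-issue it */

	close(fd);
	return 0;
}

The point being: the failure modes are visible to the caller - a dump that
could not be fully delivered shows up as an error and can be re-issued, and a
rejected request comes back with a precise explanation instead of a bare
errno.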
Netlink also has a multicast publish-subscribe mechanism. Multicast being
one-to-many means it is a multi-user interface (an important detail for both
scaling and independent debugging): you can have multiple processes
subscribing to events that the kernel publishes. You don't have to resort to
polling the kernel for details of dynamic changes (e.g. "a new entry has been
added to table foo", etc.).
As a matter of fact, the original design used to allow user space to publish
to both the kernel and other user space apps (and unicast worked kernel/user
and user/user in both directions). I haven't looked at that recently, so it
could be broken.
Note: while these events are also subject to message loss, the netlink
robustness described earlier is usable here as well (via socket errors). For
example, if the kernel attempted to send an event which had the misfortune of
not making it, the user will be notified and can recover by requesting a
related table dump, etc. to see what changed.

- And as Nik mentioned: the new (yaml) model-to-generated-code approach that
is now common in generic netlink greatly reduces developer effort. Although
in my opinion we really need this stuff integrated into tools like iproute2..

I am pretty sure I left out some important details (maybe I can write a small
doc when I am in better shape).

cheers,
jamal
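P.S. Since I mentioned the pub-sub side, an equally rough sketch (again
untested, rtnetlink link events used only as an illustration) of a process
subscribing to kernel notifications and resyncing when it is told that events
were dropped:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
	struct sockaddr_nl sa = {
		.nl_family = AF_NETLINK,
		/* join the link-events group; any number of processes can
		 * subscribe to the same group independently */
		.nl_groups = RTMGRP_LINK,
	};
	char buf[8192];
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		return 1;

	for (;;) {
		int len = recv(fd, buf, sizeof(buf), 0);

		if (len < 0) {
			if (errno == ENOBUFS) {
				/* events were dropped; instead of silently
				 * missing state changes, resync here by
				 * issuing an RTM_GETLINK dump */
				continue;
			}
			break;
		}

		for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		     NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
			if (nlh->nlmsg_type == RTM_NEWLINK)
				printf("link added/changed\n");
			else if (nlh->nlmsg_type == RTM_DELLINK)
				printf("link removed\n");
		}
	}

	close(fd);
	return 0;
}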