Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction)

On Tue, Mar 18, 2025 at 6:49 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Sat, Mar 15, 2025 at 04:49:20PM -0400, Jamal Hadi Salim wrote:
>
> > On "unreliable": This is typically a result of some request response
> > (or a subscribed to event) whose execution has failed to allocate
> > memory in the kernel or overrun some buffers towards user space;
> > however, any such failures are signalled to user space and can be
> > recovered from.
>
> No, they can't be recovered from in all cases.
> Randomly failing system
> calls because of memory pressure is a horrible foundation on which to
> build something like RDMA. It is not acceptable that something
> like a destroy system call would just randomly fail because the kernel
> is OOMing. There is no recovery from this beyond leaking memory - the
> opposite of what you want in an OOM situation.
>

Curious how you guarantee that a "destroy" will not fail under OOM. Do
you have pre-allocated memory?
Note: basic request-response netlink messaging - such as a destroy,
which merely returns a success/fail indication - _should not fail_
once the message reaches the kernel consumer (the tc subsystem, for
example). A request to destroy, create, or update an object in the
kernel fits this category.
What may fail because of memory pressure are requests that solicit
data (typically lots of data) from the kernel. For example, if you
dump a large kernel table that won't fit in one netlink message, it
is sent to user space in multiple messages; somewhere after the first
chunk has been sent your way, we may hit an OOM condition. For these
sorts of message types, user space is signalled so it can recover.
"Recover" could mean issuing another message to continue where we
left off...
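
To make that concrete, here is a minimal sketch of such a dump loop
over a raw AF_NETLINK socket (RTM_GETLINK as the example; this is
illustrative only, not lifted from any driver). recv() failing with
ENOBUFS is the kernel signalling that replies were dropped, and the
recovery here is simply to reissue the dump:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

static int send_dump(int fd, unsigned int seq)
{
	struct {
		struct nlmsghdr nlh;
		struct ifinfomsg ifm;
	} req = {
		.nlh = {
			.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
			.nlmsg_type  = RTM_GETLINK,
			.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
			.nlmsg_seq   = seq,
		},
		.ifm = { .ifi_family = AF_UNSPEC },
	};

	return send(fd, &req, req.nlh.nlmsg_len, 0) < 0 ? -errno : 0;
}

int main(void)
{
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	unsigned int seq = 1;
	char buf[16384];

	if (fd < 0 || send_dump(fd, seq))
		return 1;

	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf), 0);

		if (len < 0 && errno == ENOBUFS) {
			/* The kernel dropped replies (e.g. rcvbuf overrun
			 * under memory pressure). We are told about it,
			 * and "recovery" is reissuing the dump. */
			if (send_dump(fd, ++seq))
				return 1;
			continue;
		}
		if (len <= 0)
			return 1;

		for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
		     NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
			if (nlh->nlmsg_seq != seq)
				continue; /* stale chunk from old dump */
			if (nlh->nlmsg_type == NLMSG_DONE)
				goto done; /* dump completed in full */
			if (nlh->nlmsg_type == NLMSG_ERROR)
				return 1;  /* the request itself failed */
			printf("ifindex %d\n",
			       ((struct ifinfomsg *)NLMSG_DATA(nlh))->ifi_index);
		}
	}
done:
	close(fd);
	return 0;
}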

> > ioctl is synchronous, which gives it the "reliability" and "speed".
> > iirc, if a memory failure were to happen on an ioctl, would it block
> > until it is successful?
>
> It would fail back to userspace and unwind whatever it did.
>

Very similar with netlink.

> The unwinding is tricky and RDMA's infrastructure has a lot of support
> to make it easier for driver writers to get this right in all the
> different error cases.
>
> Overall, system calls here should either succeed, or fail and be the
> same as a NOP. No failure that actually did something and then creates
> some resource leak or something because userspace didn't know about
> it.
>

Yes, this is how netlink works as well. If a failure occurs while
deleting an object, every piece of transient state is restored. This
is always the case for simple requests (a delete/create/update). For
requests that batch multiple objects there are cases where there is
no unwinding: for example, you could send a request to create a bunch
of objects in the kernel, and halfway through the kernel fails for
whatever reason and has to bail out.
Most of the subsystems I have seen return a "success" even though
they only succeeded on the first half. Some return a success with a
count of how many objects were created (see the sketch below).
It is feasible, at a per-subsystem level, to set flags which would
instruct the kernel which mode to use, etc.
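
A rough sketch of those batch semantics (fill_create_req() is a
hypothetical helper standing in for subsystem-specific message
construction, not a real API). The kernel consumes the packed
messages in order; a failure at message k does not unwind messages
0..k-1, and with NLM_F_ACK on each message user space can at least
count how far the batch got:

#include <sys/socket.h>
#include <linux/netlink.h>

/* Hypothetical: appends one subsystem-specific "create" request at
 * 'nlh' and returns its total length. */
extern int fill_create_req(struct nlmsghdr *nlh, unsigned int seq,
			   int obj_id);

static int send_batch(int fd, int nobjs)
{
	char buf[8192]; /* assumes all the requests fit */
	unsigned int off = 0;

	for (int i = 0; i < nobjs; i++) {
		struct nlmsghdr *nlh = (struct nlmsghdr *)(buf + off);
		int len = fill_create_req(nlh, i + 1, i);

		/* Ask for an ack per message so we can tell how many
		 * creates succeeded before a mid-batch failure. */
		nlh->nlmsg_flags |= NLM_F_REQUEST | NLM_F_ACK;
		off += NLMSG_ALIGN(len);
	}
	return send(fd, buf, off, 0) < 0 ? -1 : 0;
}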

> > Extensibility: ioctls take binary structs, which makes them much
> > harder to extend but adds to that "speed". Once you pick your struct,
> > you are stuck with it - as opposed to netlink, which uses formally
> > defined TLVs that make it highly extensible.
>
> RDMA uses TLVs now too. It has one of the largest uAPI surfaces in the
> kernel, TLVs were introduced for the same reason netlink uses them.
>

Makes sense. So ioctls with TLVs ;->
I suspect you don't have the concept of TLVs inside TLVs, i.e.
nesting, for expressing hierarchies within objects.
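
On the netlink side, that nesting looks roughly like this in kernel
code (the MYATTR_* IDs are made up for the example; nla_nest_start()
and nla_nest_end() are the real helpers):

#include <net/netlink.h>

enum {
	MYATTR_UNSPEC,
	MYATTR_STATS,		/* nested container */
	MYATTR_STATS_PKTS,	/* u32 inside the nest */
	MYATTR_STATS_BYTES,	/* u32 inside the nest */
};

static int put_stats(struct sk_buff *skb, u32 pkts, u32 bytes)
{
	/* Open a container TLV whose payload is itself a TLV list. */
	struct nlattr *nest = nla_nest_start(skb, MYATTR_STATS);

	if (!nest)
		return -EMSGSIZE;
	if (nla_put_u32(skb, MYATTR_STATS_PKTS, pkts) ||
	    nla_put_u32(skb, MYATTR_STATS_BYTES, bytes)) {
		nla_nest_cancel(skb, nest); /* unwind the partial nest */
		return -EMSGSIZE;
	}
	nla_nest_end(skb, nest); /* patch the container's length */
	return 0;
}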

> RDMA also has special infrastructure to split up the TLV space between
> core code and HW driver code which is a key feature and necessary part
> of how you'd build a user/kernel split driver.
>

So the T namespace is split between core code and driver code?
I can see that being useful for debugging, maybe. What else?
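
For reference, my mental model of such a split, as a generic sketch
(not RDMA's actual layout; the parser names are hypothetical): a
reserved bit in the 16-bit TLV type routes the attribute to either
core or HW-driver parsing, so the two ID spaces can grow
independently and driver-private attributes can pass through core
code that knows nothing about their schema:

#include <stdint.h>

/* Hypothetical per-namespace parsers, not a real API. */
int core_parse_attr(uint16_t id, const void *payload, unsigned int len);
int driver_parse_attr(uint16_t id, const void *payload, unsigned int len);

#define ATTR_NS_DRIVER	0x8000u	/* high bit selects driver namespace */
#define ATTR_ID_MASK	0x7fffu

static int parse_attr(uint16_t type, const void *payload, unsigned int len)
{
	uint16_t id = type & ATTR_ID_MASK;

	/* Core validates and routes; driver attributes flow through to
	 * the HW driver without the core knowing their meaning. */
	if (type & ATTR_NS_DRIVER)
		return driver_parse_attr(id, payload, len);
	return core_parse_attr(id, payload, len);
}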

> > - And as Nik mentioned: the new (yaml) model-to-generated-code
> > approach that is now common in generic netlink greatly reduces
> > developer effort.
> > Although in my opinion we really need this stuff integrated into tools
> > like iproute2..
>
> RDMA also has a DSL-like scheme for defining schema, and centralized
> parsing and validation. IMHO its capability falls somewhere between
> the old netlink policy stuff and the new YAML stuff.
>

I meant that the ability to start with a data model and generate code
is what is useful.
Where can I find the RDMA DSL?

> But just focusing on schema and TLVs really undersells all the
> specialized infrastructure that exists for managing objects, security,
> HW pass through and other infrastructure things unique to RDMA.
>

I don't know enough about the RDMA infra to comment, but IIUC you are
saying that it is the control infrastructure (which sits in user
space?), doing all those things you mention, that is more important.
IMO, when you start building complex systems that is always the case
(the "mechanism vs. policy" principle).


cheers,
jamal




