On Wed, Mar 19, 2025 at 02:21:23PM -0400, Jamal Hadi Salim wrote:
> Curious how you guarantee that a "destroy" will not fail under OOM. Do
> you have pre-allocated memory?

It just never allocates memory. Why would a simple system call like a
destruction allocate any memory?

> > Overall system calls here should either succeed or fail and be the
> > same as a NOP. No failure that actually did something and then creates
> > some resource leak or something because userspace didn't know about
> > it.
>
> Yes, this is how netlink works as well. If a failure to delete an
> object occurs then every transient state gets restored. This is always
> the case for simple requests (a delete/create/update). For requests
> that batch multiple objects there are cases where there is no
> unwinding.

I'm not sure that is completely true. For example, if userspace messes
up the netlink read() side of the API and copy_to_user() fails then you
can get these inconsistencies. In the RDMA model even those edge cases
are properly unwound, just like a normal system call would be.

> Makes sense. So ioctls with TLVs ;->
> I am suspecting you don't have concepts of TLVs inside TLVs for
> hierarchies within objects.

No, it has not been needed yet, or at least the cases that have come up
have been happy to use arrays of structs for the nesting. The method
calls themselves don't tend to have that kind of challenging structure
for their arguments.

> > RDMA also has special infrastructure to split up the TLV space between
> > core code and HW driver code which is a key feature and necessary part
> > of how you'd build a user/kernel split driver.
>
> The T namespace is split between core code and driver code?
> I can see that as being useful for debugging maybe? What else?

RDMA is all about having a user/kernel driver co-design. This means a
driver has code in a userspace library and code in the kernel that work
together to implement the functionality. The userspace library should
be thought of as an extension of the kernel driver into userspace.

So, there is a lot of traffic between the two driver components that is
just private and unique to the driver. This is what the driver
namespace is used for.

For instance, there is a common method call to create a queue. The
queue has a number of core parameters, like depth and address, then it
calls the driver and there are a bunch of device-specific parameters
too, like say the queue entry format. Every driver gets to define its
own parameters best suited to its own device and its own user/kernel
split (there is a sketch of how this looks further down).

Building a split user/kernel driver is complicated and uAPI is one of
the biggest challenges :\

> > > - And as Nik mentioned: The new (yaml)model-to-generatedcode approach
> > >   that is now common in generic netlink highly reduces developer effort.
> > >   Although in my opinion we really need this stuff integrated into tools
> > >   like iproute2..
> >
> > RDMA also has a DSL-like scheme for defining schema, and centralized
> > parsing and validation. IMHO its capability falls someplace between
> > the old netlink policy stuff and the new YAML stuff.
>
> I meant the ability to start with a data model and generate code as
> being useful.
> Where can i find the RDMA DSL?

It is done with the C preprocessor instead of an external YAML file.
Look at drivers/infiniband/core/uverbs_std_types_mr.c at the end. It
describes a data model, but it is elaborated at runtime into an
efficient parse tree, not by using a code generator.
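To make the driver namespace split mentioned above concrete:
driver-private attribute IDs live above UVERBS_ID_NS_SHIFT so they can
never collide with the core IDs, and the driver hangs its extra
attributes off a core method using the same preprocessor scheme. It
looks roughly like this (a sketch from memory of the mlx5 code, the
exact names may differ from what is in the tree today):

/* Driver attribute IDs start in their own number space */
enum mlx5_ib_create_cq_attrs {
        MLX5_IB_ATTR_CREATE_CQ_UAR_INDEX = (1U << UVERBS_ID_NS_SHIFT),
};

/* Attach a driver specific input attribute to the core CQ create
 * method. The core parses and validates it like any other attribute. */
ADD_UVERBS_ATTRIBUTES_SIMPLE(
        mlx5_ib_create_cq,
        UVERBS_OBJECT_CQ,
        UVERBS_METHOD_CQ_CREATE,
        UVERBS_ATTR_PTR_IN(MLX5_IB_ATTR_CREATE_CQ_UAR_INDEX,
                           UVERBS_ATTR_TYPE(u32),
                           UA_OPTIONAL));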
The schema is a more classical object-oriented RPC-type scheme where
you define objects, methods and then method parameters. The objects
have an entire kernel-side infrastructure to manage their lifecycle,
and the attributes have validation and parsing done prior to reaching
the C function implementing the method.

I always thought it was netlink inspired, but more suited to building a
uAPI out of. Like you get actual system call names (eg
UVERBS_METHOD_REG_DMABUF_MR) that have actual C functions implementing
them. There is special help to implement object allocation and
destruction functions, and freedom to have as many methods per object
as make sense.

> I dont know enough about RDMA infra to comment but iiuc, you are
> saying that it is the control infrastructure (that sits in
> userspace?), that does all those things you mention, that is more
> important.

There is an entire object model in the kernel and it is linked into the
schema. For instance, in the above example we have a schema for an
object method like this:

DECLARE_UVERBS_NAMED_METHOD(
        UVERBS_METHOD_REG_DMABUF_MR,
        UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_HANDLE,
                        UVERBS_OBJECT_MR,
                        UVERBS_ACCESS_NEW,
                        UA_MANDATORY),
        UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE,
                        UVERBS_OBJECT_PD,
                        UVERBS_ACCESS_READ,
                        UA_MANDATORY),

That says it accepts two object handles, MR and PD, as input to the
method call. The core code keeps track of all these object handles,
validates that the ID number given by userspace is referring to the
correct object, of the correct type, in the correct state, locks things
against concurrent destruction, and then gives a trivial way for the C
method implementation to pick up the object pointer:

        struct ib_pd *pd = uverbs_attr_get_obj(
                attrs, UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE);

Which can't fail because everything was already checked before we get
here. This is all designed to greatly simplify and make robust the
method implementations that are often in driver code.
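For completeness, the skeleton of the C function behind that method
looks roughly like this (trimmed from uverbs_std_types_mr.c, with the
actual registration work elided):

static int UVERBS_HANDLER(UVERBS_METHOD_REG_DMABUF_MR)(
        struct uverbs_attr_bundle *attrs)
{
        /* Both handles were already validated, type checked and locked
         * by the core code, so neither lookup can fail here. */
        struct ib_uobject *uobj = uverbs_attr_get_uobject(
                attrs, UVERBS_ATTR_REG_DMABUF_MR_HANDLE);
        struct ib_pd *pd = uverbs_attr_get_obj(
                attrs, UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE);

        /* ... the driver call that actually creates the MR ... */
        return 0;
}

Jason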