RFC: Ioctl v2

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Fri, 20 May 2022 12:16:52 -0400

At LSF we had our annual talk about all the ways ioctls suck. It got me thinking
about a proposal for a simple lightweight replacement.

Problems with ioctls:

* There's no namespacing, and ioctl numbers clash

Ioctl numbers clashing isn't a _huge_ issue in practice because you'll only have
so many chunks of code handling ioctls for the same FD (VFS, filesystem or
driver) and because ioctl struct size is also dispatched on, but it is pretty
gross - there's nothing preventing different drivers from picking the same ioctl
numbers. And since we've got one byte for the "namespace" and another byte for
the ioctl number, and according to my grep 7k different ioctls, betcha this
happens somewhere - I think Luis had an example at LSF.

Where the lack of real namespacing bites us more is when ioctls get promoted
from filesystem or driver on up. Ted had a good example of an ext2 ioctl getting
promoted to the VFS when it really shouldn't have, because it was exposing ext2
specific data structures.

But because this is as simple as changing a #define EXT2_IOC to #define FS_IOC,
it's really easy to do without adequate review - you don't have to change the
ioctl number and break userspace, so why would you?

Introducing real namespacing would mean that promoting an ioctl to the VFS level
would really have to be a new ioctl, and it'll get people to think more about
what the new ioctl would be.

* The calling convention sucks

With ioctls, you have to define a struct for your parameters, and struct members
might be used for inputs, or outputs, or both.

The problem is, these structs really need to be fully portable the same way
structs defined for on disk formats have to be, and we've got no way of checking
for this. This is a real minefield: if you need to pass a pointer type, you
can't pass a pointer because sizeof(void *) is different (and kernel space might
be 64 bit, with userspace 32 bit or 64 bit) - and you can't pass a ulong either,
it has to be a u64.

The whole "define a struct for your parameters" was a hack and a bad idea.
Ioctls are just function calls - they're driver-private syscalls - and they
should work like function calls.

IOCTL v2 proposal:

* Namespacing

To solve the namespacing issue, I want to steal an approach I've seen from
OpenGL, where extensions are namespaced: an extension will be referenced by name
where the name is vendor.foo, and when an extension becomes standard it gets a
new name (arb.foo instead of nvidia.foo, I think? it's been awhile).

To do this we'll need to define ioctls by name, not by hardcoded number, and
likewise userspace will have to call ioctls by name, not by number. To avoid a
string lookup on every ioctl call, I propose a new syscall

int sys_get_ioctl_nr(char *name)

And then userspace will just call this once for every ioctl it uses, either at
program startup or lazily when an ioctl is first called. This can all be nicely
hidden in a little wrapper library.

We'll want to randomize ioctl numbers in kernel space, to ensure userspace
_can't_ hard code them.

Also, another thing that came up at LSF was introspection, it's hard for
strace() et al to handle ioctls. Implementing this name -> nr mapping will give
us a registry of ioctls supported on a given kernel which we can make available
in /proc; and while we're at it, why not include the prototype too?

* Better calling convention

ioctls are just private syscalls. Syscalls look like normal function calls, why
can't ioctls?  Some ioctls do complicated things that require defining structs
with all the tricky layout rules that we kernel devs have all had beaten into
our brains - but most probably would not, if we could do normal-looking function
calls.

Well, syscalls do require arch specific code to handle calling conventions, and
we don't want that. What I propose doing is having the underlying syscall be

#define IOCTL_MAXARGS	8

struct ioctl_args {
	__u64	args[IOCTL_MAXARGS];
};

int sys_ioctl_v2(int fd, int ioctl_nr, struct ioctl_args __user *args)

Userspace won't call this directly. Userspace will call normal looking
functions, like:

int bcachefs_ioctl_disk_add(int fd, unsigned flags, char __user *disk_path);

Which will be a wrapper that casts the function arguments to u64s (or s64s for
signed integers, so that we don't have surprises when kernel space and user
space disagree about sizeof(long)) and then does the actual syscall.

------------------

I want to circulate this and get some comments and feedback, and if no one
raises any serious objections - I'd love to get collaborators to work on this
with me. Flame away!