At LSF we had our annual talk about all the ways ioctls suck. It got me thinking about a proposal for a simple lightweight replacement.

Problems with ioctls:

 * There's no namespacing, and ioctl numbers clash

Ioctl numbers clashing isn't a _huge_ issue in practice because you'll only have so many chunks of code handling ioctls for the same FD (VFS, filesystem or driver) and because ioctl struct size is also dispatched on, but it is pretty gross - there's nothing preventing different drivers from picking the same ioctl numbers. And since we've got one byte for the "namespace" and another byte for the ioctl number, and according to my grep 7k different ioctls, betcha this happens somewhere - I think Luis had an example at LSF.

Where the lack of real namespacing bites us more is when ioctls get promoted from filesystem or driver on up. Ted had a good example of an ext2 ioctl getting promoted to the VFS when it really shouldn't have, because it was exposing ext2 specific data structures.

But because this is as simple as changing a #define EXT2_IOC to #define FS_IOC, it's really easy to do without adequate review - you don't have to change the ioctl number and break userspace, so why would you?

Introducing real namespacing would mean that promoting an ioctl to the VFS level would really have to be a new ioctl, and it'll get people to think more about what the new ioctl would be.

 * The calling convention sucks

With ioctls, you have to define a struct for your parameters, and struct members might be used for inputs, or outputs, or both.

The problem is, these structs really need to be fully portable the same way structs defined for on disk formats have to be, and we've got no way of checking for this. This is a real minefield: if you need to pass a pointer type, you can't pass a pointer because sizeof(void *) is different (and kernel space might be 64 bit, with userspace 32 bit or 64 bit) - and you can't pass a ulong either, it has to be a u64.

The whole "define a struct for your parameters" was a hack and a bad idea. Ioctls are just function calls - they're driver-private syscalls - and they should work like function calls.

IOCTL v2 proposal:

 * Namespacing

To solve the namespacing issue, I want to steal an approach I've seen from OpenGL, where extensions are namespaced: an extension will be referenced by name where the name is vendor.foo, and when an extension becomes standard it gets a new name (arb.foo instead of nvidia.foo, I think? it's been a while).

To do this we'll need to define ioctls by name, not by hardcoded number, and likewise userspace will have to call ioctls by name, not by number. To avoid a string lookup on every ioctl call, I propose a new syscall

	int sys_get_ioctl_nr(char *name)

And then userspace will just call this once for every ioctl it uses, either at program startup or lazily when an ioctl is first called. This can all be nicely hidden in a little wrapper library.

We'll want to randomize ioctl numbers in kernel space, to ensure userspace _can't_ hard code them.

Also, another thing that came up at LSF was introspection: it's hard for strace et al to handle ioctls. Implementing this name -> nr mapping will give us a registry of ioctls supported on a given kernel which we can make available in /proc; and while we're at it, why not include the prototype too?

 * Better calling convention

ioctls are just private syscalls. Syscalls look like normal function calls, why can't ioctls?

Some ioctls do complicated things that require defining structs with all the tricky layout rules that we kernel devs have all had beaten into our brains - but most probably would not, if we could do normal-looking function calls.

Well, syscalls do require arch specific code to handle calling conventions, and we don't want that.

What I propose doing is having the underlying syscall be

	#define IOCTL_MAXARGS	8

	struct ioctl_args {
		__u64	args[IOCTL_MAXARGS];
	};

	int sys_ioctl_v2(int fd, int ioctl_nr, struct ioctl_args __user *args)

Userspace won't call this directly. Userspace will call normal looking functions, like:

	int bcachefs_ioctl_disk_add(int fd, unsigned flags, char __user *disk_path);

Which will be a wrapper that casts the function arguments to u64s (or s64s for signed integers, so that we don't have surprises when kernel space and user space disagree about sizeof(long)) and then does the actual syscall.

------------------

I want to circulate this and get some comments and feedback, and if no one raises any serious objections - I'd love to get collaborators to work on this with me. Flame away!