RFC: KVM _CREATE_DEVICE considered harmful?

Christian Borntraeger <christian@xxxxxxxxxxxxxxx> · Wed, 16 Oct 2013 14:59:47 +0200

Folks,

from time to time I update valgrind or qemu to work reasonably well
with KVM.

Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an in kernel
kvm interrupt controller) with the following ioctl:

#define KVM_CREATE_DEVICE         _IOWR(KVMIO,  0xe0, struct kvm_create_device)

qemu can then work on these devices with the ioctls 

/* ioctls for fds returned by KVM_CREATE_DEVICE */
#define KVM_SET_DEVICE_ATTR       _IOW(KVMIO,  0xe1, struct kvm_device_attr)
#define KVM_GET_DEVICE_ATTR       _IOW(KVMIO,  0xe2, struct kvm_device_attr)
#define KVM_HAS_DEVICE_ATTR       _IOW(KVMIO,  0xe3, struct kvm_device_attr)

struct kvm_device_attr {
        __u32   flags;          /* no flags currently defined */
        __u32   group;          /* device-defined */
        __u64   attr;           /* group-defined */
        __u64   addr;           /* userspace address of attr data */
};

to communicate with the in-kvm devices to exchange data. There is an interesting
problem here: Properly defined ioctls allow to deduct the written or read area
of a ioctl. This is useful, e.g. for valgrind. For unknown ioctls, valgrind will
decode the ioctl number according to the _IOxx prefixes and deducts the memory
area that the kernels reads/writes. This is very helpful for definedness checkins.

For specific or more complex ioctls valgrind has special handling. The problem is
now, that looking at the current implementations of kvm device attr, all use 0,1,2,
etc for group/attr. So by inspecting the ioctl, valgrind cannot find out what device
it is talking to without tracking the original create ioctl.

So it seems that by using a multiplexing ioctls we actually make the interface
less well defined and more error prone than individual ioctls.

Another hint to look at are system calls. Older archs (like i386) have an IPC system
call that multiplexes several things. But time has shown that having a system for each
and every subfunctions is better that a multiplexer. (so newer archs like amd64 dont
have an ipc system call they have semop, semget etc. This also makes tools like strace
easier to implement.

Whats even more puzzling to me: The main complaint about ioctls is the possibility to
create arbitrary interfaces into the kernel. Now with the IORW family, we actually
had some limited form of control. With an additional uncoordinated group/attr 
parameter we dropped all of that.

So how to process from here? 
a) leave it as is and accept false positives from valgrind and undecoded straces
b) We could enhance the device_addr group/attr values with size macros, e.g
the same as _IOWR (do we need direction?) for future users
c) Avoid the device api for new code and go back to individual ioctls in KVM

The reason for KVM_CREATE_DEVICE seems to be that in qemu we want to model everything 
as a device. Having an in-kernel KVM device seemed logical. The problem is, that we
should really have a 2nd look at the Linux system call ABI. Does it have to follow the
device model? especially if the interface has obvious draw backs.

Christian
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html