Re: RFC: KVM _CREATE_DEVICE considered harmful?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 16/10/13 17:44, Gleb Natapov wrote:
> On Wed, Oct 16, 2013 at 02:59:47PM +0200, Christian Borntraeger wrote:
>> Folks,
>>
>> from time to time I update valgrind or qemu to work reasonably well
>> with KVM.
>>
>> Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an in kernel
>> kvm interrupt controller) with the following ioctl:
>>
>> #define KVM_CREATE_DEVICE         _IOWR(KVMIO,  0xe0, struct kvm_create_device)
>>
>> qemu can then work on these devices with the ioctls 
>>
>> /* ioctls for fds returned by KVM_CREATE_DEVICE */
>> #define KVM_SET_DEVICE_ATTR       _IOW(KVMIO,  0xe1, struct kvm_device_attr)
>> #define KVM_GET_DEVICE_ATTR       _IOW(KVMIO,  0xe2, struct kvm_device_attr)
>> #define KVM_HAS_DEVICE_ATTR       _IOW(KVMIO,  0xe3, struct kvm_device_attr)
>>
>> struct kvm_device_attr {
>>         __u32   flags;          /* no flags currently defined */
>>         __u32   group;          /* device-defined */
>>         __u64   attr;           /* group-defined */
>>         __u64   addr;           /* userspace address of attr data */
>> };
>>
>> to communicate with the in-kvm devices to exchange data. There is an interesting
>> problem here: Properly defined ioctls allow to deduct the written or read area
>> of a ioctl. This is useful, e.g. for valgrind. For unknown ioctls, valgrind will
>> decode the ioctl number according to the _IOxx prefixes and deducts the memory
>> area that the kernels reads/writes. This is very helpful for definedness checkins.
>>
> And is not very helpful if you need to change structure size or it
> changes by itself because of 32bit->64bit move.

Well, it has pros and cons. The con is, that you need all kind of compat handling
when changing structures, the pro is, that unintentional changes are detected if
the size changes. 

> 
>> For specific or more complex ioctls valgrind has special handling. The problem is
>> now, that looking at the current implementations of kvm device attr, all use 0,1,2,
>> etc for group/attr. So by inspecting the ioctl, valgrind cannot find out what device
>> it is talking to without tracking the original create ioctl.
>>
>> So it seems that by using a multiplexing ioctls we actually make the interface
>> less well defined and more error prone than individual ioctls.
> I do not see why it is more error prone. Yes, it is harder for valgrind
> to track it, but this by itself does not make it error prone.

By itself it looks like not more error prone. But it is easier to to mess up - e.g. 
making a structure bigger would cause an immediate error with old user space in case of
ioctls, but changing the target structure of get/set_attr would be unnoticed. On 
the other hand, changing the structure without a size change is not detected - so yeah
the extra checking is not fully sufficient.

> 
>>
>> Another hint to look at are system calls. Older archs (like i386) have an IPC system
>> call that multiplexes several things. But time has shown that having a system for each
>> and every subfunctions is better that a multiplexer. (so newer archs like amd64 dont
>> have an ipc system call they have semop, semget etc. This also makes tools like strace
>> easier to implement.
>>
> The difference is in amount of those subfunctions. We can define
> separate ioctl for each cpu/device register of each architecture and it
> will be nicely straceable, but how much new ioctls this will create? It
> will easily reach 1000s quickly. So at some point you start multiplex.
> ONE_REG interface is multiplexer, DEVICE_ATTR is same.

Agreed sometimes you need multiplexing. ONE_REG seems like a good example.
> 
>> Whats even more puzzling to me: The main complaint about ioctls is the possibility to
>> create arbitrary interfaces into the kernel. Now with the IORW family, we actually
>> had some limited form of control. With an additional uncoordinated group/attr 
>> parameter we dropped all of that.
> How this control is actually used in the kernel?

Only implicit, a change in the argument would cause a different ioctl number. 
> 
>>
>>
>> So how to process from here? 
>> a) leave it as is and accept false positives from valgrind and undecoded straces
> strace still cannot decode kvm ioctls after all this years. I am not
> concerned about straces at all, it is less then useless to debug KVM.

Yes, the code of strace of totally messed up. :-( 
But on my fedora 18 it actually decodes some of the KVM ioctls (e.g. KVM_RUN).


> 
>> b) We could enhance the device_addr group/attr values with size macros, e.g
>> the same as _IOWR (do we need direction?) for future users
> I am fine with that. But even in KVM this is not the first interface
> that has this drawback. See KVM_GET_DIRTY_LOG for instance.

There is additional difference. _IOW(KVMIO,  0x42, struct kvm_dirty_log) is 
unique, so valgrind can have special code to handle that. 1,2,3 is not unique.



> 
>> c) Avoid the device api for new code and go back to individual ioctls in KVM
> Individual ioctls were a mess. People are adding devices without even
> knowing beforehand how final interface will look like. To accommodate
> such "design" strategy devices need to have as fine grained interfaces as
> possible otherwise ioctls will be deprecated faster than added (that was
> justification for ONE_REG too BTW). If you want fine grained interface
> you need to multiplex at some point or add hundreds of ioctls, if you
> need to multiplex anyway multiplexing at device level is logical thing
> to do. I understand valgrind sentiment, but it should not be a central
> point of interface design.

I agree, that valgrind should not be a primary target for an interface.
I also agree that multiplexing at the right place is necessary and good
(e.g. ONE_REG). Multiplexing at a device level also looked right when Alex
showed me the interface - maybe its just the problem with valgrind that
makes me feel different now.

I more or less wanted to bring up this point to start the discussion, if we can
make this interface better (thats why the subject has RFC and "?")
Valgrind was the trigger that actually made me look closer. 
In the end we might just leave it as is, if we cant make the forward/backward
compatibility without lots of compat code.
We can always provide a suppression file for qemu under valgrind.

> 
>>
>> The reason for KVM_CREATE_DEVICE seems to be that in qemu we want to model everything 
>> as a device. Having an in-kernel KVM device seemed logical. The problem is, that we
>> should really have a 2nd look at the Linux system call ABI. Does it have to follow the
>> device model? especially if the interface has obvious draw backs.
>>
> Try to add new Linux system call. It's very hard and for a good reason,
> you do not want to have 1000s of them where most of then is one-trick
> pony and mostly deprecated (but you cannot remove them or reuse them
> anyway). If we apply the same constrains to kvm ioctl additions as Linux
> has for system calls, KVM development will stall and people will be less
> then happy.

Agree and disagree :-) ioctls have a bad reputation because they were used to create pony
interfaces in the past but ioctls are ABI as well.


> 
> --
> 			Gleb.

Thanks for your quick response.

Christian 
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux