On 01/27/2016 06:56 AM, Alex Williamson wrote:
> On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
>>> From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>>> Sent: Wednesday, January 27, 2016 6:27 AM
>>>
>>> On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
>>>>> From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>>>>> Sent: Wednesday, January 27, 2016 6:08 AM
>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
>>>>>>>> KVM, so VM MMIO access will be forwarded to KVMGT directly for
>>>>>>>> emulation in kernel. If we reuse above R/W flags, the whole emulation
>>>>>>>> path would be unnecessarily long with obvious performance impact. We
>>>>>>>> either need a new flag here to indicate in-kernel emulation (bias from
>>>>>>>> passthrough support), or just hide the region alternatively (let KVMGT
>>>>>>>> to handle I/O emulation itself like today).
>>>>>>>
>>>>>>> That sounds like a future optimization TBH. There's very strict
>>>>>>> layering between vfio and kvm. Physical device assignment could make
>>>>>>> use of it as well, avoiding a round trip through userspace when an
>>>>>>> ioread/write would do. Userspace also needs to orchestrate those kinds
>>>>>>> of accelerators, there might be cases where userspace wants to see those
>>>>>>> transactions for debugging or manipulating the device. We can't simply
>>>>>>> take shortcuts to provide such direct access. Thanks,
>>>>>>>
>>>>>>
>>>>>> But we have to balance such debugging flexibility and acceptable performance.
>>>>>> To me the latter one is more important otherwise there'd be no real usage
>>>>>> around this technique, while for debugging there are other alternative (e.g.
>>>>>> ftrace) Consider some extreme case with 100k traps/second and then see
>>>>>> how much impact a 2-3x longer emulation path can bring...
>>>>>
>>>>> Are you jumping to the conclusion that it cannot be done with proper
>>>>> layering in place? Performance is important, but it's not an excuse to
>>>>> abandon designing interfaces between independent components. Thanks,
>>>>>
>>>>
>>>> Two are not controversial. My point is to remove unnecessary long trip
>>>> as possible. After another thought, yes we can reuse existing read/write
>>>> flags:
>>>> - KVMGT will expose a private control variable whether in-kernel
>>>> delivery is required;
>>>
>>> But in-kernel delivery is never *required*. Wouldn't userspace want to
>>> deliver in-kernel any time it possibly could?
>>>
>>>> - when the variable is true, KVMGT will register in-kernel MMIO
>>>> emulation callbacks then VM MMIO request will be delivered to KVMGT
>>>> directly;
>>>> - when the variable is false, KVMGT will not register anything.
>>>> VM MMIO request will then be delivered to Qemu and then ioread/write
>>>> will be used to finally reach KVMGT emulation logic;
>>>
>>> No, that means the interface is entirely dependent on a backdoor through
>>> KVM. Why can't userspace (QEMU) do something like register an MMIO
>>> region with KVM handled via a provided file descriptor and offset,
>>> couldn't KVM then call the file ops without a kernel exit? Thanks,
>>>
>>
>> Could you elaborate this thought? If it can achieve the purpose w/o
>> a kernel exit definitely we can adapt to it. :-)
>
> I only thought of it when replying to the last email and have been doing
> some research, but we already do quite a bit of synchronization through
> file descriptors.
> The kvm-vfio pseudo device uses a group file
> descriptor to ensure a user has access to a group, allowing some degree
> of interaction between modules. Eventfds and irqfds already make use of
> f_ops on file descriptors to poke data. So, if KVM had information that
> an MMIO region was backed by a file descriptor for which it already has
> a reference via fdget() (and verified access rights and whatnot), then
> it ought to be a simple matter to get to f_ops->read/write knowing the
> base offset of that MMIO region. Perhaps it could even simply use
> __vfs_read/write(). Then we've got a proper reference to the file
> descriptor for ownership purposes and we've transparently jumped across
> modules without any implicit knowledge of the other end. Could it work?

This is OK for KVMGT: the path from the f_ops to the vGPU device-model
would always be simple. The only question is, how is the KVM hypervisor
supposed to get the fd on a VM exit? Copy-and-pasting the current
implementation of vcpu_mmio_write(), it seems nothing but the GPA and
len are provided:

static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
			   const void *v)
{
	int handled = 0;
	int n;

	do {
		n = min(len, 8);
		if (!(vcpu->arch.apic &&
		      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev,
					  addr, n, v)) &&
		    kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
			break;
		handled += n;
		addr += n;
		len -= n;
		v += n;
	} while (len);

	return handled;
}

If we back a GPA range with an fd, won't this also be a 'backdoor'?

> Thanks,
>
> Alex
>

--
Thanks,
Jike
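P.S. To make the question a bit more concrete, here is a purely
hypothetical sketch (not against any real tree; fd_backed_mmio_dev and
the fd_mmio_* names are made up for illustration) of how an fd-backed
region might hang off the existing KVM_MMIO_BUS lookup, so that
vcpu_mmio_write() would still only need the GPA and len at VM-exit time,
the fd having been resolved once at registration:

#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/kvm_host.h>
#include <kvm/iodev.h>

struct fd_backed_mmio_dev {
	struct kvm_io_device	dev;
	struct file		*filp;	/* reference taken at registration */
	gpa_t			base;	/* guest-physical base of the region */
	loff_t			offset;	/* offset of the region within filp */
};

static int fd_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
			 gpa_t addr, int len, const void *val)
{
	struct fd_backed_mmio_dev *d =
		container_of(this, struct fd_backed_mmio_dev, dev);
	loff_t pos = d->offset + (addr - d->base);
	mm_segment_t old_fs = get_fs();
	ssize_t ret;

	/*
	 * __vfs_write() expects a user pointer; borrow the usual set_fs()
	 * trick for a kernel buffer (a dedicated f_op would avoid this).
	 */
	set_fs(KERNEL_DS);
	ret = __vfs_write(d->filp, (const char __user *)val, len, &pos);
	set_fs(old_fs);

	return ret == len ? 0 : -EOPNOTSUPP;	/* non-zero means not handled */
}

static const struct kvm_io_device_ops fd_mmio_ops = {
	.write	= fd_mmio_write,
	/* .read would mirror the above with __vfs_read() */
};

Registration would presumably be kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
base, len, &d->dev) after fdget() and access checks on the fd passed in
from userspace; whether that per-GPA-range lookup still counts as a
'backdoor' is exactly what I am asking above.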