On Thu, 15 Dec 2016 00:25:13 +0200 "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote: > On Wed, Dec 14, 2016 at 03:16:37PM -0700, Alex Williamson wrote: > > On Wed, 14 Dec 2016 18:24:23 +0800 > > Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote: > > > > > Sorry for late. > > > after reading all your comments, I think I will try the solution 1. > > > > > > On 12/13/2016 03:12 AM, Alex Williamson wrote: > > > > On Mon, 12 Dec 2016 21:49:01 +0800 > > > > Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote: > > > > > > > >> Hi, > > > >> I have 2 solutions(high level design) came to me, please see if they are > > > >> acceptable, or which one is acceptable. Also have some questions. > > > >> > > > >> 1. block guest access during host recovery > > > >> > > > >> add new field error_recovering in struct vfio_pci_device to > > > >> indicate host recovery status. aer driver in host will still do > > > >> reset link > > > >> > > > >> - set error_recovering in vfio-pci driver's error_detected, used to > > > >> block all kinds of user access(config space, mmio) > > > >> - in order to solve concurrent issue of device resetting & user > > > >> access, check device state[*] in vfio-pci driver's resume, see if > > > >> device reset is done, if it is, then clear"error_recovering", or > > > >> else new a timer, check device state periodically until device > > > >> reset is done. (what if device reset don't end for a long time?) > > > >> - In qemu, translate guest link reset to host link reset. > > > >> A question here: we already have link reset in host, is a second > > > >> link reset necessary? why? > > > >> > > > >> [*] how to check device state: reading certain config space > > > >> register, check return value is valid or not(All F's) > > > > > > > > Isn't this exactly the path we were on previously? > > > > > > Yes, it is basically the previous path, plus the optimization. > > > > > > > There might be an > > > > optimization that we could skip back-to-back resets, but how can you > > > > necessarily infer that the resets are for the same thing? If the user > > > > accesses the device between resets, can you still guarantee the guest > > > > directed reset is unnecessary? If time passes between resets, do you > > > > know they're for the same event? How much time can pass between the > > > > host and guest reset to know they're for the same event? In the > > > > process of error handling, which is more important, speed or > > > > correctness? > > > > > > > > > > I think vfio driver itself won't know what each reset comes for, and I > > > don't quite understand why should vfio care this question, is this a new > > > question in the design? > > > > You're suggesting an optimization to eliminate one of the resets, > > and as we've discussed, I don't see removing the host induced reset > > as a viable option. That means you want to eliminate the guest > > directed reset. There are potentially three levels to do that, the > > vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it > > within the guest. My comments were directed to the first option, the > > host kernel level cannot correlate user directed resets as duplicates > > of host directed resets. > > > > > But I think it make sense that the user access during 2 resets maybe a > > > trouble for guest recovery, misbehaved user could be out of our > > > imagination. Correctness is more important. > > > > > > If I understand you right, let me make a summary: host recovery just > > > does link reset, which is incomplete, so we'd better do a complete guest > > > recovery for correctness. > > > > We don't know whether the host link reset is incomplete, but we can't do > > a link reset transparently to the device, the device is no longer in the > > same state after the reset. The device specific driver, which exists > > in userspace needs to be involved in device recovery. Therefore > > regardless of how QEMU handles the error, the driver within the guest > > needs to be notified and perform recovery. Since the device is PCI and > > we're on x86 and nobody wants to introduce paravirtual error recovery, > > we must use AER. Part of AER recovery includes the possibility of > > performing a link reset. So it seems this eliminates avoiding the link > > reset within the guest. > > > > That leaves QEMU. Here we need to decide whether a guest triggered > > link reset induces a host link reset. The current working theory is > > that yes, this must be the case. If there is ever a case where a > > driver within the guest could trigger a link reset for the purposes > > of error recovery when the host has not, I think this must be the > > case. Therefore, at least some guest induced link resets must become > > host link resets. Currently we assume all guest induced link resets > > become host link resets. Minimally to avoid that, QEMU would need to > > know (not assume) whether the host performed a link reset. Even with > > that, QEMU would need to be able to correlate that a link reset from > > the guest is a duplicate of a link reset that was already performed by > > the host. That implies that QEMU needs to deduce the intention of > > the guest. That seems like a complicated task for a patch series that > > is already complicated enough, especially for a feature of questionable > > value given the configuration restrictions (imo). > > > > I would much rather focus on getting it right and making it as simple > > as we can, even if that means links get reset one too many times on > > error. > > > > > >> 2. skip link reset in aer driver of host kernel, for vfio-pci. > > > >> Let user decide how to do serious recovery > > > >> > > > >> add new field "user_driver" in struct pci_dev, used to skip link > > > >> reset for vfio-pci; add new field "link_reset" in struct > > > >> vfio_pci_device to indicate link has been reset or not during > > > >> recovery > > > >> > > > >> - set user_driver in vfio_pci_probe(), to skip link reset for > > > >> vfio-pci in host. > > > >> - (use a flag)block user access(config, mmio) during host recovery > > > >> (not sure if this step is necessary) > > > >> - In qemu, translate guest link reset to host link reset. > > > >> - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET > > > >> is executed > > > >> - In vfio-pci driver's resume, new a timer, check "link_reset" field > > > >> periodically, if it is set in reasonable time, then clear it and > > > >> delete timer, or else, vfio-pci driver will does the link reset! > > > > > > > > What happens in the case of a multifunction device where each function > > > > is part of a separate IOMMU group and one function is hot-removed from > > > > the user? We can't do a link reset on that function since the other > > > > function is still in use. We have no choice but release a device in an > > > > unknown state back to the host. > > > > > > hot-remove from user, do you mean, for example, all functions assigned > > > to VM, then suddenly a person does something like following > > > > > > $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind > > > > > > $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind > > > > > > to return device to host driver, or don't bind it to host driver, let it > > > in driver-less state??? > > > > Yes, the host kernel has no visiblity to how a user is making use of > > devices. To support AER we require a similar topology between host and > > guest such that a guest link reset translates to a host reset. That > > requirement is imposed by userspace, ie. QEMU. The host kernel cannot > > presume that this is the case. > > So enforce this to enable recovery functionality. Why can't you? How? > > Therefore we could have a > > multi-function device where each function is assigned to the same or > > different users in any configuration. If a fault occurs and we defer > > to the user to perform the link reset, we have absolutely no guarantee > > that it will ever occur. If the functions are assigned to different > > users, then each user individually doesn't have the capability to > > perform a link reset. If the devices happen to be assigned to a single > > user when the error occurs, we cannot assume the user has an AER > > compatible configuration, the devices could be exposed as separate > > single function devices, any one of which might be individually removed > > from the user and made use of by the host, such as your sysfs example > > above. The host cannot perform a link reset in this case either > > as the sibling devices are still in use by the guest. Thanks, > > > > Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html