On 12/16/2016 06:01 AM, Alex Williamson wrote:
> On Thu, 15 Dec 2016 16:50:07 +0200
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>
>> On Thu, Dec 15, 2016 at 09:56:41PM +0800, Cao jin wrote:
>>>
>>> On 12/15/2016 06:16 AM, Alex Williamson wrote:
>>>> On Wed, 14 Dec 2016 18:24:23 +0800
>>>> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>>>
>>>>> Sorry for the late reply. After reading all your comments, I think
>>>>> I will try solution 1.
>>>>>
>>>>> On 12/13/2016 03:12 AM, Alex Williamson wrote:
>>>>>> On Mon, 12 Dec 2016 21:49:01 +0800
>>>>>> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Two solutions (high-level designs) came to me; please see whether
>>>>>>> they are acceptable, or which one is. I also have some questions.
>>>>>>>
>>>>>>> 1. Block guest access during host recovery.
>>>>>>>
>>>>>>>    Add a new field error_recovering in struct vfio_pci_device to
>>>>>>>    indicate host recovery status. The AER driver in the host will
>>>>>>>    still do the link reset.
>>>>>>>
>>>>>>>    - Set error_recovering in the vfio-pci driver's error_detected,
>>>>>>>      used to block all kinds of user access (config space, mmio).
>>>>>>>    - In order to solve the concurrency between device reset and
>>>>>>>      user access, check the device state[*] in the vfio-pci
>>>>>>>      driver's resume to see whether the device reset is done. If
>>>>>>>      it is, clear error_recovering; otherwise start a timer and
>>>>>>>      check the device state periodically until the reset is done.
>>>>>>>      (What if the device reset doesn't finish for a long time?)
>>>>>>>    - In QEMU, translate a guest link reset to a host link reset.
>>>>>>>      A question here: we already have a link reset in the host;
>>>>>>>      is a second link reset necessary? Why?
>>>>>>>
>>>>>>> [*] How to check the device state: read a certain config space
>>>>>>> register and check whether the returned value is valid or not
>>>>>>> (all F's).
>>>>>>
>>>>>> Isn't this exactly the path we were on previously?
>>>>>
>>>>> Yes, it is basically the previous path, plus the optimization.
>>>>>
>>>>>> There might be an optimization that we could skip back-to-back
>>>>>> resets, but how can you necessarily infer that the resets are for
>>>>>> the same thing? If the user accesses the device between resets,
>>>>>> can you still guarantee the guest directed reset is unnecessary?
>>>>>> If time passes between resets, do you know they're for the same
>>>>>> event? How much time can pass between the host and guest reset to
>>>>>> know they're for the same event? In the process of error handling,
>>>>>> which is more important, speed or correctness?
>>>>>
>>>>> I think the vfio driver itself won't know what each reset comes
>>>>> for, and I don't quite understand why vfio should care about this
>>>>> question. Is this a new question in the design?
>>>>
>>>> You're suggesting an optimization to eliminate one of the resets,
>>>> and as we've discussed, I don't see removing the host induced reset
>>>> as a viable option. That means you want to eliminate the guest
>>>> directed reset. There are potentially three levels to do that, the
>>>> vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate
>>>> it within the guest. My comments were directed to the first option,
>>>> the host kernel level cannot correlate user directed resets as
>>>> duplicates of host directed resets.
>>>
>>> Ah, maybe it was a mistake; I don't really want to eliminate the
>>> guest directed reset, I was just not sure why it is necessary.
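(A side note on the device-state check marked [*] above: what I have
in mind is roughly the sketch below. This is only an illustration of
the idea, the helper name is invented and it is not code from the real
vfio-pci driver; it just reads the vendor ID and treats all F's as
"device not back yet".)

#include <linux/pci.h>

/* Illustrative only: returns true once the device responds to config
 * space again after the link reset. A failed read, or a read that
 * returns all 1s, means the device is still unreachable.
 */
static bool vfio_pci_device_responding(struct pci_dev *pdev)
{
	u16 vendor;

	if (pci_read_config_word(pdev, PCI_VENDOR_ID, &vendor))
		return false;

	return vendor != 0xffff;
}

(The resume path would call this once; if it returns false, arm a
timer and re-check periodically, giving up after some bound.)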
>>>
>>> The optimization I mentioned is just to fully separate host recovery
>>> from guest recovery in time (with a timer that checks the device
>>> periodically), because device reset and user access are concurrent.
>>>
>>>>> But I think it makes sense that user access between the two resets
>>>>> may be trouble for guest recovery; a misbehaving user could be
>>>>> beyond our imagination. Correctness is more important.
>>>>>
>>>>> If I understand you right, let me make a summary: host recovery
>>>>> just does a link reset, which is incomplete, so we'd better do a
>>>>> complete guest recovery for correctness.
>>>>
>>>> We don't know whether the host link reset is incomplete, but we
>>>> can't do a link reset transparently to the device, the device is no
>>>> longer in the same state after the reset. The device specific
>>>> driver, which exists in userspace, needs to be involved in device
>>>> recovery. Therefore regardless of how QEMU handles the error, the
>>>> driver within the guest needs to be notified and perform recovery.
>>>> Since the device is PCI and we're on x86 and nobody wants to
>>>> introduce paravirtual error recovery, we must use AER. Part of AER
>>>> recovery includes the possibility of performing a link reset. So it
>>>> seems this eliminates avoiding the link reset within the guest.
>>>>
>>>> That leaves QEMU. Here we need to decide whether a guest triggered
>>>> link reset induces a host link reset. The current working theory is
>>>> that yes, this must be the case. If there is ever a case where a
>>>> driver within the guest could trigger a link reset for the purposes
>>>> of error recovery when the host has not, I think this must be the
>>>> case. Therefore, at least some guest induced link resets must
>>>> become host link resets. Currently we assume all guest induced link
>>>> resets become host link resets. Minimally to avoid that, QEMU would
>>>> need to know (not assume) whether the host performed a link reset.
>>>> Even with that, QEMU would need to be able to correlate that a link
>>>> reset from the guest is a duplicate of a link reset that was
>>>> already performed by the host. That implies that QEMU needs to
>>>> deduce the intention of the guest. That seems like a complicated
>>>> task for a patch series that is already complicated enough,
>>>> especially for a feature of questionable value given the
>>>> configuration restrictions (imo).
>>>>
>>>> I would much rather focus on getting it right and making it as
>>>> simple as we can, even if that means links get reset one too many
>>>> times on error.
>>>
>>> Thanks very much for your detailed explanation; it does help me
>>> understand your concern, and why a second link reset is necessary.
>>>
>>> I still want to share my thoughts with you (not to argue): now we
>>> know the host AER driver will do the link reset for vfio-pci first,
>>> so I can say that even if the fatal error is link related, the link
>>> works again after the host link reset. Then it isn't necessary for
>>> QEMU to translate the guest link reset to a host link reset; just
>>> use vfio_pci_reset() as it is to do a device reset (probably an FLR).
>>> This also means we don't need the following patch (which makes the
>>> code simpler):
>>>
>>> @@ -3120,6 +3122,18 @@ static void vfio_pci_reset(DeviceState *dev)
>>>
>>>      trace_vfio_pci_reset(vdev->vbasedev.name);
>>>
>>> +    if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
>>> +        PCIDevice *br = pci_bridge_get_device(pdev->bus);
>>> +
>>> +        if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
>>> +             PCI_BRIDGE_CTL_BUS_RESET)) {
>>> +            if (pci_get_function_0(pdev) == pdev) {
>>> +                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>> +            }
>>> +            return;
>>> +        }
>>> +    }
>>> +
>>>      vfio_pci_pre_reset(vdev);
>>>
>>> I think this also implies: we have a virtual link in QEMU, but a
>>> virtual link will never be broken like a physical link (in
>>> particular, we already know the host AER driver will surely do a
>>> link reset to recover the physical link). So the guest's link reset
>>> doesn't need to care whether the virtual link is reset, only about
>>> the virtual device. And QEMU "translating a guest link reset to a
>>> host link reset" seems to take the link-reset responsibility over
>>> from the host :)
>>>
>>>>>>> 2. Skip the link reset in the host kernel's AER driver for
>>>>>>>    vfio-pci; let the user decide how to do serious recovery.
>>>>>>>
>>>>>>>    Add a new field "user_driver" in struct pci_dev, used to skip
>>>>>>>    the link reset for vfio-pci; add a new field "link_reset" in
>>>>>>>    struct vfio_pci_device to indicate whether the link has been
>>>>>>>    reset during recovery.
>>>>>>>
>>>>>>>    - Set user_driver in vfio_pci_probe() to skip the link reset
>>>>>>>      for vfio-pci in the host.
>>>>>>>    - (Using a flag) block user access (config, mmio) during host
>>>>>>>      recovery (not sure if this step is necessary).
>>>>>>>    - In QEMU, translate a guest link reset to a host link reset.
>>>>>>>    - In the vfio-pci driver, set link_reset after
>>>>>>>      VFIO_DEVICE_PCI_HOT_RESET is executed.
>>>>>>>    - In the vfio-pci driver's resume, start a timer and check the
>>>>>>>      "link_reset" field periodically. If it is set within a
>>>>>>>      reasonable time, clear it and delete the timer; otherwise
>>>>>>>      the vfio-pci driver does the link reset itself!
>>>>>>
>>>>>> What happens in the case of a multifunction device where each
>>>>>> function is part of a separate IOMMU group and one function is
>>>>>> hot-removed from the user? We can't do a link reset on that
>>>>>> function since the other function is still in use. We have no
>>>>>> choice but release a device in an unknown state back to the host.
>>>>>
>>>>> Hot-remove from the user -- do you mean, for example, all functions
>>>>> are assigned to the VM, and then suddenly a person does something
>>>>> like the following:
>>>>>
>>>>> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
>>>>> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind
>>>>>
>>>>> to return the device to a host driver, or doesn't bind it to a host
>>>>> driver and leaves it in a driver-less state?
>>>>
>>>> Yes, the host kernel has no visibility into how a user is making use
>>>> of devices. To support AER we require a similar topology between
>>>> host and guest such that a guest link reset translates to a host
>>>> reset. That requirement is imposed by userspace, ie. QEMU. The host
>>>> kernel cannot presume that this is the case. Therefore we could
>>>> have a multi-function device where each function is assigned to the
>>>> same or different users in any configuration. If a fault occurs and
>>>> we defer to the user to perform the link reset, we have absolutely
>>>> no guarantee that it will ever occur. If the functions are assigned
>>>> to different users, then each user individually doesn't have the
>>>> capability to perform a link reset.
>>>> If the devices happen to be assigned to a single user when the
>>>> error occurs, we cannot assume the user has an AER compatible
>>>> configuration, the devices could be exposed as separate single
>>>> function devices, any one of which might be individually removed
>>>> from the user and made use of by the host, such as your sysfs
>>>> example above. The host cannot perform a link reset in this case
>>>> either as the sibling devices are still in use by the guest.
>>>> Thanks,
>>>>
>>>> Alex
>>>
>>> This explanation is valuable to me; so this is also why we can't do
>>> a link reset in the vfio driver when one of the functions is closed.
>>> And doing the link reset in the vfio driver only once all functions
>>> are closed is a poor and very complex solution (quarantining the
>>> device), as you said.
>>>
>>> I am going to try solution 1, but I still have a consideration to
>>> share with you. It won't stop my trial and isn't related to the
>>> discussion above, just FYI:
>>>
>>> In a non-virtualization environment, from the device's perspective,
>>> the steps of a normal recovery are:
>>>
>>>     error_detect
>>>     mmio_enabled
>>>     link_reset
>>>     slot_reset
>>>     resume
>>>
>>> Now in our condition, the steps become:
>>>
>>>     *link_reset* (the host's; the following are the guest's)
>>>     error_detect
>>>     mmio_enabled
>>>     link_reset
>>>     slot_reset
>>>     resume
>>>
>>> In particular, a device specific driver in the guest could do some
>>> specific work in error_detect; take igb_io_error_detected() for
>>> example. As the words in pci-error-recovery.txt say:
>>>
>>>     it gives the driver a chance to cleanup, waiting for pending
>>>     stuff (timers, whatever, etc...) to complete;
>>>
>>> But if link_reset is the first step, we lose all the state (register
>>> values, etc.) in the device. Of course I don't know whether this
>>> will be a problem (it might not); I am just curious whether this has
>>> been a concern of yours :)
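(For readers following along: the steps listed above correspond to the
callbacks of the kernel's struct pci_error_handlers. Below is a
minimal skeleton of how an endpoint driver wires them up; the "demo_"
names and trivial bodies are placeholders of mine, purely
illustrative, not taken from igb.)

#include <linux/pci.h>

/* error_detect step: stop I/O, clean up pending work (timers, etc.);
 * for a frozen channel, ask the core for a reset.
 */
static pci_ers_result_t demo_error_detected(struct pci_dev *pdev,
					    enum pci_channel_state state)
{
	return state == pci_channel_io_frozen ?
		PCI_ERS_RESULT_NEED_RESET : PCI_ERS_RESULT_CAN_RECOVER;
}

/* mmio_enabled step: MMIO works again, but DMA is still blocked. */
static pci_ers_result_t demo_mmio_enabled(struct pci_dev *pdev)
{
	return PCI_ERS_RESULT_RECOVERED;
}

/* slot_reset step: the device was reset; reinitialize it here. */
static pci_ers_result_t demo_slot_reset(struct pci_dev *pdev)
{
	return PCI_ERS_RESULT_RECOVERED;
}

/* resume step: recovery finished, restart normal I/O. */
static void demo_resume(struct pci_dev *pdev)
{
}

static const struct pci_error_handlers demo_err_handlers = {
	.error_detected	= demo_error_detected,
	.mmio_enabled	= demo_mmio_enabled,
	.slot_reset	= demo_slot_reset,
	.resume		= demo_resume,
};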
>>
>> You'll find I did mention it :)
>>
>> But consider Documentation/PCI/pcieaer-howto.txt:
>>
>> 3.2.2.2 Non-correctable (non-fatal and fatal) errors
>>
>> If an error message indicates a non-fatal error, performing link
>> reset at upstream is not required. The AER driver calls
>> error_detected(dev, pci_channel_io_normal) to all drivers associated
>> within a hierarchy in question. for example,
>> EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
>> If Upstream port A captures an AER error, the hierarchy consists of
>> Downstream port B and EndPoint.
>>
>> A driver may return PCI_ERS_RESULT_CAN_RECOVER,
>> PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
>> whether it can recover or the AER driver calls mmio_enabled as next.
>>
>> If an error message indicates a fatal error, kernel will broadcast
>> error_detected(dev, pci_channel_io_frozen) to all drivers within
>> a hierarchy in question. Then, performing link reset at upstream is
>> necessary.
>>
>> I think that if you just forward errors to guests they will get
>> confused. I see three possible approaches.
>>
>> 1. Always pretend to the guest that there was a fatal error, then
>>    basically:
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index dce511f..4022f9b 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -1299,7 +1299,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>>
>>  	vfio_device_put(device);
>>
>> -	return PCI_ERS_RESULT_CAN_RECOVER;
>> +	return PCI_ERS_RESULT_DISCONNECT;
>>  }
>>
>>  static const struct pci_error_handlers vfio_err_handlers = {
>>
>>    probably conditional on userspace invoking some ioctl to avoid
>>    breaking existing users.
>>
>> 2. Send non-fatal errors to the guest. Add another eventfd to
>>    distinguish non-fatal from fatal errors.
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index dce511f..e22f449 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -1292,14 +1292,17 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>>
>>  	mutex_lock(&vdev->igate);
>>
>> -	if (vdev->err_trigger)
>> +	if (state == pci_channel_io_normal && vdev->recover_trigger)
>> +		eventfd_signal(vdev->recover_trigger, 1);
>> +	else if (vdev->err_trigger)
>>  		eventfd_signal(vdev->err_trigger, 1);
>>
>>  	mutex_unlock(&vdev->igate);
>>
>>  	vfio_device_put(device);
>>
>>  	return PCI_ERS_RESULT_CAN_RECOVER;
>>  }
>>
>>  static const struct pci_error_handlers vfio_err_handlers = {
>>
>>    Forward non-fatal ones to the guest, stop the VM on fatal ones.
>>
>> 3. Forward both non-fatal and fatal errors to the guest.
>>    This includes 1 and 2 above, and:
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index dce511f..4022f9b 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -1299,7 +1299,8 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>>
>>  	vfio_device_put(device);
>>
>> -	return PCI_ERS_RESULT_CAN_RECOVER;
>> +	return state == pci_channel_io_normal ? PCI_ERS_RESULT_CAN_RECOVER :
>> +		PCI_ERS_RESULT_DISCONNECT;
>>  }
>>
>>  static const struct pci_error_handlers vfio_err_handlers = {
>>
>>    Maybe make this conditional on recover_trigger to keep
>>    compatibility.
>>
>> You seem to be starting from 1. But how about starting small, and
>> doing 2 as a first step? Fatal errors will still stop the VM. This
>> will help you merge a bunch of error reporting infrastructure without
>> worrying about recovery so much.
>>
>> Making some progress finally will be good.
>>
>> Alex, what do you think?
>
> Well, as you and I discussed on IRC, I'm concerned that none of us
> attempting to provide input or submit this code are really subject
> matter experts when it comes to AER. We're all stumbling around
> trying to do what we think is right, but we're just taking stabs at
> how we assume drivers behave and what really needs to happen.
> Personally I'm also not convinced that the current complexity
> involved and configuration restrictions that the path we're taking
> implies to the VM and usage of devices makes for a compelling
> feature. At what point do we determine that none of this is
> worthwhile?

I am not sure I understood your thoughts accurately; it seems you
still have some concerns about the current configuration restrictions?

FYI: http://www.vmware.com/pdf/vsp_4_vmdirectpath_host.pdf
chapter: Problems with Device Assignment Dependencies

It seems VMware also adopts basically the same configuration
restrictions as us; they just don't mention AER or the topology in the
guest, but I guess they have the same restrictions. So these
configuration restrictions weren't invented by us, they are an
existing practice. Do we still need to worry about them?

> My priorities are:
> * We need to do the right thing for the guest, I don't think we
>   should be presuming that different reset types are equivalent,
>   leaving gaps where we expect the guest/host to do a reset and don't
>   follow through on other reset requests, and we need to notify the
>   guest immediately for the error.
> * We need to do the right thing for the host, that means we should
>   not give the user the opportunity to leave a device in a state
>   where we haven't at least performed a bus reset on link error (this
>   may be our current state and if so we should fix it).
>
> I do like the idea of taking a step back and trying to determine if
> there are some errors that we can handle incrementally from the
> existing behavior, but I also don't want to get into a state where a
> QEMU user needs to specify what degree of error handling they want
> and follow different configuration rules for each type. If the
> feature isn't usable because the configuration requirements are too
> restrictive or too confusing, perhaps this is not a worthwhile
> feature. Thanks,
>
> Alex

I got your concern. So, have we reached agreement on MST's approach 2?
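(If we do go with approach 2, my understanding is that QEMU would
register the second eventfd the same way it registers the existing
err_trigger, via VFIO_DEVICE_SET_IRQS. A rough userspace sketch of
that idea follows; VFIO_PCI_RECOVER_IRQ_INDEX is a name I made up for
the index such a series would have to add to the vfio uapi, everything
else is the existing interface.)

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical: a new index next to VFIO_PCI_ERR_IRQ_INDEX; 5 is
 * simply the next unused slot in today's enum.
 */
#define VFIO_PCI_RECOVER_IRQ_INDEX 5

/* Returns an eventfd the host kernel would signal on a non-fatal
 * error, or -1 on failure. QEMU would poll it and inject a non-fatal
 * AER into the guest.
 */
static int register_recover_notifier(int device_fd)
{
	struct vfio_irq_set *irq_set;
	int argsz = sizeof(*irq_set) + sizeof(int32_t);
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;

	irq_set = calloc(1, argsz);
	if (!irq_set) {
		close(efd);
		return -1;
	}
	irq_set->argsz = argsz;
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
			 VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_PCI_RECOVER_IRQ_INDEX;
	irq_set->start = 0;
	irq_set->count = 1;
	*(int32_t *)&irq_set->data = efd;

	if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
		free(irq_set);
		close(efd);
		return -1;
	}

	free(irq_set);
	return efd;
}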
--
Sincerely,
Cao jin