Re: [PATCH v4 4/8] drm/amdgpu: Fix consecutive DPC recovery failures.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 9/2/20 6:23 PM, Bjorn Helgaas wrote:
On Wed, Sep 02, 2020 at 02:42:06PM -0400, Andrey Grodzovsky wrote:
Cache the PCI state on boot and before each case where we might
loose it.
s/loose/lose/

v2: Add pci_restore_state while caching the PCI state to avoid
breaking PCI core logic for stuff like suspend/resume.

v3: Extract pci_restore_state from amdgpu_device_cache_pci_state
to avoid superflous restores during GPU resets and suspend/resumes.

v4: Style fixes.
Is the DRM convention to keep the v2/v3/v4 stuff in the commit log?  I
keep those below the "---" or manually remove them for PCI, but use
the local convention, of course.

+	/* Have stored pci confspace at hand for restore in sudden PCI error */
I assume that at least from the perspective of this code, all PCI
errors are "sudden".  Or if they're not, I'm curious about which would
be sudden and which would not.


Yes, all PCI errors are sudden from drivers point of view indeed, I am just stressing the fact that i have to 'plan ahead' and store working PCI config because to my understating by the time when we are notified of the PCI error in err_detected callback the device and it's PCI confspace registers are already inaccessible  and so it's to late to try and save the confspace at this time.



+	if (amdgpu_device_cache_pci_state(adev->pdev))
+		pci_restore_state(pdev);
+bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	r = pci_save_state(pdev);
+	if (!r) {
+		kfree(adev->pci_state);
+
+		adev->pci_state = pci_store_saved_state(pdev);
+
+		if (!adev->pci_state) {
+			DRM_ERROR("Failed to store PCI saved state");
+			return false;
+		}
+	} else {
+		DRM_WARN("Failed to save PCI state, err:%d\n", r);
+		return false;
+	}
+
+	return true;
+}
+
+bool amdgpu_device_load_pci_state(struct pci_dev *pdev)
+{
+	struct drm_device *dev = pci_get_drvdata(pdev);
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int r;
+
+	if (!adev->pci_state)
+		return false;
+
+	r = pci_load_saved_state(pdev, adev->pci_state);
I'm a little bit hesitant to pci_load_saved_state() and
pci_store_saved_state() being used here, simply because they're
currently only used by VFIO, Xen, and nvme.  So I don't have a real
objection, but just pointing out that apparently you're doing
something really special that isn't commonly used and tested, so it's
more likely to be broken or incomplete.


See my assumption above, I assume I have to explicitly save (cache) the PCI state while the device is fully operational since I assume it's not possible when PCI error occurs and DPC isolates the device. I do spot PCI core calling pci_dev_save_and_disable and pci_dev_restore as part of pcie_do_recovery->pci_reset_bus and those functions do store and restore the PCI confpsace but I am not sure this code executes in all cases and pcie_do_recovery->pci_reset_bus
was just added now by last sathyanarayanan's patch he gave me anyway...

Andrey



There's lots of state that the PCI core *can't* save/restore, and
pci_load_saved_state() doesn't even handle all the architected PCI
state, i.e., we only call pci_add_cap_save_buffer() or
pci_add_ext_cap_save_buffer() for a few of the capabilities.



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux