Hi Gary,

well, that patch is nonsense to begin with.

amdgpu_device_init() does quite a bunch of other initialization which is not cleaned up by amdgpu_device_fini(), so the debugfs files are only the tip of the iceberg here.

Please revert 2316518efc459928ad1d3d2d3511ea5fbda19475 and then we can try again from scratch.

What we need to do is return -EAGAIN from amdgpu_driver_load_kms(). Then in amdgpu_pci_probe() we can catch that error and call drm_dev_register() multiple times if necessary.

This way we can also optionally pci_disable_device() / pci_enable_device() between tries if appropriate.

Regards,
Christian.

On 07.11.2017 at 09:02, Sun, Gary wrote:
> Hi Christian,
>
> The feature is for GPU virtualization and has been checked in, you can refer to the following patch or commit 75b126427778218b36cfb68637e4f8d0e584b8ef.
>
> From 2316518efc459928ad1d3d2d3511ea5fbda19475 Mon Sep 17 00:00:00 2001
> From: pding <Pixel.Ding at amd.com>
> Date: Mon, 23 Oct 2017 17:22:09 +0800
> Subject: [PATCH 001/121] drm/amdgpu: retry init if it fails due to exclusive mode timeout (v3)
>
> The exclusive mode has real-time limitation in reality, such like being
> done in 300ms. It's easy observed if running many VF/VMs in single host
> with heavy CPU workload.
>
> If we find the init fails due to exclusive mode timeout, try it again.
>
> v2:
> - rewrite the condition for readable value.
>
> v3:
> - fix typo, add comments for sleep
>
> Acked-by: Alex Deucher <alexander.deucher at amd.com>
> Signed-off-by: pding <Pixel.Ding at amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> Signed-off-by: Gary Sun <Gary.Sun at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 15 +++++++++++++--
>  2 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 125f77d..385b10e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2303,6 +2303,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>
>  	r = amdgpu_init(adev);
>  	if (r) {
> +		/* failed in exclusive mode due to timeout */
> +		if (amdgpu_sriov_vf(adev) &&
> +		    !amdgpu_sriov_runtime(adev) &&
> +		    amdgpu_virt_mmio_blocked(adev) &&
> +		    !amdgpu_virt_wait_reset(adev)) {
> +			dev_err(adev->dev, "VF exclusive mode timeout\n");
> +			r = -EAGAIN;
> +			goto failed;
> +		}
>  		dev_err(adev->dev, "amdgpu_init failed\n");
>  		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_AMDGPU_INIT_FAIL, 0, 0);
>  		amdgpu_fini(adev);
> @@ -2390,6 +2399,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	amdgpu_vf_error_trans_all(adev);
>  	if (runtime)
>  		vga_switcheroo_fini_domain_pm_ops(adev->dev);
> +
>  	return r;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 720139e..f313eee 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -86,7 +86,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
>  int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>  {
>  	struct amdgpu_device *adev;
> -	int r, acpi_status;
> +	int r, acpi_status, retry = 0;
>
>  #ifdef CONFIG_DRM_AMDGPU_SI
>  	if (!amdgpu_si_support) {
> @@ -122,6 +122,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>  		}
>  	}
>  #endif
> +retry_init:
>
>  	adev = kzalloc(sizeof(struct amdgpu_device), GFP_KERNEL);
>  	if (adev == NULL) {
> @@ -144,7 +145,17 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>  	 * VRAM allocation
>  	 */
>  	r = amdgpu_device_init(adev, dev, dev->pdev, flags);
> -	if (r) {
> +	if (r == -EAGAIN && ++retry <= 3) {
> +		adev->virt.caps &= ~AMDGPU_SRIOV_CAPS_RUNTIME;
> +		adev->virt.ops = NULL;
> +		amdgpu_device_fini(adev);
> +		kfree(adev);
> +		dev->dev_private = NULL;
> +		/* Don't request EX mode too frequently which is attacking */
> +		msleep(5000);
> +		dev_err(&dev->pdev->dev, "retry init %d\n", retry);
> +		goto retry_init;
> +	} else if (r) {
>  		dev_err(&dev->pdev->dev, "Fatal error during GPU init\n");
>  		goto out;
>  	}
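
For illustration only, the probe-side retry suggested at the top of this mail could look roughly like the sketch below. It is an untested userspace mock-up, not driver code: struct fake_dev, fake_register(), fake_pci_disable(), fake_pci_enable() and probe_with_retry() are all hypothetical stand-ins for the device, drm_dev_register(), pci_disable_device(), pci_enable_device() and the loop in amdgpu_pci_probe(). The max_tries limit is an assumption as well, and the delay between tries (the patch above sleeps 5000 ms) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for a device; it only simulates init timeouts. */
struct fake_dev {
	int timeouts_left;	/* simulated exclusive mode timeouts still to come */
	int enabled;		/* 1 while the fake PCI function is enabled */
	int registered;		/* 1 once registration has succeeded */
};

/* Stand-in for drm_dev_register(): returns -EAGAIN while timeouts remain. */
static int fake_register(struct fake_dev *d)
{
	if (d->timeouts_left > 0) {
		d->timeouts_left--;
		return -EAGAIN;
	}
	d->registered = 1;
	return 0;
}

/* Stand-ins for pci_disable_device() / pci_enable_device(). */
static void fake_pci_disable(struct fake_dev *d) { d->enabled = 0; }
static void fake_pci_enable(struct fake_dev *d) { d->enabled = 1; }

/*
 * Probe-side retry: only -EAGAIN is retried, any other result is
 * returned unchanged. A real driver would also wait between tries.
 */
static int probe_with_retry(struct fake_dev *d, int max_tries)
{
	int r = -EAGAIN;
	int attempt;

	assert(max_tries >= 1);
	fake_pci_enable(d);

	for (attempt = 1; attempt <= max_tries; attempt++) {
		r = fake_register(d);
		if (r != -EAGAIN)
			break;

		fprintf(stderr, "init timed out, try %d of %d\n",
			attempt, max_tries);

		/* Optional: cycle the PCI function between tries. */
		fake_pci_disable(d);
		fake_pci_enable(d);
	}

	/* Leave the function disabled if registration never succeeded. */
	if (r)
		fake_pci_disable(d);

	return r;
}
```

The point of keeping the loop in the probe path is that each try starts from a freshly enabled device, instead of relying on amdgpu_device_fini() to undo a half-finished amdgpu_device_init().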