Hi Christian,

I gave your suggestion a quick try. It also works and is cleaner. I will send a new patch to revise retry_init. Please help review it later.

--
Sincerely Yours,
Pixel

On 08/11/2017, 10:40 AM, "Ding, Pixel" <Pixel.Ding at amd.com> wrote:

>Hi Christian,
>
>The retry_init only handles the failure caused by an exclusive mode timeout. It checks the MMIO to see whether there was an exclusive mode timeout and retries init if there was; otherwise it just returns the error.
>
>For the exclusive timeout case, the host layer issues an FLR on this VF, so the driver needn't clean up hardware status; amdgpu_device_fini here is only used to clean up the software state.
>
>It's tested and proved working correctly. Although the debugfs files are only the tip of the iceberg, it's the only issue we found in this version of retry_init.
>
>--
>Sincerely Yours,
>Pixel
>
>
>On 07/11/2017, 5:56 PM, "Koenig, Christian" <Christian.Koenig at amd.com> wrote:
>
>>Hi Gary,
>>
>>well that patch is nonsense to begin with.
>>
>>amdgpu_device_init() does quite a bunch of other initialization which is
>>not cleaned up by amdgpu_device_fini(), so the debugfs files are only
>>the tip of the iceberg here.
>>
>>Please revert 2316518efc459928ad1d3d2d3511ea5fbda19475 and then we can
>>try again from scratch.
>>
>>What we need to do is return -EAGAIN from amdgpu_driver_load_kms. Then
>>in amdgpu_pci_probe() we can catch that error and call
>>drm_dev_register() multiple times if necessary.
>>
>>This way we can also optionally pci_disable_device() /
>>pci_enable_device() between tries if appropriate.
>>
>>Regards,
>>Christian.
>>
>>On 07.11.2017 at 09:02, Sun, Gary wrote:
>>> Hi Christian,
>>>
>>> The feature is for GPU virtualization and has been checked in; you can refer to the following patch or commit 75b126427778218b36cfb68637e4f8d0e584b8ef.
>>>
>>> From 2316518efc459928ad1d3d2d3511ea5fbda19475 Mon Sep 17 00:00:00 2001
>>> From: pding <Pixel.Ding at amd.com>
>>> Date: Mon, 23 Oct 2017 17:22:09 +0800
>>> Subject: [PATCH 001/121] drm/amdgpu: retry init if it fails due to exclusive mode timeout (v3)
>>>
>>> The exclusive mode has real-time limitation in reality, such like being
>>> done in 300ms. It's easy observed if running many VF/VMs in single host
>>> with heavy CPU workload.
>>>
>>> If we find the init fails due to exclusive mode timeout, try it again.
>>>
>>> v2:
>>>  - rewrite the condition for readable value.
>>>
>>> v3:
>>>  - fix typo, add comments for sleep
>>>
>>> Acked-by: Alex Deucher <alexander.deucher at amd.com>
>>> Signed-off-by: pding <Pixel.Ding at amd.com>
>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>> Signed-off-by: Gary Sun <Gary.Sun at amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++++
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 15 +++++++++++++--
>>>  2 files changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 125f77d..385b10e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2303,6 +2303,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>
>>>  	r = amdgpu_init(adev);
>>>  	if (r) {
>>> +		/* failed in exclusive mode due to timeout */
>>> +		if (amdgpu_sriov_vf(adev) &&
>>> +		    !amdgpu_sriov_runtime(adev) &&
>>> +		    amdgpu_virt_mmio_blocked(adev) &&
>>> +		    !amdgpu_virt_wait_reset(adev)) {
>>> +			dev_err(adev->dev, "VF exclusive mode timeout\n");
>>> +			r = -EAGAIN;
>>> +			goto failed;
>>> +		}
>>>  		dev_err(adev->dev, "amdgpu_init failed\n");
>>>  		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_AMDGPU_INIT_FAIL, 0, 0);
>>>  		amdgpu_fini(adev);
>>> @@ -2390,6 +2399,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>  	amdgpu_vf_error_trans_all(adev);
>>>  	if (runtime)
>>>  		vga_switcheroo_fini_domain_pm_ops(adev->dev);
>>> +
>>>  	return r;
>>>  }
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> index 720139e..f313eee 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> @@ -86,7 +86,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
>>>  int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>>  {
>>>  	struct amdgpu_device *adev;
>>> -	int r, acpi_status;
>>> +	int r, acpi_status, retry = 0;
>>>
>>>  #ifdef CONFIG_DRM_AMDGPU_SI
>>>  	if (!amdgpu_si_support) {
>>> @@ -122,6 +122,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>>  		}
>>>  	}
>>>  #endif
>>> +retry_init:
>>>
>>>  	adev = kzalloc(sizeof(struct amdgpu_device), GFP_KERNEL);
>>>  	if (adev == NULL) {
>>> @@ -144,7 +145,17 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>>  	 * VRAM allocation
>>>  	 */
>>>  	r = amdgpu_device_init(adev, dev, dev->pdev, flags);
>>> -	if (r) {
>>> +	if (r == -EAGAIN && ++retry <= 3) {
>>> +		adev->virt.caps &= ~AMDGPU_SRIOV_CAPS_RUNTIME;
>>> +		adev->virt.ops = NULL;
>>> +		amdgpu_device_fini(adev);
>>> +		kfree(adev);
>>> +		dev->dev_private = NULL;
>>> +		/* Don't request EX mode too frequently which is attacking */
>>> +		msleep(5000);
>>> +		dev_err(&dev->pdev->dev, "retry init %d\n", retry);
>>> +		goto retry_init;
>>> +	} else if (r) {
>>>  		dev_err(&dev->pdev->dev, "Fatal error during GPU init\n");
>>>  		goto out;
>>>  	}
>>
>>
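
For readers following the thread, the control flow Christian proposes (the load path only reports -EAGAIN, and the probe path decides whether to try again) can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the revised patch: struct fake_dev, fake_load(), fake_reset() and fake_probe() are hypothetical stand-ins, not amdgpu or DRM functions, and the 5 second delay from the original patch is left out so the model runs instantly.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for device state; not a real amdgpu structure. */
struct fake_dev {
	int timeouts_left;	/* load attempts that will still time out */
	int load_calls;		/* load attempts made so far */
	int reset_calls;	/* resets performed between attempts */
};

/*
 * Stand-in for the load path: it never loops itself, it only reports
 * -EAGAIN while the simulated exclusive mode timeout persists.
 */
static int fake_load(struct fake_dev *dev)
{
	dev->load_calls++;
	if (dev->timeouts_left > 0) {
		dev->timeouts_left--;
		return -EAGAIN;
	}
	return 0;
}

/* Stand-in for an optional device disable/enable cycle between tries. */
static void fake_reset(struct fake_dev *dev)
{
	dev->reset_calls++;
}

/*
 * Stand-in for the probe path: catch -EAGAIN and retry up to
 * max_retries times. Any other result is returned immediately.
 */
static int fake_probe(struct fake_dev *dev, int max_retries)
{
	int r, retry = 0;

	assert(max_retries >= 0);
	for (;;) {
		r = fake_load(dev);
		if (r != -EAGAIN || retry >= max_retries)
			return r;
		retry++;
		fake_reset(dev);
		/* A real driver would likely wait here before retrying. */
		fprintf(stderr, "retry init %d\n", retry);
	}
}
```

With max_retries set to 3 this makes at most four load attempts, which matches the `++retry <= 3` bound in the quoted patch. The difference is that the retry decision sits in the caller, so the load path has no partially initialized state to unwind on its own.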