[PATCH 001/121] drm/amdgpu: retry init if it fails due to exclusive mode timeout (v3)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



The exclusive mode has real-time limitation in reality, such like being
done in 300ms. It's easy observed if running many VF/VMs in single host
with heavy CPU workload.

If we find the init fails due to exclusive mode timeout, try it again.

v2:
 - rewrite the condition for readable value.

v3:
 - fix typo, add comments for sleep

Acked-by: Alex Deucher <alexander.deucher at amd.com>
Signed-off-by: pding <Pixel.Ding at amd.com>
Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
Signed-off-by: Gary Sun <Gary.Sun at amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   10 ++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |   15 +++++++++++++--
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 125f77d..385b10e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2303,6 +2303,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 
 	r = amdgpu_init(adev);
 	if (r) {
+		/* failed in exclusive mode due to timeout */
+		if (amdgpu_sriov_vf(adev) &&
+		    !amdgpu_sriov_runtime(adev) &&
+		    amdgpu_virt_mmio_blocked(adev) &&
+		    !amdgpu_virt_wait_reset(adev)) {
+			dev_err(adev->dev, "VF exclusive mode timeout\n");
+			r = -EAGAIN;
+			goto failed;
+		}
 		dev_err(adev->dev, "amdgpu_init failed\n");
 		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_AMDGPU_INIT_FAIL, 0, 0);
 		amdgpu_fini(adev);
@@ -2390,6 +2399,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	amdgpu_vf_error_trans_all(adev);
 	if (runtime)
 		vga_switcheroo_fini_domain_pm_ops(adev->dev);
+
 	return r;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 720139e..f313eee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -86,7 +86,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
 int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
 {
 	struct amdgpu_device *adev;
-	int r, acpi_status;
+	int r, acpi_status, retry = 0;
 
 #ifdef CONFIG_DRM_AMDGPU_SI
 	if (!amdgpu_si_support) {
@@ -122,6 +122,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
 		}
 	}
 #endif
+retry_init:
 
 	adev = kzalloc(sizeof(struct amdgpu_device), GFP_KERNEL);
 	if (adev == NULL) {
@@ -144,7 +145,17 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
 	 * VRAM allocation
 	 */
 	r = amdgpu_device_init(adev, dev, dev->pdev, flags);
-	if (r) {
+	if (r == -EAGAIN && ++retry <= 3) {
+		adev->virt.caps &= ~AMDGPU_SRIOV_CAPS_RUNTIME;
+		adev->virt.ops = NULL;
+		amdgpu_device_fini(adev);
+		kfree(adev);
+		dev->dev_private = NULL;
+		/* Don't request EX mode too frequently which is attacking */
+		msleep(5000);
+		dev_err(&dev->pdev->dev, "retry init %d\n", retry);
+		goto retry_init;
+	} else if (r) {
 		dev_err(&dev->pdev->dev, "Fatal error during GPU init\n");
 		goto out;
 	}
-- 
1.7.1


Regards,
Gary


-----Original Message-----
From: Koenig, Christian 
Sent: Tuesday, November 07, 2017 3:48 PM
To: Sun, Gary <Gary.Sun at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu:remove debugfs file in amdgpu_device_finish

Hi Gary,

not sure what driver re-initialize feature you are talking about, but the last time I tried to re-initialize the driver it deadlocks in the modeset code because of some DC problem.

It's probably a good idea to fix that first, but in general please explain further what are you working on.

Regards,
Christian.

Am 07.11.2017 um 08:23 schrieb Sun, Gary:
> Hi Christian,
>
> The patch is for driver re- initialize  feature, not for driver exit or rmmod.  When the driver initialize failed at some point, the re- initialize  feature will do some little clean and then try to initialize driver again, then it will re-register some registered debugfs , so it will fail.
>
> Regards,
> Gary
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Monday, November 06, 2017 5:26 PM
> To: Sun, Gary <Gary.Sun at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu:remove debugfs file in 
> amdgpu_device_finish
>
> Am 06.11.2017 um 10:20 schrieb Gary Sun:
>> remove debugfs file in amdgpu_device_finish
> NAK, the debugfs files are removed automatically by drm_debugfs_cleanup().
>
> So that patch is unnecessary.
>
> Regards,
> Christian.
>
>> Signed-off-by: Gary Sun <Gary.Sun at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |    1 +
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   18 ++++++++++++++++++
>>    2 files changed, 19 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 4f919d3..6cfcb5f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1250,6 +1250,7 @@ struct amdgpu_debugfs {
>>    int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
>>    			     const struct drm_info_list *files,
>>    			     unsigned nfiles);
>> +int amdgpu_debugfs_cleanup_files(struct amdgpu_device *adev);
>>    int amdgpu_debugfs_fence_init(struct amdgpu_device *adev);
>>    
>>    #if defined(CONFIG_DEBUG_FS)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 7b7439f..ee800ab 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2520,6 +2520,7 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>    	amdgpu_doorbell_fini(adev);
>>    	amdgpu_pm_sysfs_fini(adev);
>>    	amdgpu_debugfs_regs_cleanup(adev);
>> +	amdgpu_debugfs_cleanup_files(adev);
>>    }
>>    
>>    
>> @@ -3304,6 +3305,23 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
>>    	return 0;
>>    }
>>    
>> +int amdgpu_debugfs_cleanup_files(struct amdgpu_device *adev) {
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < adev->debugfs_count; i++) { #if
>> +defined(CONFIG_DEBUG_FS)
>> +		drm_debugfs_remove_files(adev->debugfs[i].files,
>> +				adev->debugfs[i].num_files,
>> +				adev->ddev->primary);
>> +#endif
>> +		adev->debugfs[i].files = NUL;
>> +		adev->debugfs[i].num_files = 0;
>> +	}
>> +	adev->debugfs_count = 0;
>> +	return 0;
>> +}
>> +
>>    #if defined(CONFIG_DEBUG_FS)
>>    
>>    static ssize_t amdgpu_debugfs_regs_read(struct file *f, char 
>> __user *buf,
>



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux