>>> We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem? Well I didn't notice this change explicitly for SRIOV, at least from the code it is quite a normal change >>> And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well. If you do the sequence like: load amdgpu and shutdown VM immediately , you will hit problems . Virtual Machine's shutdown won't be blocked if the IB test is on the fly , e.g.: you can do "init 0" right after "modprobe amdgpu" and the "amdgpu_device_fini" won't be called Thus the flushing on IB test won't happen so the IB test is still running with the VM already shutdown and lead to GPU crash (usually VCN/VCE crash on invalid memory address) _____________________________________ Monk Liu|GPU Virtualization Team |AMD -----Original Message----- From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> Sent: Monday, June 29, 2020 4:18 PM To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Subject: Re: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV Am 29.06.20 um 09:11 schrieb Monk Liu: > From: pengzhou <PengJu.Zhou@xxxxxxx> > > issue: > originally we kickoff IB test asynchronously with driver's init, thus > the IB test may still running when the driver loading done (modprobe amdgpu done). > if we shutdown VM immediately after amdgpu driver loaded then GPU may > hang because the IB test is still running > > fix: > make IB test synchronize with driver init thus it won't still running > when we shutdown the VM. We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem? And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well. > > Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 ++++++++++++++++++++++++----- > 1 file changed, 24 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 457f5d2..4f54660 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -3292,8 +3292,16 @@ int amdgpu_device_init(struct amdgpu_device *adev, > /* must succeed. */ > amdgpu_ras_resume(adev); > > - queue_delayed_work(system_wq, &adev->delayed_init_work, > + if (amdgpu_sriov_vf(adev)) { > + r = amdgpu_ib_ring_tests(adev); > + if (r) { > + DRM_ERROR("ib ring test failed (%d).\n", r); > + return r; > + } > + } else { > + queue_delayed_work(system_wq, &adev->delayed_init_work, > msecs_to_jiffies(AMDGPU_RESUME_MS)); > + } > > r = sysfs_create_files(&adev->dev->kobj, amdgpu_dev_attributes); > if (r) { > @@ -3329,7 +3337,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev) > int r; > > DRM_INFO("amdgpu: finishing device.\n"); > - flush_delayed_work(&adev->delayed_init_work); > + if (!amdgpu_sriov_vf(adev)) > + flush_delayed_work(&adev->delayed_init_work); You can drop this change, flushing a work which was never scheduled is harmless. > adev->shutdown = true; > > /* make sure IB test finished before entering exclusive mode @@ > -3425,7 +3434,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon) > if (fbcon) > amdgpu_fbdev_set_suspend(adev, 1); > > - cancel_delayed_work_sync(&adev->delayed_init_work); > + if (!amdgpu_sriov_vf(adev)) > + cancel_delayed_work_sync(&adev->delayed_init_work); > > if (!amdgpu_device_has_dc_support(adev)) { > /* turn off display hw */ > @@ -3528,8 +3538,16 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon) > if (r) > return r; > > - queue_delayed_work(system_wq, &adev->delayed_init_work, > + if (amdgpu_sriov_vf(adev)) { > + r = amdgpu_ib_ring_tests(adev); > + if (r) { > + DRM_ERROR("ib ring test failed (%d).\n", r); > + return r; > + } > + } else { > + queue_delayed_work(system_wq, &adev->delayed_init_work, > msecs_to_jiffies(AMDGPU_RESUME_MS)); > + } > > if (!amdgpu_device_has_dc_support(adev)) { > /* pin cursors */ > @@ -3554,7 +3572,8 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon) > return r; > > /* Make sure IB tests flushed */ > - flush_delayed_work(&adev->delayed_init_work); > + if (!amdgpu_sriov_vf(adev)) > + flush_delayed_work(&adev->delayed_init_work); > > /* blat the mode back in */ > if (fbcon) { _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx