Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

On 12.08.20 at 17:07, Felix Kuehling wrote:
On 2020-08-12 at 4:53 a.m., Christian König wrote:
On 12.08.20 at 03:19, Li, Dennis wrote:
[AMD Official Use Only - Internal Distribution Only]

Hi, Felix,

Re: It may be better to fix it the other way around in
amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
reservation. Otherwise you will never be able to take the reset_sem
while any BOs are reserved. That's probably going to cause you other
problems later.
[Dennis Li] Thanks for finding the potential issue; I will change it
in version 2.
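For reference, the reordering could look roughly like this (an
untested fragment of amdgpu_amdkfd_alloc_gtt_mem, error handling
trimmed, with reset_sem taken strictly inside the reservation):

	r = amdgpu_bo_reserve(bo, true);
	if (r)
		return r;

	down_read(&adev->reset_sem);		/* inside the reservation */
	r = amdgpu_ttm_alloc_gart(&bo->tbo);	/* GART bind flushes HDP/TLB */
	up_read(&adev->reset_sem);

	amdgpu_bo_unreserve(bo);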

Re: That makes me wonder, why do you need the reset_sem in
amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
hardware access in that function. Is it for amdgpu_ttm_alloc_gart
updating the GART table through HDP?
[Dennis Li] Yes, amdgpu_gart_bind will flush the HDP and TLB. I had
considered protecting only amdgpu_ttm_alloc_gart before.
That access is irrelevant and the lock should be removed or changed
into a trylock.

See, we need the HDP flush only because the hardware could have
accessed the data before.

But after a GPU reset the HDP is known to be clean, so this doesn't
need any protection.
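Expressed as code, the trylock variant could look like this (a rough
sketch only; if a reset currently holds the semaphore, the flush is
simply skipped, because the HDP is known to be clean after the reset):

	if (down_read_trylock(&adev->reset_sem)) {
		/* flush only needed when no reset is in progress */
		amdgpu_asic_flush_hdp(adev, NULL);
		up_read(&adev->reset_sem);
	}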

But I worry that other functions will access the hardware in the
future. Therefore I chose an aggressive approach which locks reset_sem
at the beginning of the amdgpu driver's entry functions.
This is not a good idea. We used to have such a global lock before and
removed it because it caused all kinds of problems.
In most cases it's a global read-lock, so most of the time it should not
impact concurrency. The only "writer" that blocks all readers is the GPU
reset.
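Schematically, that is the usual rwsem split (illustrative only):

	/* entry functions: many concurrent readers */
	down_read(&adev->reset_sem);
	/* ... access the hardware ... */
	up_read(&adev->reset_sem);

	/* GPU reset: the single writer, drains all readers first */
	down_write(&adev->reset_sem);
	/* ... perform the reset ... */
	up_write(&adev->reset_sem);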


When was this added? Looks like it slipped under my radar or I wasn't
awake enough at that moment.
The change that added the reset_sem added read-locks in about 70 places.
I'm still concerned that this will be hard to maintain, to make sure
future HW access will do the necessary locking. I guess that's the
motivation for doing the locking more coarse-grained. Taking the lock
higher up the call chains means fewer places that take the lock and new
low-level code may not need its own locking in the future because it's
already covered by higher-level callers.

On the other hand, it needs to be low enough in the call chains to avoid
recursive locking or circular dependencies with other locks.

Well, the whole approach is a NAK. We already tried this and it didn't work at all.

See how much effort it took to remove the global reset rwsem again in the past.

Only a minimal set of functions access the hardware directly in a way that can run concurrently with a GPU reset.

Everything else works through ring buffers where the processing is stopped before we reset the GPU.

So the whole thing is just a bad idea as far as I can see.

Regards,
Christian.


Regards,
   Felix


Christian.

Best Regards
Dennis Li
-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Tuesday, August 11, 2020 9:57 PM
To: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking
<Hawking.Zhang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
dependency

On 2020-08-11 at 5:32 a.m., Dennis Li wrote:
[  653.902305] ======================================================
[  653.902928] WARNING: possible circular locking dependency detected
[  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
[  653.904098] ------------------------------------------------------
[  653.904675] amdgpu_test/3975 is trying to acquire lock:
[  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at: amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
[  653.905953]
                 but task is already holding lock:
[  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
[  653.907694]
                 which lock already depends on the new lock.

[  653.909423]
                 the existing dependency chain (in reverse order) is:
[  653.910594]
                 -> #1 (reservation_ww_class_mutex){+.+.}:
[  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
[  653.912350]        ww_mutex_lock+0x73/0x80
[  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
[  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
[  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
[  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
[  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
[  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
[  653.916959]        local_pci_probe+0x47/0xa0
[  653.917570]        work_for_cpu_fn+0x1a/0x30
[  653.918184]        process_one_work+0x29e/0x630
[  653.918803]        worker_thread+0x22b/0x3f0
[  653.919427]        kthread+0x12f/0x150
[  653.920047]        ret_from_fork+0x3a/0x50
[  653.920661]
                 -> #0 (&adev->reset_sem){.+.+}:
[  653.921893]        __lock_acquire+0x13ec/0x16e0
[  653.922531]        lock_acquire+0xb8/0x1c0
[  653.923174]        down_read+0x48/0x230
[  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
[  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
[  653.925283]        drm_ioctl+0x389/0x450 [drm]
[  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[  653.926686]        ksys_ioctl+0x98/0xb0
[  653.927357]        __x64_sys_ioctl+0x1a/0x20
[  653.928030]        do_syscall_64+0x5f/0x250
[  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  653.929373]
                 other info that might help us debug this:

[  653.931356]  Possible unsafe locking scenario:

[  653.932647]        CPU0                    CPU1
[  653.933287]        ----                    ----
[  653.933911]   lock(reservation_ww_class_mutex);
[  653.934530]                                lock(&adev->reset_sem);
[  653.935154]                                lock(reservation_ww_class_mutex);
[  653.935766]   lock(&adev->reset_sem);
[  653.936360]
                  *** DEADLOCK ***

[  653.938028] 2 locks held by amdgpu_test/3975:
[  653.938574]  #0: ffffb2a862d6bcd0 (reservation_ww_class_acquire){+.+.}, at: amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu]
[  653.939233]  #1: ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]

Change the locking order of reservation_ww_class_mutex and
adev->reset_sem in amdgpu_gem_va_ioctl to match the one in
amdgpu_amdkfd_alloc_gtt_mem, to avoid a potential deadlock.
It may be better to fix it the other way around in
amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
reservation. Otherwise you will never be able to take the reset_sem
while any BOs are reserved. That's probably going to cause you other
problems later.

That makes me wonder, why do you need the reset_sem in
amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
hardware access in that function. Is it for amdgpu_ttm_alloc_gart
updating the GART table through HDP?

Regards,
    Felix


Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index ee1e8fff83b2..fc889c477696 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		abo = NULL;
 	}
 
+	down_read(&adev->reset_sem);
+
 	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
 
 	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);
@@ -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		bo_va = NULL;
 	}
 
-	down_read(&adev->reset_sem);
-
 	switch (args->operation) {
 	case AMDGPU_VA_OP_MAP:
 		va_flags = amdgpu_gem_va_map_flags(adev, args->flags);
@@ -701,12 +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
 					args->operation);
 
-	up_read(&adev->reset_sem);
-
 error_backoff:
 	ttm_eu_backoff_reservation(&ticket, &list);
 
 error_unref:
+	up_read(&adev->reset_sem);
 	drm_gem_object_put_unlocked(gobj);
 	return r;
 }
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




