[AMD Public Use] On Fri, Aug 7, 2020 at 1:59 PM Li, Dennis <Dennis.Li@xxxxxxx> wrote: > > [AMD Public Use] > > Hi, Daniel, > Thanks your review. And I also understand your concern. I guess you missed the description of this issue, so I paste it again in the below, and explain why this issue happens. > > For example, in a XGMI system with 2 GPU devices whose device entity is named adev. And each adev has a separated reset_sem. > // device init function > void device_init(adev) { > ... > init_rwsem(&adev->reset_sem); > ... > } > > XGMI system have two GPU devices, so system will call device_init twice. However the definition of init_rwsem macro has a limitation which use a local static lock_class_key to initialize rw_sem, which cause each adev->reset_sem share the same lock_class_key. > > #define init_rwsem(sem) \ > do { \ > static struct lock_class_key __key; \ > \ > __init_rwsem((sem), #sem, &__key); \ > } while (0) > > And when GPU hang, we should do gpu recovery for all devices in the hive. Therefore we should lock each adev->reset_sem to protect GPU from be accessed by other threads during recovery. The detailed recovery sequence as the following: > // Step1: lock all GPU firstly: > for each adev of GPU0 or GPU1 { > down_write(adev->reset_sem); > do_suspend(adev); > } > > // Step2: > do_hive_recovery(hive); > > // Step3: resume and unlock GPU > for each adev of GPU0 or GPU1 { > do_resume(adev) > up_write(adev->reset_sem); > } > > Each adev->reset_sem has the same lock_class_key, and lockdep will take them as the same rw_semaphore object. Therefore in step1, when lock GPU1, lockdep will pop the below warning. > > I have considered your proposal (using _nest_lock() ) before. Just as you said, _nest_lock() will hide true positive recursive locking. So I gave up it in the end. > > After reviewing codes of lockdep, I found the lockdep support dynamic_key, so using separated lock_class_key has no any side effect. In fact, init_rwsem also use dynamic_key. Please see the call sequence of init_rwsem and lockdep_set_class as the below: > 1) init_rwsem -> __init_rwsem -> lockdep_init_map; > 2) lockdep_set_class -> lockdep_init_map; > > Finally we go back to your concern, you maybe worry this change will cause the below dead-lock can't be detected. In fact, lockdep still is able to detect the issue as circular locking dependency, but there is no warning "recursive locking " in this case. > Thread A: down_write(adev->reset_sem) for GPU 0 -> down_write(adev->reset_sem) for GPU 1 -> ... -> up_write(adev->reset_sem) for GPU 1 -> up_write(adev->reset_sem) for GPU 0 > Thread B: down_write(adev->reset_sem) for GPU 1 -> > down_write(adev->reset_sem) for GPU 0 -> ... -> > up_write(adev->reset_sem) for GPU 0 -> up_write(adev->reset_sem) for > GPU 1 > > But lockdep still can detect recursive locking for this case: > Thread A: down_write(adev->reset_sem) for GPU 0 -> ...-> ...-> > down_write(adev->reset_sem) for GPU 0 Yeah, I guessed correctly what you're doing. My recommendation to use down_write_nest_lock still stands. This is for reset only, kernel-wide reset lock really shouldn't hurt. Or make it a lock per xgmi hive, I'm assuming that information is known somewhere. The problem with manual lockdep annotations is that they increase complexity. You have to keep all the annotations in mind, including justifcations and which parts they still catch and which parts they don't catch. And there's zero performance justification for a gpu reset path to create some fancy lockdep special cases. Locking needs to be 1. maintainable, i.e. every time you need to write a multi-paragraph explanation like the above it's probably not. This obviously includes correctness, but it's even more important that people can easily understand what you're doing. 2. fast enough, where it matters. gpu reset just doesn't. [Dennis Li] Yeah. I strongly agree that manual lockdep annotation will increase complexity. However my patch isn't for lockdep annotation, and in fact it is to fix a bug. According to design of lockdep, every lock object should has a separated lock_class_key which is an unified ID of lock object in lockdep. I still take the previous sample codes for example: we define two rw_sem: a and b, and use case1 and case2 to init them. In my opinion, case1 and case2 should have same behavior, but in fact, they are different, because case2 use init_rwsem macro wrongly. Therefore we should lockdep_set_class to fix this issue, such as case3. Case 1: init_rwsem(&a); init_rwsem(&b); Case2: void init_rwsem_ext(rw_sem* sem) { init_rwsem(sem); } init_rwsem_ext(&a); init_rwsem_ext(&b); Case3: void init_rwsem_ext(rw_sem* sem, lock_class_key *key) { init_rwsem(sem); lockdep_set_class(sem, key); } init_rwsem_ext(&a, a_key); init_rwsem_ext(&b, b_key); > [ 584.110304] ============================================ > [ 584.110590] WARNING: possible recursive locking detected > [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE > [ 584.111164] -------------------------------------------- > [ 584.111456] kworker/38:1/553 is trying to acquire lock: > [ 584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] > but task is already holding lock: > [ 584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] > other info that might help us debug this: > [ 584.113689] Possible unsafe locking scenario: > > [ 584.114350] CPU0 > [ 584.114685] ---- > [ 584.115014] lock(&adev->reset_sem); > [ 584.115349] lock(&adev->reset_sem); > [ 584.115678] > *** DEADLOCK *** > > [ 584.116624] May be due to missing lock nesting notation > > [ 584.117284] 4 locks held by kworker/38:1/553: > [ 584.117616] #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, > at: process_one_work+0x21f/0x630 [ 584.117967] #1: ffffac708e1c3e58 > ((work_completion)(&con->recovery_work)){+.+.}, at: > process_one_work+0x21f/0x630 [ 584.118358] #2: ffffffffc1c2a5d0 > (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 > [amdgpu] [ 584.118786] #3: ffff9b1603d247a0 > (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 > [amdgpu] Uh, so this means this rwsem is in the critical path for dma-fence signalling. Please make sure this is all correctly primed at driver load (like we do alredy for dma_resv/fence and drm_modeset_lock) when lockdep is enabled. Otherwise pretty much guaranteed you'll get this wrong somewhere. -Daniel > > Best Regards > Dennis Li > -----Original Message----- > From: Daniel Vetter <daniel@xxxxxxxx> > Sent: Friday, August 7, 2020 5:45 PM > To: Koenig, Christian <Christian.Koenig@xxxxxxx> > Cc: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; > Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix > <Felix.Kuehling@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx> > Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive > locking > > On Fri, Aug 7, 2020 at 11:34 AM Christian König <christian.koenig@xxxxxxx> wrote: > > > > Am 07.08.20 um 11:23 schrieb Daniel Vetter: > > > On Fri, Aug 7, 2020 at 11:20 AM Daniel Vetter <daniel@xxxxxxxx> wrote: > > >> On Fri, Aug 7, 2020 at 11:08 AM Christian König > > >> <christian.koenig@xxxxxxx> wrote: > > >>> [SNIP] > > >>>>>>> What we should do instead is to make sure we have only a single lock for the complete hive instead. > > >>>>>>> [Dennis Li] If we use a single lock, users will must wait for all devices resuming successfully, but in fact their tasks are only running in device a. It is not friendly to users. > > >>>>>> Well that is intentional I would say. We can only restart submissions after all devices are resumed successfully, cause otherwise it can happen that a job on device A depends on memory on device B which isn't accessible. > > >>>>>> [Dennis Li] Yes, you are right. Driver have make sure that the shared resources(such as the shard memory) are ready before unlock the lock of adev one by one. But each device still has private hardware resources such as video and display. > > >>>>> Yeah, but those are negligible and we have a team working on display support for XGMI. So this will sooner or later become a problem as well. > > >>>>> > > >>>>> I still think that a single rwsem for the whole hive is still the best option here. > > >>>>> > > >>>>> [Dennis Li] For your above concern, we can open a new thread to discuss it. If we make a decision to use a single after discussing, we can create another patch for it. > > >>>> That's a really good argument, but I still hesitate to merge this patch. > > >>>> How severe is the lockdep splat? > > >>>> > > >>>> At bare minimum we need a "/* TODO: ...." comment why we do this and how to remove the workaround again. > > >>>> [Dennis Li] It is not a workaround. According to design of lockdep, each rw_semaphore should has a separated lock_class_key. I have explained that init_rwsem has the above limitation, so we must correct it. > > >>> Yeah, but this explanation is not really sufficient. Again this > > >>> is not an limitation of init_rwsem, but intentional behavior. > > >>> > > >>> In other words in most cases lockdep should splat if you use rw > > >>> semaphores like this. > > >>> > > >>> That we avoid this by locking multiple amdgpu device always in > > >>> the same order is rather questionable driver design. > > >> Yeah don't give multiple locking classes like that, that's fairly > > >> terrible. > > > > Ok, that is exactly the answer I feared we would get. > > > > >> The right fix is mutex_lock_nest_lock, which we're using in > > >> ww_mutex or in which is also used in mm_take_all_locks. > > > > Ah! Enlightenment! So in this case we should use down_write_nested() > > in this special space and all is well? > > Nope. There's two kinds of nesting that lockdep supports: > 1. _nested() variants, where you supply an integer parameter specifying the lockdep subclass. That's like setting a locking subclass at lock creating time, except much worse because you change the lockdep class at runtime. This can lead to stuff like: > > struct mutex A; > mutex_init(&A); > mutex_lock(&A); > mutex_lock_nested(&A, SINGLE_DEPTH_NESTING); > > which is a very clear deadlock, but lockdep will think it's all fine. > Don't use that. > > 2. _nest_lock() variants, where you specify a super-lock, which makes sure that all sub-locks cannot be acquired concurrently (or you promise there's some other trick to avoid deadlocks). Lockdep checks that you've acquired the super-lock, and then allows arbitary nesting. > This is what ww_mutex uses, and what mm_take_all_locks uses. If the super-lock is an actual mutex, then this is actually deadlock free. > With ww_mutex it's a fake lockdep_map in the ww_acquire_ctx, but we guarantee deadlock-freeness through the ww algorithm. The fake super lock for ww_mutex is acquired when you get the ticket. > > No 2 is the thing you want, not no 1. > > > >> > > >> If you solve the locking ordering by sorting all the locks you > > >> need (ugh), then you can also just use a fake lockdep_map for > > >> your special function only. > > >> > > >> Also in general the idea of an rwsem in conjunction with xgmi > > >> cross device memory handling just plain freaks me out :-) Please > > >> make sure you prime this lock against dma_resv and dma_fence (and > > >> maybe also drm_modeset_lock if this lock is outside of dma_resv), > > >> and ideally this should only ever be used for setup stuff when > > >> loading drivers. I guess that's why you went with a per-device > > >> rwsem. If it's really only for setup then just do the simple > > >> thing of having an xgmi_hive_reconfigure lock, which all the > > >> rwsems nest within, and no fancy lock ordering or other tricks. > > > > Well this is for the group reset of all devices in a hive and you > > need to reboot to change the order the device are in the list. > > Ok, that sounds save enough. I'd go with a xgmi_hive_lock then, and > down_write_nest_lock(&adev->xgmi_rwsem, &xgmi_hive_lock); Taking the > global lock for device reset shouldn't be a problem, we're not going > to reset multiple devices in parallel anyway. Or at least I don't hope > that use-case is a perf critical benchmark for you guys :-) > > Cheers, Daniel > > > Regards, > > Christian. > > > > >> > > >>>> Core network driver (net/core/dev.c) has the similar use case with us. > > >>> Just took a look at that, but this is completely different use > > >>> case. The lockdep classes are grouped by driver type to note the > > >>> difference in the network stack. > > >>> > > >>> A quick search showed that no other device driver is doing this > > >>> in the kernel, so I'm not sure if such a behavior wouldn't be > > >>> rejected. Adding Daniel to comment on this as well. > > >>> > > >>> Felix had some really good arguments to make that an urgent fix, > > >>> so either we come up with some real rework of this or at least > > >>> add a big > > >>> "/* TODO....*/". > > >> Aside from "I have questions" I don't think there's any reason to > > >> no fix this correctly with mutex_lock_nest_lock. Really shouldn't > > >> be a semantic change from what I'm understand here. > > > Also one more with my drm maintainer hat on, as a heads-up: > > > Dave&me looked at i915 gem code a bit too much, and we're appalled > > > at the massive over-use of lockdep class hacks and dubious nesting > > > annotations. Expect crack down on anyone who tries to play clever > > > tricks here, we have a few too many already :-) > > > > > > So no mutex_lock_nested (totally different beast from > > > mutex_lock_nest_lock, and really unsafe since it's a runtime > > > change of the lockdep class) and also not any lockdep_set_class > > > trickery or anything like that. We need locking hierarchies which > > > mere humans can comprehend and review. > > > > > > Cheers, Daniel > > > > > >> Cheers, Daniel > > >> > > >>> Regards, > > >>> Christian. > > >>> > > >>>> Regards, > > >>>> Christian. > > >>>> > > >> > > >> -- > > >> Daniel Vetter > > >> Software Engineer, Intel Corporation > > >> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2F > > >> bl > > >> og.ffwll.ch%2F&data=02%7C01%7CDennis.Li%40amd.com%7C604aa31bd > > >> 86 > > >> 4459226ee08d83ab684dd%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0% > > >> 7C > > >> 637323902882042827&sdata=DieZpl%2BIJ73STTkfn9UXDXmtssPXKKbtPs > > >> ut > > >> BRp%2FS%2BE%3D&reserved=0 > > > > > > > > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog. > ffwll.ch%2F&data=02%7C01%7CDennis.Li%40amd.com%7Cc9876eefce8e4bb96 > 31008d83acd7273%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637324001 > 349319231&sdata=%2BFmRWSamXiOE4HgK8VcTfFEINf31DqMqA5DjClsRkY4%3D&a > mp;reserved=0 -- Daniel Vetter Software Engineer, Intel Corporation https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&data=02%7C01%7CDennis.Li%40amd.com%7Cc9876eefce8e4bb9631008d83acd7273%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637324001349319231&sdata=%2BFmRWSamXiOE4HgK8VcTfFEINf31DqMqA5DjClsRkY4%3D&reserved=0 _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx