Re: Lockdep splat on killing a process


 



On 28.10.21 at 19:26, Andrey Grodzovsky wrote:

On 2021-10-27 3:58 p.m., Andrey Grodzovsky wrote:

On 2021-10-27 10:50 a.m., Christian König wrote:
On 27.10.21 at 16:47, Andrey Grodzovsky wrote:

On 2021-10-27 10:34 a.m., Christian König wrote:
On 27.10.21 at 16:27, Andrey Grodzovsky wrote:
[SNIP]

Let me please know if I am still missing some point of yours.

Well, I mean we need to be able to handle this for all drivers.


For sure, but as I said above, in my opinion we only need to change this for those drivers that don't use the _locked version.

And that absolutely won't work.

See, the dma_fence is a contract between drivers, so you need the same calling convention across all drivers.

Either we always call the callback with the lock held or we always call it without the lock, but calling it sometimes with the lock and sometimes without won't work.

Christian.


I am not sure I fully understand what problems this will cause, but anyway, then we are back to irq_work. We cannot embed irq_work as a union within dma_fence's cb_list because it is already reused as the timestamp and as the rcu head after the fence is signaled. So I will do it within drm_scheduler, with a single irq_work per drm_sched_entity, as we discussed before.

That won't work either. We free up the entity after the cleanup function. That's the reason we use the callback on the job in the first place.


Yep, missed it.



We could overload the cb structure in the job though.


I guess so, since no one else is using this member after the cb has executed.

Andrey
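
As an illustration only (this is not the attached patch, and every type and function name below is a hypothetical userspace stand-in rather than a kernel API), the idea of overloading the job's cb structure can be sketched as a union whose storage is reused for deferred work once the fence callback has fired:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct fence_cb {
    void (*func)(struct fence_cb *cb);
};

struct deferred_work {
    void (*func)(struct deferred_work *work);
    int pending;
};

struct job {
    int cleaned_up;
    /* Once the fence callback has run, nobody touches finish_cb again,
     * so the same storage can hold the deferred work item. */
    union {
        struct fence_cb finish_cb;
        struct deferred_work work;
    };
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Runs later, outside the context in which the fence callback was called. */
static void job_cleanup_work(struct deferred_work *work)
{
    struct job *job = container_of(work, struct job, work);
    job->cleaned_up = 1;
}

/* Fence callback: does no cleanup itself, only queues the deferred work
 * by overwriting the now-unused callback storage. */
static void job_finish_cb(struct fence_cb *cb)
{
    struct job *job = container_of(cb, struct job, finish_cb);
    job->work.func = job_cleanup_work;
    job->work.pending = 1;
}

/* Stand-in for whatever mechanism runs the deferred work. */
static void run_pending_work(struct job *job)
{
    if (job->work.pending) {
        assert(job->work.func != NULL);
        job->work.pending = 0;
        job->work.func(&job->work);
    }
}

/* Returns 1 if cleanup happened only after the deferred work ran. */
int demo_job_lifecycle(void)
{
    struct job job = { .cleaned_up = 0 };

    job.finish_cb.func = job_finish_cb;
    job.finish_cb.func(&job.finish_cb);   /* fence "signals" */
    if (job.cleaned_up)
        return 0;                         /* cleanup must be deferred */

    run_pending_work(&job);
    return job.cleaned_up;
}
```

The work item lives in the job rather than in the entity, so in this sketch it stays valid for as long as the job does.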


Attached a patch. Give it a try please. I tested it on my side and tried to generate the right conditions to trigger this code path by repeatedly submitting commands while issuing a GPU reset to stop the scheduler, and then killing the command submission process in the middle. But for some reason it looks like the job_queue was always already empty at the time of the entity kill.

It was trivial to trigger with the stress utility I've hacked together:

amdgpu_stress -b v 1g -b g 1g -c 1 2 1g 1k

Then, while it is copying, just press Ctrl+C to kill it.

The patch itself is:

Tested-by: Christian König <christian.koenig@xxxxxxx>
Reviewed-by: Christian König <christian.koenig@xxxxxxx>

Thanks,
Christian.


Andrey





Christian.


Andrey




Andrey





