The ring test of the first compute ring always fails and it shouldn't affect the GPU reset in any way. I can't tell if the deadlock issue is fixed, because the GPU reset usually fails with your patch. It always succeeded without your patch. Marek On Wed, Oct 9, 2013 at 1:09 PM, Christian König <deathsimple@xxxxxxxxxxx> wrote: > Mhm, that doesn't looks like anything related but more like the reset of the > compute ring didn't worked. > > How often does that happen? And do you still get the problem where X waits > for a fence that never comes back? > > Christian. > > Am 09.10.2013 12:36, schrieb Marek Olšák: > >> I'm afraid your patch sometimes causes the GPU reset to fail, which >> had never happened before IIRC. >> >> The dmesg log from the failure is attached. >> >> Marek >> >> On Tue, Oct 8, 2013 at 6:21 PM, Christian König <deathsimple@xxxxxxxxxxx> >> wrote: >>> >>> Hi Marek, >>> >>> please try the attached patch as a replacement for your signaling all >>> fences >>> patch. I'm not 100% sure if it fixes all issues, but it's at least a >>> start. >>> >>> Thanks, >>> Christian. >>> >>> Am 07.10.2013 13:08, schrieb Christian König: >>> >>>>> First of all, I can't complain about the reliability of the hardware >>>>> GPU reset. It's mostly the kernel driver that happens to run into a >>>>> deadlock at the same time. >>>> >>>> >>>> Alex and I spend quite some time on making this reliable again after >>>> activating more rings and adding VM support. The main problem is that I >>>> couldn't figure out where the CPU deadlock comes from, cause I couldn't >>>> reliable reproduce the issue. >>>> >>>> What is the content of /proc/<pid of X server>/task/*/stack and >>>> sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in >>>> the >>>> deadlock situation? >>>> >>>> I'm pretty sure that we nearly always have a problem when two threads >>>> are >>>> waiting for fences on one of them detects that we have a lockup while >>>> the >>>> other one keeps holding the exclusive lock. Signaling all fences might >>>> work >>>> around that problem, but it probably would be better to fix the >>>> underlying >>>> issue. >>>> >>>> Going to take a deeper look into it. >>>> >>>> Christian. >>>> >>>> Am 03.10.2013 02:45, schrieb Marek Olšák: >>>>> >>>>> First of all, I can't complain about the reliability of the hardware >>>>> GPU reset. It's mostly the kernel driver that happens to run into a >>>>> deadlock at the same time. >>>>> >>>>> Regarding the issue with fences, the problem is that the GPU reset >>>>> completes successfully according to dmesg, but X doesn't respond. I >>>>> can move the cursor on the screen, but I can't do anything else and >>>>> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can >>>>> easily reproduce this, because it's the most common reason why a GPU >>>>> lockup leads to frozen X. The GPU actually recovers, but X is hung. I >>>>> can't tell whether the fences are just not signalled or whether there >>>>> is actually a real CPU deadlock I can't see. >>>>> >>>>> This patch makes the problem go away and GPU resets are successful >>>>> (except for extreme cases, see below). With a small enough lockup >>>>> timeout, the lockups are just a minor annoyance and I thought I could >>>>> get through a piglit run just with a few tens or hundreds of GPU >>>>> resets... >>>>> >>>>> A different type of deadlock showed up, though it needs a lot of >>>>> concurrently-running apps like piglit. What happened is that the >>>>> kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due >>>>> to a GPU hang while holding onto the exclusive lock, and another >>>>> thread wanting to do the GPU reset was unable to acquire the lock. >>>>> >>>>> That said, I will use the patch locally, because it helps a lot. I got >>>>> a few lockups while writing this email and I'm glad I didn't have to >>>>> reboot. >>>>> >>>>> Marek >>>>> >>>>> On Wed, Oct 2, 2013 at 4:50 PM, Christian König >>>>> <deathsimple@xxxxxxxxxxx> >>>>> wrote: >>>>>> >>>>>> Possible, but I would rather guess that this doesn't work because the >>>>>> IB >>>>>> test runs into a deadlock situation and so the GPU reset never fully >>>>>> completes. >>>>>> >>>>>> Can you reproduce the problem? >>>>>> >>>>>> If you want to make GPU resets more reliable I would rather suggest to >>>>>> remove the ring lock dependency. >>>>>> Then we should try to give all the fence wait functions a (reliable) >>>>>> timeout and move reset handling a layer up into the ioctl functions. >>>>>> But for >>>>>> this you need to rip out the old PM code first. >>>>>> >>>>>> Christian. >>>>>> >>>>>> Marek Olšák <maraeo@xxxxxxxxx> schrieb: >>>>>> >>>>>>> I'm afraid signalling the fences with an IB test is not reliable. >>>>>>> >>>>>>> Marek >>>>>>> >>>>>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian König >>>>>>> <deathsimple@xxxxxxxxxxx> wrote: >>>>>>>> >>>>>>>> NAK, after recovering from a lockup the first thing we do is >>>>>>>> signalling all remaining fences with an IB test. >>>>>>>> >>>>>>>> If we don't recover we indeed signal all fences manually. >>>>>>>> >>>>>>>> Signalling all fences regardless of the outcome of the reset creates >>>>>>>> problems with both types of partial resets. >>>>>>>> >>>>>>>> Christian. >>>>>>>> >>>>>>>> Marek Olšák <maraeo@xxxxxxxxx> schrieb: >>>>>>>> >>>>>>>>> From: Marek Olšák <marek.olsak@xxxxxxx> >>>>>>>>> >>>>>>>>> After a lockup, fences are not signalled sometimes, causing >>>>>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results >>>>>>>>> in an X server freeze. >>>>>>>>> >>>>>>>>> This fixes only one of many deadlocks which can occur during a >>>>>>>>> lockup. >>>>>>>>> >>>>>>>>> Signed-off-by: Marek Olšák <marek.olsak@xxxxxxx> >>>>>>>>> --- >>>>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ >>>>>>>>> 1 file changed, 5 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>>> b/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>>> index 841d0e0..7b97baa 100644 >>>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device >>>>>>>>> *rdev) >>>>>>>>> radeon_save_bios_scratch_regs(rdev); >>>>>>>>> /* block TTM */ >>>>>>>>> resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev); >>>>>>>>> + >>>>>>>>> + mutex_lock(&rdev->ring_lock); >>>>>>>>> + radeon_fence_driver_force_completion(rdev); >>>>>>>>> + mutex_unlock(&rdev->ring_lock); >>>>>>>>> + >>>>>>>>> radeon_pm_suspend(rdev); >>>>>>>>> radeon_suspend(rdev); >>>>>>>>> >>>>>>>>> -- >>>>>>>>> 1.8.1.2 >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> dri-devel mailing list >>>>>>>>> dri-devel@xxxxxxxxxxxxxxxxxxxxx >>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel >>>> >>>> >>>> _______________________________________________ >>>> dri-devel mailing list >>>> dri-devel@xxxxxxxxxxxxxxxxxxxxx >>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel >>> >>> > _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/dri-devel