Re: [RFC PATCH v1 08/16] drm/radeon: use common fence implementation for fences

Christian König <deathsimple@xxxxxxxxxxx> · Thu, 15 May 2014 17:48:05 +0200

Am 15.05.2014 16:18, schrieb Maarten Lankhorst:
op 15-05-14 15:19, Christian König schreef:
Am 15.05.2014 15:04, schrieb Maarten Lankhorst:
op 15-05-14 11:42, Christian König schreef:
Am 15.05.2014 11:38, schrieb Maarten Lankhorst:
op 15-05-14 11:21, Christian König schreef:
Am 15.05.2014 03:06, schrieb Maarten Lankhorst:
op 14-05-14 17:29, Christian König schreef:
+    /* did fence get signaled after we enabled the sw irq? */
+    if 
(atomic64_read(&fence->rdev->fence_drv[fence->ring].last_seq) 
>= fence->seq) {
+        radeon_irq_kms_sw_irq_put(fence->rdev, fence->ring);
+        return false;
+    }
+
+    fence->fence_wake.flags = 0;
+    fence->fence_wake.private = NULL;
+    fence->fence_wake.func = radeon_fence_check_signaled;
+ __add_wait_queue(&fence->rdev->fence_queue, 
&fence->fence_wake);
+    fence_get(f);
That looks like a race condition to me. The fence needs to be 
added to the wait queue before the check, not after.

Apart from that the whole approach looks like a really bad idea 
to me. How for example is lockup detection supposed to happen 
with this? 
It's not a race condition because fence_queue.lock is held when 
this function is called.
Ah, I see. That's also the reason why you moved the wake_up_all 
out of the processing function.
Correct. :-)
Lockup's a bit of a weird problem, the changes wouldn't allow 
core ttm code to handle the lockup any more,
but any driver specific wait code would still handle this. I did 
this by design, because in future patches the wait
function may be called from outside of the radeon driver. The 
official wait function takes a timeout parameter,
so lockups wouldn't be fatal if the timeout is set to something 
like 30*HZ for example, it would still return
and report that the function timed out.
Timeouts help with the detection of the lockup, but not at all 
with the handling of them.

What we essentially need is a wait callback into the driver that 
is called in non atomic context without any locks held.

This way we can block for the fence to become signaled with a 
timeout and can then also initiate the reset handling if necessary.

The way you designed the interface now means that the driver 
never gets a chance to wait for the hardware to become idle and 
so never has the opportunity to the reset the whole thing.
You could set up a hangcheck timer like intel does, and end up 
with a reliable hangcheck detection that doesn't depend on cpu 
waits. :-) Or override the default wait function and restore the 
old behavior.

Overriding the default wait function sounds better, please 
implement it this way.

Thanks,
Christian. 

Does this modification look sane?
Adding the timeout is on my todo list for quite some time as well, so 
this part makes sense.

+static long __radeon_fence_wait(struct fence *f, bool intr, long 
timeout)
+{
+    struct radeon_fence *fence = to_radeon_fence(f);
+    u64 target_seq[RADEON_NUM_RINGS] = {};
+
+    target_seq[fence->ring] = fence->seq;
+    return radeon_fence_wait_seq_timeout(fence->rdev, target_seq, 
intr, timeout);
+}
When this call is comming from outside the radeon driver you need to 
lock rdev->exclusive_lock here to make sure not to interfere with a 
possible reset.
Ah thanks, I'll add that.

     .get_timeline_name = radeon_fence_get_timeline_name,
     .enable_signaling = radeon_fence_enable_signaling,
     .signaled = __radeon_fence_signaled,
Do we still need those callback when we implemented the wait callback?
.get_timeline_name is used for debugging (trace events).
.signaled is the non-blocking call to check if the fence is signaled 
or not.
.enable_signaling is used for adding callbacks upon fence completion, 
the default 'fence_default_wait' uses it, so
when it works no separate implementation is needed unless you want to 
do more than just waiting.
It's also used when fence_add_callback is called. i915 can be patched 
to use it. ;-)

I just meant enable_signaling, the other ones are fine with me. The 
problem with enable_signaling is that it's called with a spin lock held, 
so we can't sleep.

While resetting the GPU could be moved out into a timer the problem here 
is that I can't lock rdev->exclusive_lock in such situations.

This means when i915 would call into radeon to enable signaling for a 
fence we can't make sure that there is not GPU reset running on another 
CPU. And touching the IRQ registers while a reset is going on is a 
really good recipe to lockup the whole system.

Christian.

~Maarten

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel