On Mon, Jan 20, 2014 at 09:49:24AM +0000, Chris Wilson wrote:
> On Sun, Jan 19, 2014 at 10:55:26PM +0100, Daniel Vetter wrote:
> > On Sun, Jan 19, 2014 at 10:20 PM, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> > > On older generations (gen2, gen3) the GPU requires fences for many
> > > operations, such as blits. The display hardware also requires fences for
> > > scanouts and this leads to a situation where an arbitrary number of
> > > fences may be pinned by old scanouts following a pageflip but before we
> > > have executed the unpin workqueue. This is unpredictable by userspace
> > > and leads to random EDEADLK when submitting an otherwise benign
> > > execbuffer. However, we can detect when we have an outstanding flip and
> > > so cause userspace to wait upon their completion before finally
> > > declaring that the system is starved of fences. This is really no worse
> > > than forcing the GPU to stall waiting for older execbuffer to retire and
> > > release their fences before we can reallocate them for the next
> > > execbuffer.
> > >
> > > Reported-and-tested-by: dimon@xxxxxxx
> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73696
> > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >
> > New subtest for kms_flip which submits such a blt buffer while a
> > pageflip is still pending?
>
> Correct.
>
> > Also there's a certain chance we'll starve
> > the unpin work, similar to the issues around flushing the unpin work
> > in our pageflip implementation.
>
> If you mean that we will never run the unpin workqueue, that's what the
> implementation will fix, eventually, after a busy-spin in userspace since
> set_need_resched() was removed. I can teach userspace to yield() after
> an EAGAIN which seems a reasonable compromise (userspace gets a bonus
> for being cooperative rather than penalized for using up its timeslice.)

yield won't help, we need to block on the work-queue draining like we do
in the pageflip code with flush_workqueue.
At least we've had bug reports in the past where someone found it
intriguing to run his entire userspace with rt prio, which ended up
starving the sched_normal workqueue and so livelocked the entire system.

Instead of busy-looping through userspace with -EAGAIN I think we should
keep all the unpin work items on a spinlock-protected list and
synchronously unpin the buffers in the get_fence and evict_something
paths (once the flip has completed, we've removed the unpin entry from
the list and dropped the spinlock, ofc).

The only downside is that we have a notch more complexity since we need
to manually check for gpu hangs and bail out correctly if there is one.
Which means another kms_flip subtest, but that shouldn't be too much
fuss with the combinatorial testflags we already have.

Since we don't have a test where rt threads starve our workers for the
normal pageflip code I think we can eschew that part here, too. I'll add
it to the i-g-t wishlist though for a rainy afternoon ;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
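
For illustration only, the spinlock-protected list suggested above might
be modelled in plain userspace C roughly as follows. This is not i915
code and not a proposed patch: every name here (struct buffer,
struct unpin_entry, unpin_completed_flips, ...) is hypothetical, a
pthread mutex stands in for the kernel spinlock, and both the wait for
still-pending flips and the gpu hang check are left out. It only shows
the ordering: detach a completed entry under the lock, drop the lock,
then unpin.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an old scanout buffer that still pins a fence. */
struct buffer {
    int pin_count;
};

/* Hypothetical: one entry per pageflip whose old buffer awaits unpinning. */
struct unpin_entry {
    struct buffer *old_fb;
    bool flip_done;              /* set once the flip has completed */
    struct unpin_entry *next;
};

/* The mutex is an assumption standing in for the kernel spinlock. */
struct unpin_list {
    pthread_mutex_t lock;
    struct unpin_entry *head;
};

void unpin_list_init(struct unpin_list *l)
{
    pthread_mutex_init(&l->lock, NULL);
    l->head = NULL;
}

/* Queue an entry at flip time instead of scheduling a work item. */
void unpin_list_add(struct unpin_list *l, struct unpin_entry *e)
{
    pthread_mutex_lock(&l->lock);
    e->next = l->head;
    l->head = e;
    pthread_mutex_unlock(&l->lock);
}

/* Detach one entry whose flip has completed; NULL if there is none. */
static struct unpin_entry *pop_completed(struct unpin_list *l)
{
    struct unpin_entry **p, *e = NULL;

    pthread_mutex_lock(&l->lock);
    for (p = &l->head; *p; p = &(*p)->next) {
        if ((*p)->flip_done) {
            e = *p;
            *p = e->next;        /* unlink while holding the lock */
            e->next = NULL;
            break;
        }
    }
    pthread_mutex_unlock(&l->lock);
    return e;
}

/*
 * Would be called from the fence-allocation and eviction paths in this
 * model. Unpins every buffer whose flip is done and returns how many
 * were released; the unpin itself runs after the lock has been dropped.
 */
int unpin_completed_flips(struct unpin_list *l)
{
    struct unpin_entry *e;
    int released = 0;

    while ((e = pop_completed(l)) != NULL) {
        e->old_fb->pin_count--;
        released++;
    }
    return released;
}
```

In this model a caller that runs out of fences would call
unpin_completed_flips() first and only report starvation if nothing was
released; entries whose flip is still pending stay on the list.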