On Wednesday, February 3, 2016 1:24:49 PM EST Christian König wrote: > Am 31.01.2016 um 19:50 schrieb Matthew Dawson: > > On Sunday, January 24, 2016 10:49:00 AM EST Christian König wrote: > >> Am 24.01.2016 um 07:18 schrieb Matthew Dawson: > >>> When the radeon driver resets a gpu, it attempts to test whether all the > >>> rings can successfully handle an IB. If these rings fail to respond, > >>> the > >>> process will wait forever. Another gpu reset can't happen at this > >>> point, > >>> as the current reset holds a lock required to do so. Instead, make all > >>> the IB tests run with a timeout, so the system can attempt to recover > >>> in this case. > >>> > >>> While this doesn't fix the underlying issue with card resets failing, it > >>> gives the system a higher chance of recovering. These timeouts have > >>> been > >>> confirmed to help both a Tathi and Hawaii card recover after a gpu > >>> reset. > >>> > >>> I haven't been able to test this on other cards, but everything > >>> compiles. > >>> I wasn't sure what to use for a timeout value, so for now I've used > >>> radeon_lockup_timeout. When that is 0, the timeout functionality in > >>> this > >>> patch is also disabled. > >>> > >>> This also adds a new function, radeon_fence_wait_timeout, that behaves > >>> like > >>> radeon_fence_wait except allowing a different timeout value to be passed > >>> to > >>> radeon_fence_wait_seq_timeout. > >>> > >>> Signed-off-by: Matthew Dawson <matthew@xxxxxxxxxxxxx> > >> > >> Really good idea, but please don't use radeon_lockup_timeout for this > >> cause the 10 seconds default of that is way to long. Instead just use a > >> fixed, let's say 100ms timeout defined in radeon.h. > > > > I originally tried 100ms, but I found that to be too short (my system > > would > > just lockup completely (no network even)). I found 1s is long enough, and > > isn't so long to be annoying. Would that be ok? > > Yeah, 1s works as well. But that 100ms shouldn't be enough sounds like > another bug to me. Probably, but it has a tendency to crash my entire system, so I'm not sure how to debug it further. Netconsole maybe? I probably won't be able to dig into this due to time since 1s does work. > > >> Also don't define our own radeon_fence_wait_timeout, just use > >> fence_wait_timeout(&fence->base, RADEON_IB_TEST_TIMEOUT) for this. > > > > I switched everything over to this, however the ib test always times out > > after a gpu reset (on startup, everything is normal). I'm not sure why > > fence_wait_timeout isn't being signaled while > > radeon_fence_wait_seq_timeout > > is. I'm suspicious that either radeon_fence_enable_signaling is not doing > > something necessary to get the fence completion to actual signal > > radeon_fence_default_wait, or because we hold the exclusive lock during a > > reset radeon_fence_check_lockup never runs and thus never triggers the > > fence or enables some irq lines. > > > > Is it worth digging through the interactions here to get dma-buf fences to > > work correctly during a lockup, or would it be better to just keep > > radeon_fence_wait_timeout? If option 1 is preferred, I'm happy to learn > > but I need some help learning how it is supposed to work. > > In this case feel free to add the radeon_fence_wait_timeout. It's > probably a good idea to get rid of the exclusive lock sooner or later in > radeon as well, but that really is a long term project. > > If you're really bored I can give you tons of input where to start with > that. That sounds like a good project to learn more about the driver, so I'd be interested. I don't have a lot of time right now to play with it though. I do want to finish off the tf2 crasher first as well. Once I have time I'll poke the list to get started. Thanks for the reviews! -- Matthew
Attachment:
signature.asc
Description: This is a digitally signed message part.
_______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel