On Tue, May 06, 2014 at 05:46:01PM +0300, Imre Deak wrote: > On Tue, 2014-05-06 at 12:59 +0100, Chris Wilson wrote: > > On Tue, May 06, 2014 at 02:42:26PM +0300, Imre Deak wrote: > > > On Tue, 2014-05-06 at 12:40 +0100, Chris Wilson wrote: > > > > On Tue, May 06, 2014 at 02:28:50PM +0300, Imre Deak wrote: > > > > > Currently user space can access GEM buffers mapped to GTT through > > > > > existing mappings concurrently while the platform specific suspend > > > > > handlers are running. Since these handlers may change the HW state in a > > > > > way that would break such accesses, remove the mappings before calling > > > > > the handlers. > > > > > > > > Hmm, but you never locked the device, so what is preventing those > > > > concurrent accesses from refaulting in GTT entires anyway. Please explain > > > > the context under which the runtime suspend code executes, and leave > > > > that explanation within easy reach of intel_runtime_suspend() - > > > > preferrably with testing of those assumptions. > > > > > > During faulting we take an RPM reference, so that avoids concurrent > > > re-faults. I could add a comment about this to the code. > > > > You are still iterating over lists that are not safe, right? Or are > > there more magic rpm references that prevent ioctls whilst > > intel_runtime_suspend is being called? > > Tbh I haven't checked this, since moving the unmapping earlier doesn't > make a difference in this regard. > > But it's a good point, I tried to audit now those paths. Currently the > assumption is that we hold an RPM reference everywhere where those lists > are changed. On the exec buffer path this is true, but for example in > i915_drop_caches_set() it's not. > > We could fix this by taking struct_mutex around > i915_gem_release_all_mmaps() in intel_runtime_suspend(). Doing that > needs some more auditing to make sure we can't deadlock somehow. At > first glance it seems that the driver always schedules a deferred work > and calls intel_runtime_suspend() from that, so I think it's fine. > > I suggest applying this patch regardless since the two issues are > separate. If I understand the situation correctly the runtime suspend function only ever gets called from a worker thread after the hysteris timeout expired. Which means we should be able to wrap _just_ the gtt pte shotdown with dev->struct_mutex and nothing else. Which is good since profileration of dev->struct_mutex is awful. On the resume side we don't need any locking since the gtt fault handler will first grab the runtime reference and also dev->struct_mutex. One issue which is looming is that this might deadlock. We might need a trylock in the runtime suspend function and abort the runtime suspend if we can't get the lock. Please test that lockdep catches this before we commit to a design. Just a very quick analysis, I didn't check the details so this might be horribly wrong. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx