On Thu, Oct 05, 2017 at 03:30:12PM +0100, Chris Wilson wrote: > Quoting Daniel Vetter (2017-10-05 15:09:48) > > stop_machine is not really a locking primitive we should use, except > > when the hw folks tell us the hw is broken and that's the only way to > > work around it. > > > > This patch here is just a suggestion for how to fix it up, possible > > changes needed to make it actually work: > > > > - Set the nop_submit_request first for _all_ engines, before > > proceeding. > > > > - Make sure engine->cancel_requests copes with the possibility that > > not all tests have consistently used the new or old version. I dont > > think this is a problem, since the same can happen really with the > > stop_machine() locking - stop_machine also doesn't give you any kind > > of global ordering against other cpu threads, it just makes them > > stop. > > > > This patch tries to address the locking snafu from > > There's a locking snafu in the code? > > > commit 20e4933c478a1ca694b38fa4ac44d99e659941f5 > > Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > Date: Tue Nov 22 14:41:21 2016 +0000 > > > > drm/i915: Stop the machine as we install the wedged submit_request handler > > > > Chris said parts of the reasons for going with stop_machine() was that > > it's no overhead for the fast-path. > > More than that, you don't even have to think about it. It's a one off > event that changes execution paths. I actually never thought about > putting the lock mechanism around the caller (that does prevent the issue > I was dreading of being inside the callback as it changed), it is still > magic that has nothing to do with the code flow. What variable should we > document as being rcu protected, (*engine->submit_request)()? Yeah, engine->submit_request would be the one, except it shouldn't use rcu_assign_pointer/rcu_dereference since we don't need those barriers, ever - the .text doesn't change after all. That's why I didn't splatter the sparse annotations all over it. In a way we don't need the data barriers of rcu, but really only the "everyone has passed the critical section now" part of it. > I'm definitely not sold on having set-wedge dictate terms to the rest of > the code. Well, it sounds like tglx would very much prefer we clean this locking inversion up on our side and not get them to break the cycle. Not what I wanted Thomas to look at, but since I botched it and attached the wrong splat we got the answer for this patch here, and it seems to be "don't do that". -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx