On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@xxxxxxxxx wrote:
> > > > From: Sourab Gupta <sourab.gupta@xxxxxxxxx>
> > > >
> > > > To collect timestamps around any GPU workload, we need to insert
> > > > commands to capture them into the ringbuffer. Therefore, during the
> > > > stop event call, we need to wait for the GPU to complete processing
> > > > the last request for which these commands were inserted.
> > > > We need to ensure this processing is done before the event_destroy
> > > > callback, which deallocates the buffer holding the data.
> > >
> > > There's no reason for this to be synchronous. Just that you need an
> > > active reference on the output buffer to be only released after the
> > > final request.
> >
> > Yeah I think the interaction between OA sampling and GEM is the critical
> > piece here for both patch series. Step one is to have a per-pmu lock to
> > keep track of data private to OA and mmio based sampling as a starting
> > point. Then we need to figure out how to structure the connection without
> > OA/PMU and gem trampling over each other's feet.
> >
> > Wrt requests those are refcounted already and just waiting on them doesn't
> > need dev->struct_mutex. That should be all you need, assuming you do
> > correctly refcount them ...
> > -Daniel
>
> Hi Daniel/Chris,
>
> Thanks for your feedback. I acknowledge the issue with increasing the
> coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
> track of data private to OA in my next version of the patches.

If you create new locking schemes please coordinate with Rob about them.
Better if we come up with a consistent locking scheme for all of OA. That
ofc doesn't exclude that we'll have per-pmu locking, just that the locking
design should match if it makes sense. The example I'm thinking of is
drrs&psr which both use the frontbuffer tracking code and both have their
private mutex. But the overall approach to locking and cleaning up the
async work is the same.

> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having an active reference on the output buffer. This will
> let us have a delayed buffer destroy (controlled by decrementing the
> output buffer refcount as and when GPU requests complete), decoupled from
> the event stop. But that is not the only thing at play here.
>
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable the OA unit until the GPU has
> processed the already scheduled MI_RPC requests, which may otherwise
> lead to hangs.
> So, I'm waiting synchronously for requests to complete before disabling
> the OA unit.
>
> Also, the event_destroy callback should be destroying the buffers
> associated with the event. (Right now, in the current design, the
> event_destroy callback waits for the above operations to complete before
> proceeding to destroy the buffer.)
>
> If I try to create a function which destroys the output buffer once all
> references have been released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> the refcount drops to 0, i.e. the requests have completed, the OA unit is
> disabled and then the buffer destroyed). This would decouple these
> operations from the event_stop and event_destroy functions.
> This may be unintended behaviour w.r.t. these callbacks from the
> userspace perspective.
>
> Robert, any thoughts?
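For illustration, here is a rough sketch of the refcounted release Chris
and Sourab are discussing. The i915_oa_buf wrapper, its fields and the
i915_oa_disable()/i915_oa_buf_put() helpers are hypothetical placeholders,
not code from the patches; the sketch only shows how the serial order
(requests complete -> OA unit disabled -> buffer freed) could hang off the
final kref_put().

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

/*
 * Hypothetical wrapper around the MI_RPC/OA destination buffer.  Every
 * in-flight request that writes into the buffer holds a reference, and
 * event_stop/event_destroy drop only their own reference.
 */
struct i915_oa_buf {
	struct kref ref;
	struct drm_i915_private *dev_priv;
	struct drm_i915_gem_object *obj;
};

static void i915_oa_buf_release(struct kref *ref)
{
	struct i915_oa_buf *buf = container_of(ref, struct i915_oa_buf, ref);

	/*
	 * The refcount reached zero, i.e. every request writing into the
	 * buffer has completed, so it is now safe to disable the OA unit
	 * and free the backing storage.
	 */
	i915_oa_disable(buf->dev_priv);			/* placeholder */
	drm_gem_object_unreference_unlocked(&buf->obj->base);
	kfree(buf);
}

/* Called from event_stop/event_destroy and from request retirement. */
static void i915_oa_buf_put(struct i915_oa_buf *buf)
{
	kref_put(&buf->ref, i915_oa_buf_release);
}

With something along these lines the perf callbacks only drop a reference,
and the actual teardown happens whenever the last request retires.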
Yeah, no idea whether those perf callbacks are allowed to stall for a
longer time. If not, we could simply push the cleanup/OA disabling into an
async work.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
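A rough sketch of the async-work idea, again with hypothetical names
(i915_oa_event, to_oa_event(), i915_oa_wait_last_request()) and reusing
i915_oa_buf_put() from the sketch above; the point is only that the perf
callback returns immediately and any waiting for the GPU happens in the
worker.

#include <linux/perf_event.h>
#include <linux/workqueue.h>

/*
 * Hypothetical per-event OA bookkeeping; destroy_work is set up at event
 * init with INIT_WORK(&oa->destroy_work, oa_destroy_work_fn).
 */
struct i915_oa_event {
	struct i915_oa_buf *buf;
	struct work_struct destroy_work;
};

static void oa_destroy_work_fn(struct work_struct *work)
{
	struct i915_oa_event *oa =
		container_of(work, struct i915_oa_event, destroy_work);

	/* Stall here, outside the perf callback, if we have to. */
	i915_oa_wait_last_request(oa);	/* placeholder for the real wait */
	i915_oa_buf_put(oa->buf);	/* disables OA, frees the buffer */
}

static void i915_oa_event_destroy(struct perf_event *event)
{
	struct i915_oa_event *oa = to_oa_event(event);	/* placeholder lookup */

	/* Never block in the callback; defer the teardown to a worker. */
	schedule_work(&oa->destroy_work);
}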