On Thu, Jun 25, 2015 at 9:27 AM, Gupta, Sourab <sourab.gupta@xxxxxxxxx> wrote:
> On Thu, 2015-06-25 at 07:42 +0000, Daniel Vetter wrote:
>> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> > On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@xxxxxxxxx wrote:
>> > > > > From: Sourab Gupta <sourab.gupta@xxxxxxxxx>
>> > > > >
>> > > > > To collect timestamps around any GPU workload, we need to insert
>> > > > > commands to capture them into the ringbuffer. Therefore, during the
>> > > > > event stop call, we need to wait for the GPU to complete processing
>> > > > > the last request for which these commands were inserted.
>> > > > > We need to ensure this processing is done before the event_destroy
>> > > > > callback, which deallocates the buffer holding the data.
>> > > >
>> > > > There's no reason for this to be synchronous. You just need an active
>> > > > reference on the output buffer that is only released after the final
>> > > > request.
>> > >
>> > > Yeah, I think the interaction between OA sampling and GEM is the
>> > > critical piece here for both patch series. Step one is to have a
>> > > per-pmu lock to keep track of data private to OA and mmio based
>> > > sampling as a starting point. Then we need to figure out how to
>> > > structure the connection without OA/PMU and GEM trampling over each
>> > > other's feet.
>> > >
>> > > Wrt requests, those are refcounted already and just waiting on them
>> > > doesn't need dev->struct_mutex. That should be all you need, assuming
>> > > you do correctly refcount them ...
>> > > -Daniel
>> >
>> > Hi Daniel/Chris,
>> >
>> > Thanks for your feedback. I acknowledge the issue with increasing the
>> > coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
>> > track of data private to OA in my next version of the patches.
>>
>> If you create new locking schemes please coordinate with Rob about them.
>> Better if we come up with a consistent locking scheme for all of OA. That
>> ofc doesn't exclude that we'll have per-pmu locking, just that the locking
>> design should match if it makes sense. The example I'm thinking of is
>> drrs & psr, which both use the frontbuffer tracking code and both have
>> their own private mutex. But the overall approach to locking and cleaning
>> up the async work is the same.
>>
> Ok, I'll coordinate with Robert about a consistent locking scheme here.
>
>> > W.r.t. having the synchronous wait during event stop, I thought through
>> > the method of having an active reference on the output buffer. This will
>> > let us have a delayed buffer destroy (controlled by decrementing the
>> > output buffer refcount as and when GPU requests complete). This will be
>> > decoupled from the event stop here. But that is not the only thing at
>> > play here.
>> >
>> > The event stop is expected to disable the OA unit (which it is doing
>> > right now). In my usecase, I can't disable the OA unit until the GPU has
>> > processed the MI_RPC requests already scheduled, which may otherwise
>> > lead to hangs. So I'm waiting synchronously for the requests to complete
>> > before disabling the OA unit.
>> >
>> > Also, the event_destroy callback should be destroying the buffers
>> > associated with the event. (Right now, in the current design, the
>> > event_destroy callback waits for the above operations to be completed
>> > before proceeding to destroy the buffer.)
>> >
>> > If I try to create a function which destroys the output buffer once all
>> > references have been released, all these operations will have to be
>> > performed together in that function, in this serial order (so that when
>> > the refcount drops to 0, i.e. the requests have completed, the OA unit
>> > will be disabled and then the buffer destroyed). This would lead to
>> > these operations being decoupled from the event_stop and event_destroy
>> > functions, which may be unintended behaviour w.r.t. these callbacks
>> > from a userspace perspective.
>> >
>> > Robert, any thoughts?
>>
>> Yeah, no idea whether those perf callbacks are allowed to stall for a
>> longer time. If not we could simply push the cleanup/OA disabling into an
>> async work.
>> -Daniel

> Hi Daniel,
> Actually, right now the stop_event callback is not stalled. Instead I'm
> scheduling an async work fn to do the job. The event destroy is then
> waiting for the work fn to finish processing before proceeding to
> destroy the buffer.
> I'm also thinking through Chris' suggestions here and will try to come
> up with a solution based on them.

A notable detail here is that most pmu methods are called in atomic context
by events/core.c via inter-processor interrupts (as a somewhat unintended
side effect of userspace opening i915_oa events as specific-cpu events, perf
will make sure all methods are invoked on that specific cpu). This means
waiting in pmu->event_stop() isn't even an option for us, though as noted
it's currently only the destroy that waits for the completion of the stop
work fn. event_init() and event->destroy() are the two exceptions that are
called in process context.

That said, it does seem worth considering whether we can avoid waiting in
event_destroy() if it's not strictly necessary, and Chris's suggestion to
follow a refcounting scheme sounds workable.

Regards,
- Robert

>
> Thanks,
> Sourab
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
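
As an aside, purely to illustrate the refcounting idea discussed above: the
following is only a rough sketch, not code from the patches. Every name in
it (oa_rpc_buffer, i915_oa_disable_hw(), the *_get/_put helpers) is invented
for the example, and real code would pin a GEM object and hook into request
retirement rather than kzalloc a buffer. The point is just the shape: the
event holds one reference on the output buffer and each in-flight MI_RPC
request holds another, so OA disable and the buffer free happen after the
last request retires, without blocking in event_stop()/event_destroy().

	#include <linux/kernel.h>
	#include <linux/kref.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	struct oa_rpc_buffer {
		struct kref ref;		/* 1 for the event + 1 per in-flight request */
		struct work_struct release_work;
		void *vaddr;			/* placeholder for the CPU mapping of the buffer */
	};

	/* Hypothetical stand-in for programming the OA unit off (e.g. OACONTROL). */
	static void i915_oa_disable_hw(void)
	{
	}

	static void oa_rpc_buffer_release_work(struct work_struct *work)
	{
		struct oa_rpc_buffer *buf =
			container_of(work, struct oa_rpc_buffer, release_work);

		/* Every request that wrote into the buffer has retired by now. */
		i915_oa_disable_hw();
		kfree(buf->vaddr);
		kfree(buf);
	}

	static void oa_rpc_buffer_last_ref(struct kref *ref)
	{
		struct oa_rpc_buffer *buf =
			container_of(ref, struct oa_rpc_buffer, ref);

		/* The last put may come from request-retire/atomic context: defer. */
		schedule_work(&buf->release_work);
	}

	/* Taken when an MI_RPC command targeting the buffer is emitted. */
	static void oa_rpc_buffer_get(struct oa_rpc_buffer *buf)
	{
		kref_get(&buf->ref);
	}

	/* Dropped once by event stop/destroy and once per retired request. */
	static void oa_rpc_buffer_put(struct oa_rpc_buffer *buf)
	{
		kref_put(&buf->ref, oa_rpc_buffer_last_ref);
	}

	static struct oa_rpc_buffer *oa_rpc_buffer_create(size_t size)
	{
		struct oa_rpc_buffer *buf = kzalloc(sizeof(*buf), GFP_KERNEL);

		if (!buf)
			return NULL;

		kref_init(&buf->ref);		/* the event's own reference */
		INIT_WORK(&buf->release_work, oa_rpc_buffer_release_work);
		buf->vaddr = kzalloc(size, GFP_KERNEL);
		if (!buf->vaddr) {
			kfree(buf);
			return NULL;
		}
		return buf;
	}

With something along these lines, event_stop()/event_destroy() would only
drop the event's reference, and whether the OA unit gets disabled immediately
or after the outstanding requests retire falls out of the refcount rather
than a synchronous wait.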