On Thu, Jun 25, 2015 at 7:02 AM, Gupta, Sourab <sourab.gupta@xxxxxxxxx> wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@xxxxxxxxx wrote:
>> > > From: Sourab Gupta <sourab.gupta@xxxxxxxxx>
>> > >
>> > > To collect timestamps around any GPU workload, we need to insert
>> > > commands to capture them into the ringbuffer. Therefore, during the
>> > > stop event call, we need to wait for the GPU to complete processing
>> > > the last request for which these commands were inserted.
>> > > We need to ensure this processing is done before the event_destroy
>> > > callback, which deallocates the buffer holding the data.
>> >
>> > There's no reason for this to be synchronous. Just that you need an
>> > active reference on the output buffer that is only released after the
>> > final request.
>>
>> Yeah, I think the interaction between OA sampling and GEM is the
>> critical piece here for both patch series. Step one is to have a
>> per-pmu lock to keep track of data private to OA and mmio based
>> sampling as a starting point. Then we need to figure out how to
>> structure the connection without OA/PMU and gem trampling over each
>> other's feet.

From the initial i915_oa code, I think the notable locks are:

- The struct perf_event_context::lock spin lock is taken by
  events/core.c before calling any of our pmu methods (with the
  exception of event_init and event->destroy()), so we at least don't
  need to worry about pmu methods trampling each other.

- I think we should only take struct_mutex to meet the existing
  assumptions of the gem code we're using. Since most pmu methods run in
  interrupt context, our calls into gem are generally constrained to
  event_init or event->destroy(), without having to use workers.
- We have a pmu spin lock to protect OA register state, notably because
  besides the pmu methods we also have hooks into gem context switching
  and pinning/unpinning that may trigger updates to the OACONTROL state,
  vs. e.g. disabling OACONTROL in pmu->stop(), which may run
  concurrently on different cpus.

- We have an oa_buffer.flush_lock spin lock since, in addition to an
  hrtimer periodically forwarding samples from the OABUFFER to perf,
  userspace may also explicitly request a flush via fsync(), or we might
  flush at pmu->stop(). As a further detail here, if the hrtimer
  contends with a userspace flush, the hrtimer won't ever actually wait
  on flush_lock, on the assumption that a back-to-back flush isn't
  necessary when the buffer will probably be empty.

I have to admit I haven't really scrutinized the locking details of
Sourab's work yet, but it may make sense to introduce some additional
exclusion scheme for managing access to the fifo of pending RPC
commands.

I need to double check with Sourab, but I think we'd already discussed
some changes to how the forwarding of these RPC based reports will be
managed, where I thought we might be able to avoid the need for a worker
and struct_mutex locking while forwarding. I think Sourab may have just
preferred to send something out before this refactor to try and solicit
broader, high level feedback on his work sooner. I imagine if we do
refactor to avoid the worker for forwarding, though, that will affect
the locking details regarding the fifo structure.

Regards,
- Robert

>> Wrt requests, those are refcounted already, and just waiting on them
>> doesn't need dev->struct_mutex. That should be all you need, assuming
>> you do correctly refcount them ...
>> -Daniel
>
> Hi Daniel/Chris,
>
> Thanks for your feedback. I acknowledge the issue with increasing the
> coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
> track of data private to OA in my next version of the patches.
>
> W.r.t.
> having the synchronous wait during event stop, I thought through the
> method of having an active reference on the output buffer. This will
> let us have a delayed buffer destroy (controlled by decrementing the
> output buffer refcount as and when the gpu requests complete), which
> would be decoupled from the event stop here. But that is not the only
> thing at play here.
>
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable the OA unit until the GPU
> has processed the already scheduled MI_RPC requests, which may
> otherwise lead to hangs. So I'm waiting synchronously for the requests
> to complete before disabling the OA unit.
>
> Also, the event_destroy callback should be destroying the buffers
> associated with the event. (Right now, in the current design, the
> event_destroy callback waits for the above operations to complete
> before proceeding to destroy the buffer.)
>
> If I try to create a function which destroys the output buffer as its
> references are released, all these operations will have to be performed
> together in that function, in this serial order (so that when the
> refcount drops to 0, i.e. the requests have completed, the OA unit is
> disabled first and the buffer destroyed afterwards). This would
> decouple these operations from the event_stop and event_destroy
> functions, which may be non-intentional behaviour w.r.t. these
> callbacks from a userspace perspective.
>
> Robert, any thoughts?

I think a few pertinent details here that inform how we need to handle
this are:

1) Almost all the pmu methods are called in atomic context (except
event_init), as they are invoked by events/core.c via an inter-processor
interrupt, so waiting for a completion in event_destroy isn't really an
option for us.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
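As an aside, the refcounting scheme discussed above (Chris's "active
reference on the output buffer" plus Sourab's disable-before-free
ordering) can be sketched in a few lines. This is a hedged userspace
model, not the real i915 code: every name here (oa_dest_buffer,
oa_unit_enabled, etc.) is invented for illustration, and a C11 atomic
stands in for the kernel's refcounting helpers. Each in-flight MI_RPC
request holds a reference; the final put, wherever it happens, disables
the simulated OA unit first and only then frees the buffer:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct oa_dest_buffer {
    atomic_int refcount;   /* 1 for the event + 1 per in-flight request */
    void *data;
    size_t size;
};

static bool oa_unit_enabled = true;   /* stands in for OACONTROL state  */
static bool oa_buffer_freed = false;  /* observability for this sketch  */

static struct oa_dest_buffer *oa_dest_buffer_create(size_t size)
{
    struct oa_dest_buffer *buf = malloc(sizeof(*buf));
    atomic_init(&buf->refcount, 1);   /* creation ref, owned by the event */
    buf->data = malloc(size);
    buf->size = size;
    return buf;
}

static void oa_dest_buffer_get(struct oa_dest_buffer *buf)
{
    atomic_fetch_add(&buf->refcount, 1);  /* e.g. on request submission */
}

static void oa_dest_buffer_put(struct oa_dest_buffer *buf)
{
    if (atomic_fetch_sub(&buf->refcount, 1) == 1) {
        /* Last reference dropped: every request that could still write
         * into the buffer has retired, so this ordering is now safe. */
        oa_unit_enabled = false;  /* 1. disable the OA unit   */
        free(buf->data);          /* 2. only then free storage */
        oa_buffer_freed = true;
        free(buf);
    }
}
```

In this model event->destroy() just drops the creation reference and
returns immediately, which sidesteps the atomic-context problem of
waiting for a completion; the actual teardown happens whenever the final
request retires.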
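Similarly, the oa_buffer.flush_lock behaviour described earlier in the
thread (the hrtimer skipping its flush when it contends with a
userspace-initiated one) amounts to a trylock on the timer path and a
full acquire on the fsync()/pmu->stop() path. Again a hedged userspace
model with invented names, using a C11 atomic_flag in place of a kernel
spin lock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag flush_lock = ATOMIC_FLAG_INIT;
static int flushes_done;
static int flushes_skipped;

/* Periodic timer path: never waits on the lock. */
static bool hrtimer_flush(void)
{
    if (atomic_flag_test_and_set(&flush_lock)) {
        flushes_skipped++;   /* contended: a flush is already running,
                              * and a back-to-back flush of a probably
                              * empty buffer isn't worth waiting for */
        return false;
    }
    flushes_done++;          /* ... forward samples from OABUFFER ... */
    atomic_flag_clear(&flush_lock);
    return true;
}

/* fsync()/pmu->stop() path: must complete, so it spins until it
 * owns the lock. */
static void explicit_flush(void)
{
    while (atomic_flag_test_and_set(&flush_lock))
        ;                    /* spin */
    flushes_done++;          /* ... forward samples ... */
    atomic_flag_clear(&flush_lock);
}
```

The asymmetry is the point: only the path that owes userspace a result
is allowed to wait, so the timer can never stall behind an explicit
flush.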