On Thu, Jun 25, 2015 at 9:02 AM, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@xxxxxxxxx wrote:
>> > > > From: Sourab Gupta <sourab.gupta@xxxxxxxxx>
>> > > >
>> > > > To collect timestamps around any GPU workload, we need to insert
>> > > > commands to capture them into the ringbuffer. Therefore, during the
>> > > > stop event call, we need to wait for the GPU to complete processing
>> > > > the last request for which these commands were inserted.
>> > > > We need to ensure this processing is done before the event_destroy
>> > > > callback, which deallocates the buffer holding the data.
>> > >
>> > > There's no reason for this to be synchronous. Just that you need an
>> > > active reference on the output buffer to be only released after the
>> > > final request.
>> >
>> > Yeah I think the interaction between OA sampling and GEM is the critical
>> > piece here for both patch series. Step one is to have a per-pmu lock to
>> > keep track of data private to OA and mmio based sampling as a starting
>> > point. Then we need to figure out how to structure the connection without
>> > OA/PMU and gem trampling over each other's feet.
>> >
>> > Wrt requests, those are refcounted already and just waiting on them doesn't
>> > need dev->struct_mutex. That should be all you need, assuming you do
>> > correctly refcount them ...
>> > -Daniel
>>
>> Hi Daniel/Chris,
>>
>> Thanks for your feedback. I acknowledge the issue with increasing the
>> coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
>> track of data private to OA in my next version of the patches.
>>
>> W.r.t. having the synchronous wait during event stop, I thought through
>> the method of having an active reference on the output buffer.
>> This will let us have a delayed buffer destroy (controlled by the
>> decrement of the output buffer refcount as and when GPU requests
>> complete). This will be decoupled from the event stop here. But that is
>> not the only thing at play here.
>>
>> The event stop is expected to disable the OA unit (which it is doing
>> right now). In my usecase, I can't disable the OA unit until the GPU has
>> processed the MI_RPC requests already scheduled, as doing so may
>> otherwise lead to hangs.
>> So, I'm waiting synchronously for requests to complete before disabling
>> the OA unit.
>
> There's no issue here. Just hook into the retire-notification callback
> for the last active request of the oa buffer. If you are using LRI to
> disable the OA Context writing to the buffer, you can simply create a
> request for the LRI and use the active reference as the retire
> notification. (Otoh, if the oa buffer is inactive you can simply do the
> decoupling immediately.) You shouldn't need an object-free-notification
> callback aiui.
>
>> Also, the event_destroy callback should be destroying the buffers
>> associated with the event. (Right now, in the current design, the
>> event_destroy callback waits for the above operations to be completed
>> before proceeding to destroy the buffer.)
>>
>> If I try to create a function which destroys the output buffer once all
>> references have been released, all these operations will have to be
>> performed together in that function, in this serial order (so that when
>> the refcount drops to 0, i.e. requests have completed, the OA unit will
>> be disabled, and then the buffer destroyed). This would lead to these
>> operations being decoupled from the event_stop and event_destroy
>> functions. This may be non-intentional behaviour w.r.t. these callbacks
>> from a userspace perspective.
>
> When the perf event is stopped, you stop generating perf events.
> But the buffer may still be alive, and may be resurrected if a new event
> is started, but you will want to ensure that pending oa writes are
> ignored.
> -Chris

Just to summarise some things I know Sourab discussed with Chris on IRC
and I then also discussed with Sourab offline.

It does look like we can avoid any waiting in event->destroy(), decoupling
the tracking of outstanding RPC commands + oacontrol disable from the event
active status. There's no strict reason OACONTROL needs to be disabled when
stopping or destroying an event if there are still pending RPC commands; we
just need to stop forwarding samples while disabled and discard reports that
land after an event is destroyed.

We can also update the hw.state and active state before OACONTROL is
disabled, whereas currently the code updates this state outside of core
perf's context lock, which isn't good, and we might also end up continuing
to forward periodic samples from the oabuffer to perf when the user expects
the event to be disabled.

There was a discussion of updating oacontrol via LRIs, but my concern here
is with wanting to update oacontrol state for periodic sampling use cases
where it would seem awkward to handle that via LRIs. Enabling/disabling
periodic sampling can be considered orthogonal to fully disabling oacontrol,
and it will be useful to use periodic sampling in conjunction with RPC
sampling. When we stop an event we should be able to immediately stop
periodic sampling, even if we must leave OA enabled for a pending RPC
command. If the event is started again we can quickly re-enable periodic
sampling, but it might be awkward to have to consider an in-flight LRI that
we know is going to disable oacontrol soon.

Related to using LRIs, though, was the idea of using the LRI request to help
manage the lifetime of the destination buffer, by adding the vmas of the
dest buffer to the request's active list.
It seems like we should be able to do this kind of thing with the requests
associated with the MI_RPC commands instead.

Thinking about the details of waiting for the last RPC command before
destroying the dest buffer and disabling OACONTROL, these are the
requirements I see:

- we want to free the dest buffer in a finite time, since it's large (i.e.
  we don't want to assume it's ok to keep it around once allocated)
- we must wait for the last RPC commands to complete before disabling
  oacontrol (and can free the dest buffer at the same point)
- in any subsequent event_init() we need to be able to verify we don't
  /still/ have pending RPC commands, because an incompatible oacontrol
  requirement at this point is a corner case where we really do have to
  block and wait for those RPC commands to complete, or say return -EBUSY.

Without the retire-notification api that Chris has been experimenting with
(which could provide us a convenient callback when our RPC commands retire),
the current plan is to still schedule some work in event->destroy() to wait
for the last outstanding RPC command to retire before disabling oacontrol as
well as freeing the RPC destination buffer, but notably there's no need to
block on its completion. I imagine this could later easily be adapted to
work with a retire-notification api, and maybe Sourab could even experiment
with Chris's wip code for this to compare.

In the end, I think the only time we'll really need to block waiting for
outstanding requests is in event_init(), in cases where there are still
pending RPC commands and the previous oacontrol report-format is smaller
than the newly requested format (which I guess should basically never happen
in practice, since most users will be requesting the largest report
format... and technically I suppose there are even lots of cases where you
could safely allow the report size to increase if you know there's adequate
space in the old dest buffer; though probably not worth fussing about).
Besides avoiding blocking in event->destroy(), I'd been thinking we wouldn't
need the worker for forwarding the RPC reports to perf and could instead
just use a hrtimer like we do for periodic samples. Talking this through
with Sourab though, he tried this, and it looks like it would be awkward due
to needing to drop the references on request structs, which may in turn need
to take struct_mutex to finally free.

In terms of locking while forwarding samples and in the insert_cmd hooks,
for the fifo structures tracking pending commands we should stop using
struct_mutex (except if necessary for calling into gem), and we should also
avoid locking around all of the forwarding work with any mutex that (if
anything) just needs to protect updates to the fifo itself. I guess there's
some reasonable lock-free way to handle adding and removing from these
fifos, but I haven't considered that carefully and don't have a strong
opinion on the particular details here.

Ok, I think that pretty much summarises what was discussed offline.

Regards,
- Robert

>
> --
> Chris Wilson, Intel Open Source Technology Centre

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx