On 13-Oct-22 2:17 AM, Peter Zijlstra wrote:
> On Wed, Oct 12, 2022 at 02:16:29PM +0200, Peter Zijlstra wrote:
>
>> That's the intent yeah. But due to not always holding ctx->mutex over
>> put_pmu_ctx() this might be moot. I'm almost through auditing epc usage
>> and I think ctx->lock is sufficient, fingers crossed.
>
> So the very last epc usage threw a spanner into the works and made
> things complicated.
>
> Specifically sys_perf_event_open()'s group_leader case uses
> event->pmu_ctx while only holding ctx->mutex. Therefore we can't fully
> let go of ctx->mutex locking and purely rely on ctx->lock.
>
> Now the good news is that the annoying put_pmu_ctx() without holding
> ctx->mutex case doesn't actually matter here. Since we hold a reference
> on the group_leader (per the filedesc) the event can't go away,
> therefore it must have a pmu_ctx, and then holding ctx->mutex ensures
> the pmu_ctx is stable -- iow it serializes against
> sys_perf_event_open()'s move_group and perf_pmu_migrate_context()
> changing the epc around.
>
> So we're going with the normal mutex+lock for modification rule, but
> allow the weird put_pmu_ctx() exception.
>
> I have the below delta.
>
> I'm hoping we can call this done -- I'm going to see if I can bribe Mark
> to take a look at the arm64 thing soon and then hopefully queue the
> whole thing once -rc1 happens. That should give us a good long soak
> until the next merge window.

Sounds good. Thanks for all the help!

I've glanced through the changes and they look fine; below are a few
minor points.

> + * Specificially, sys_perf_event_open()'s group_leader case depends on
> + * ctx->mutex pinning the configuration. Since we hold a reference on
> + * group_leader (through the filedesc) it can't fo away, therefore it's

typo: can't go away

> -	refcount_t			refcount;
> +	refcount_t			refcount; /* event <-> ctx */

Ok. We also need to remove all those // XXX get/put_ctx() markers from
the code, which we added back when refcount was a pmu_ctx <-> ctx count.
> +#define double_list_for_each_entry(pos1, pos2, head1, head2, member)	\
> +	for (pos1 = list_first_entry(head1, typeof(*pos1), member),	\
> +	     pos2 = list_first_entry(head2, typeof(*pos2), member);	\
> +	     !list_entry_is_head(pos1, head1, member) &&		\
> +	     !list_entry_is_head(pos2, head2, member);			\
> +	     pos1 = list_next_entry(pos1, member),			\
> +	     pos2 = list_next_entry(pos2, member))
> +
>  static void perf_event_swap_task_ctx_data(struct perf_event_context *prev_ctx,
>  					   struct perf_event_context *next_ctx)

While this is unrelated to this patch, shouldn't we also swap
event->hw.target? A purely hypothetical scenario: consider two processes
with cloned contexts (for example, two children of the same parent).
While switching between these two processes, the perf event contexts get
swapped, but event->hw.target keeps pointing to the other sibling's
task_struct. If either process exits just after a single context swap,
_free_event() will call put_task_struct() on the sibling process'
task_struct.

> @@ -12436,6 +12463,9 @@ SYSCALL_DEFINE5(perf_event_open,
>  		 * Allow the addition of software events to hw
>  		 * groups, this is safe because software events
>  		 * never fail to schedule.
>  		 *
>  		 * Note the comment that goes with struct
>  		 * pmu_event_pmu_context.

typo: perf_event_pmu_context

The good (or bad? ;)) news is that perf test and Vince's
perf_event_tests run fine on my machine without any regression.

Thanks,
Ravi