On Tue, Feb 11, 2025 at 03:26:04PM -0800, Chun-Tse Shao wrote:
> Hi Namhyung, thanks for your review first!
>
> On Fri, Jan 31, 2025 at 3:48 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> >
> > On Wed, Jan 29, 2025 at 09:21:36PM -0800, Chun-Tse Shao wrote:
> > > Tracks lock contention by tracing owner callstacks in
> > > `contention_begin()` and `contention_end()`, storing data in the
> > > owner_stat BPF map. `contention_begin()` records the owner and their
> > > callstack. `contention_end()` updates contention statistics (count,
> > > time), decrements the waiter count, and removes the record when no
> > > waiters remain. Statistics are also updated if the owner's callstack
> > > changes. Note that retrieval of the owner and its callstack may fail.
> > >
> > > To elaborate the process in detail:
> > >   /*
> > >    * In `contention_begin()`, the current task is the lock waiter. We
> > >    * create/update `owner_data` for the given `lock` address.
> > >    */
> > >   contention_begin() {
> > >     Try to get the owner task. Skip the entire process if it fails.
> > >     Try to get the owner stack based on the task. Use an empty stack
> > >     if it fails.
> > >     Store the owner stack into `owner_stacks` and create a `stack_id`.
> > >     If it fails to store, use a negative `stack_id`, which will be
> > >     ignored while reporting in usermode.
> > >     Retrieve `owner_tracing_data` in `owner_data` with the given
> > >     `lock` address.
> > >
> > >     /*
> > >      * The first case means contention has just begun, or the owner
> > >      * information is mismatched, so we just drop the previous record.
> > >      */
> > >     if (`owner_tracing_data` does not exist ||
> > >         the recorded owner `pid` does not match the detected owner) {
> > >       Create `owner_tracing_data` with info from the detected owner,
> > >       and store it in `owner_data` with the `lock` address as the key.
> > >     }
> > >     /*
> > >      * The second case means contention is ongoing. One more waiter is
> > >      * joining the lock contention. Both `owner_data` and `owner_stat`
> > >      * should be updated.
> > >      */
> > >     else {
> > >       `owner_tracing_data.count`++
> > >
> > >       Create a `contention_key` with the owner `stack_id` and look up
> > >       `contention_data` in `owner_stat`.
> > >       if (`contention_data` does not exist) {
> > >         Create a new entry for `contention_key`:`contention_data` in
> > >         `owner_stat`.
> > >       }
> > >       else {
> > >         Update the `count` and `total_time` in the existing
> > >         `contention_data`.
> > >       }
> > >
> > >       Update `timestamp` and `stack_id` in `owner_tracing_data`.
> > >     }
> > >   }
> > >
> > >   /*
> > >    * In `contention_end()`, the current task will be the new owner of
> > >    * the `lock`, if `ctx[1]` is not 0.
> > >    */
> > >   contention_end() {
> > >     Look up `owner_tracing_data` in `owner_data` with the key `lock`.
> > >
> > >     Create a `contention_key` with `owner_tracing_data.stack_id` and
> > >     look up `contention_data` in `owner_stat`.
> > >     if (`contention_data` does not exist) {
> > >       Create a new entry for `contention_key`:`contention_data` in
> > >       `owner_stat`.
> > >     }
> > >     else {
> > >       Update the `count` and `total_time` in the existing
> > >       `contention_data`.
> > >     }
> > >
> > >     /*
> > >      * There are no more waiters, contention is over, delete the record.
> > >      */
> > >     if (`owner_tracing_data.count` <= 1) {
> > >       delete this record in `owner_data`.
> > >     }
> > >     /*
> > >      * Contention is still ongoing.
> > >      */
> > >     else {
> > >       `owner_tracing_data.count`--
> > >
> > >       if (`!ctx[1]`) {
> > >         The current task exits without acquiring the lock. However, we
> > >         check whether the recorded owner's stack has changed, and
> > >         update `owner_data` and `owner_stat` accordingly.
> > >       }
> > >       else {
> > >         The current task is the new owner; retrieve its stack, store
> > >         it in `owner_stacks` and update `owner_tracing_data`.
> > >       }
> > >     }
> > >   }
> >
> > I think this is too much detail. :)
> >
> > I'd say something like this:
> >
> > This implements per-callstack aggregation of lock owners in addition to
> > per-thread. The owner callstack is captured using bpf_get_task_stack()
> > at contention_begin and it also adds a custom stackid function for the
> > owner stacks to be compared easily.
> >
> > The owner info is kept in a hash map using the lock addr as a key to
> > handle multiple waiters for the same lock. At contention_end, it updates
> > the owner lock stat based on the info that was saved at contention_begin.
> > If there are more waiters, it'd update the owner pid to itself as
> > contention_end means it gets the lock now. But it also needs to check
> > the return value of the lock function in case the task was killed by a
> > signal or something.
>
> Thanks, I will just reuse this description. :D
>
> > >
> > > Signed-off-by: Chun-Tse Shao <ctshao@xxxxxxxxxx>
> > > ---
> > >  .../perf/util/bpf_skel/lock_contention.bpf.c  | 248 +++++++++++++++++-
> > >  1 file changed, 247 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > index 23fe9cc980ae..b12df873379f 100644
> > > --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > @@ -197,6 +197,9 @@ int data_fail;
> > >  int task_map_full;
> > >  int data_map_full;
> > >
> > > +struct task_struct *bpf_task_from_pid(s32 pid) __ksym __weak;
> > > +void bpf_task_release(struct task_struct *p) __ksym __weak;
> > > +
> > >  static inline __u64 get_current_cgroup_id(void)
> > >  {
> > >  	struct task_struct *task;
> > > @@ -420,6 +423,26 @@ static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
> > >  	return pelem;
> > >  }
> > >
> > > +static inline s32 get_owner_stack_id(u64 *stacktrace)
> > > +{
> > > +	s32 *id, new_id;
> > > +	static s64 id_gen = 1;
> > > +
> > > +	id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > > +	if (id)
> > > +		return *id;
> > > +
> > > +	new_id = (s32)__sync_fetch_and_add(&id_gen, 1);
> > > +
> > > +	bpf_map_update_elem(&owner_stacks, stacktrace, &new_id, BPF_NOEXIST);
> > > +
> > > +	id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > > +	if (id)
> > > +		return *id;
> > > +
> > > +	return -1;
> > > +}
> > > +
> > >  SEC("tp_btf/contention_begin")
> > >  int contention_begin(u64 *ctx)
> > >  {
> > > @@ -437,6 +460,97 @@ int contention_begin(u64 *ctx)
> > >  	pelem->flags = (__u32)ctx[1];
> > >
> > >  	if (needs_callstack) {
> > > +		u32 i = 0;
> > > +		u32 id = 0;
> > > +		int owner_pid;
> > > +		u64 *buf;
> > > +		struct task_struct *task;
> > > +		struct owner_tracing_data *otdata;
> > > +
> > > +		if (!lock_owner)
> > > +			goto skip_owner_begin;
> > > +
> > > +		task = get_lock_owner(pelem->lock, pelem->flags);
> > > +		if (!task)
> > > +			goto skip_owner_begin;
> > > +
> > > +		owner_pid = BPF_CORE_READ(task, pid);
> > > +
> > > +		buf = bpf_map_lookup_elem(&stack_buf, &i);
> > > +		if (!buf)
> > > +			goto skip_owner_begin;
> > > +		for (i = 0; i < max_stack; i++)
> > > +			buf[i] = 0x0;
> > > +
> > > +		if (bpf_task_from_pid) {
> >
> > I think you can do this instead:
> >
> >	if (bpf_task_from_pid == NULL)
> >		goto skip_owner_begin;
> >
> > nit: it can be just 'skip_owner'. :)
> >
> > > +			task = bpf_task_from_pid(owner_pid);
> > > +			if (task) {
> > > +				bpf_get_task_stack(task, buf, max_stack * sizeof(unsigned long), 0);
> > > +				bpf_task_release(task);
> > > +			}
> > > +		}
> > > +
> > > +		otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > > +		id = get_owner_stack_id(buf);
> > > +
> > > +		/*
> > > +		 * Contention has just begun, or in the corner case `lock` is owned by a
> > > +		 * process other than `owner_pid`. For the corner case we treat it as an
> > > +		 * unexpected internal error and just ignore the previous tracing record.
> > > +		 */
> > > +		if (!otdata || otdata->pid != owner_pid) {
> > > +			struct owner_tracing_data first = {
> > > +				.pid = owner_pid,
> > > +				.timestamp = pelem->timestamp,
> > > +				.count = 1,
> > > +				.stack_id = id,
> > > +			};
> > > +			bpf_map_update_elem(&owner_data, &pelem->lock, &first, BPF_ANY);
> > > +		}
> > > +		/* Contention is ongoing and a new waiter joins */
> > > +		else {
> > > +			__sync_fetch_and_add(&otdata->count, 1);
> > > +
> > > +			/*
> > > +			 * The owner is the same, but the stacktrace might have changed. In
> > > +			 * this case we store/update `owner_stat` based on the current owner
> > > +			 * stack id.
> > > +			 */
> > > +			if (id != otdata->stack_id) {
> > > +				u64 duration = otdata->timestamp - pelem->timestamp;
> >
> > Isn't it the opposite?
> >
> >	u64 duration = pelem->timestamp - otdata->timestamp;
> >
> > > +				struct contention_key ckey = {
> > > +					.stack_id = id,
> > > +					.pid = 0,
> > > +					.lock_addr_or_cgroup = 0,
> > > +				};
> > > +				struct contention_data *cdata =
> > > +					bpf_map_lookup_elem(&owner_stat, &ckey);
> > > +
> > > +				if (!cdata) {
> > > +					struct contention_data first = {
> > > +						.total_time = duration,
> > > +						.max_time = duration,
> > > +						.min_time = duration,
> > > +						.count = 1,
> > > +						.flags = pelem->flags,
> > > +					};
> > > +					bpf_map_update_elem(&owner_stat, &ckey, &first,
> > > +							    BPF_NOEXIST);
> > > +				} else {
> > > +					__sync_fetch_and_add(&cdata->total_time, duration);
> > > +					__sync_fetch_and_add(&cdata->count, 1);
> > > +
> > > +					/* FIXME: need atomic operations */
> > > +					if (cdata->max_time < duration)
> > > +						cdata->max_time = duration;
> > > +					if (cdata->min_time > duration)
> > > +						cdata->min_time = duration;
> > > +				}
> >
> > And as I said before, can we move this block out as a function?
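> >
> > For illustration, roughly something like this (a sketch only; the
> > helper name update_contention_data is made up here, and it just
> > factors out the repeated block above unchanged):
> >
> >	/* Sketch of a helper factoring out the owner_stat update. */
> >	static inline void update_contention_data(struct contention_key *ckey,
> >						  u64 duration, u32 flags)
> >	{
> >		struct contention_data *cdata;
> >
> >		cdata = bpf_map_lookup_elem(&owner_stat, ckey);
> >		if (!cdata) {
> >			/* First sample for this key: seed all fields. */
> >			struct contention_data first = {
> >				.total_time = duration,
> >				.max_time = duration,
> >				.min_time = duration,
> >				.count = 1,
> >				.flags = flags,
> >			};
> >
> >			bpf_map_update_elem(&owner_stat, ckey, &first, BPF_NOEXIST);
> >		} else {
> >			__sync_fetch_and_add(&cdata->total_time, duration);
> >			__sync_fetch_and_add(&cdata->count, 1);
> >
> >			/* FIXME: need atomic operations */
> >			if (cdata->max_time < duration)
> >				cdata->max_time = duration;
> >			if (cdata->min_time > duration)
> >				cdata->min_time = duration;
> >		}
> >	}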
> >
> > > +
> > > +				otdata->timestamp = pelem->timestamp;
> > > +				otdata->stack_id = id;
> > > +			}
> > > +		}
> > > +skip_owner_begin:
> > >  		pelem->stack_id = bpf_get_stackid(ctx, &stacks,
> > >  						  BPF_F_FAST_STACK_CMP | stack_skip);
> > >  		if (pelem->stack_id < 0)
> > > @@ -473,6 +587,7 @@ int contention_end(u64 *ctx)
> > >  	struct tstamp_data *pelem;
> > >  	struct contention_key key = {};
> > >  	struct contention_data *data;
> > > +	__u64 timestamp;
> > >  	__u64 duration;
> > >  	bool need_delete = false;
> > >
> > > @@ -500,12 +615,143 @@ int contention_end(u64 *ctx)
> > >  		need_delete = true;
> > >  	}
> > >
> > > -	duration = bpf_ktime_get_ns() - pelem->timestamp;
> > > +	timestamp = bpf_ktime_get_ns();
> > > +	duration = timestamp - pelem->timestamp;
> > >  	if ((__s64)duration < 0) {
> > >  		__sync_fetch_and_add(&time_fail, 1);
> > >  		goto out;
> > >  	}
> > >
> > > +	if (needs_callstack && lock_owner) {
> > > +		u64 owner_time;
> > > +		struct contention_key ckey = {};
> > > +		struct contention_data *cdata;
> > > +		struct owner_tracing_data *otdata;
> > > +
> > > +		otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > > +		if (!otdata)
> > > +			goto skip_owner_end;
> > > +
> > > +		/* Update `owner_stat` */
> > > +		owner_time = timestamp - otdata->timestamp;
> > > +		ckey.stack_id = otdata->stack_id;
> > > +		cdata = bpf_map_lookup_elem(&owner_stat, &ckey);
> > > +
> > > +		if (!cdata) {
> > > +			struct contention_data first = {
> > > +				.total_time = owner_time,
> > > +				.max_time = owner_time,
> > > +				.min_time = owner_time,
> > > +				.count = 1,
> > > +				.flags = pelem->flags,
> > > +			};
> > > +			bpf_map_update_elem(&owner_stat, &ckey, &first, BPF_NOEXIST);
> > > +		} else {
> > > +			__sync_fetch_and_add(&cdata->total_time, owner_time);
> > > +			__sync_fetch_and_add(&cdata->count, 1);
> > > +
> > > +			/* FIXME: need atomic operations */
> > > +			if (cdata->max_time < owner_time)
> > > +				cdata->max_time = owner_time;
> > > +			if (cdata->min_time > owner_time)
> > > +				cdata->min_time = owner_time;
> > > +		}
> > > +
> > > +		/* No contention is occurring, delete the `lock` entry in `owner_data` */
> > > +		if (otdata->count <= 1)
> > > +			bpf_map_delete_elem(&owner_data, &pelem->lock);
> > > +		/*
> > > +		 * Contention is still ongoing, with a new owner (current task).
> > > +		 * `owner_data` should be updated accordingly.
> > > +		 */
> > > +		else {
> > > +			u32 i = 0;
> > > +			u64 *buf;
> > > +
> > > +			__sync_fetch_and_add(&otdata->count, -1);
> > > +
> > > +			buf = bpf_map_lookup_elem(&stack_buf, &i);
> > > +			if (!buf)
> > > +				goto skip_owner_end;
> > > +			for (i = 0; i < (u32)max_stack; i++)
> > > +				buf[i] = 0x0;
> > > +
> > > +			/*
> > > +			 * ctx[1] has the return code of the lock function.
> >
> > Then I think it's clearer to have a local variable named 'ret' or so.
> >
> > > +			 * If ctx[1] is not 0, the current task terminates lock waiting
> > > +			 * without acquiring it. The owner is not changed, but we still
> > > +			 * need to update the owner stack.
> > > +			 */
> > > +			if (!ctx[1]) {
> >
> > This doesn't match the comment. It should be:
> >
> >	if (ret < 0) {
> >
> > > +				s32 id = 0;
> > > +				struct task_struct *task = NULL;
> > > +
> > > +				if (bpf_task_from_pid)
> >
> > Same as the above. No need to go down if you cannot get the task and
> > stack.
> >
> >	if (bpf_task_from_pid == NULL)
> >		goto skip_owner;
> >
> > > +					task = bpf_task_from_pid(otdata->pid);
> > > +
> > > +				if (task) {
> > > +					bpf_get_task_stack(task, buf,
> > > +							   max_stack * sizeof(unsigned long), 0);
> > > +					bpf_task_release(task);
> > > +				}
> > > +
> > > +				id = get_owner_stack_id(buf);
> > > +
> > > +				/*
> > > +				 * If the owner stack has changed, update `owner_data` and
> > > +				 * `owner_stat` accordingly.
> > > +				 */
> > > +				if (id != otdata->stack_id) {
> > > +					u64 duration = otdata->timestamp - pelem->timestamp;
> > > +					struct contention_key ckey = {
> > > +						.stack_id = id,
> > > +						.pid = 0,
> > > +						.lock_addr_or_cgroup = 0,
> > > +					};
> > > +					struct contention_data *cdata =
> > > +						bpf_map_lookup_elem(&owner_stat, &ckey);
> > > +
> > > +					if (!cdata) {
> > > +						struct contention_data first = {
> > > +							.total_time = duration,
> > > +							.max_time = duration,
> > > +							.min_time = duration,
> > > +							.count = 1,
> > > +							.flags = pelem->flags,
> > > +						};
> > > +						bpf_map_update_elem(&owner_stat, &ckey, &first,
> > > +								    BPF_NOEXIST);
> > > +					} else {
> > > +						__sync_fetch_and_add(&cdata->total_time, duration);
> > > +						__sync_fetch_and_add(&cdata->count, 1);
> > > +
> > > +						/* FIXME: need atomic operations */
> > > +						if (cdata->max_time < duration)
> > > +							cdata->max_time = duration;
> > > +						if (cdata->min_time > duration)
> > > +							cdata->min_time = duration;
> > > +					}
> > > +
> > > +					otdata->timestamp = pelem->timestamp;
> > > +					otdata->stack_id = id;
> > > +				}
> > > +			}
> > > +			/*
> > > +			 * If ctx[1] is 0, then update the tracing data with the current
> > > +			 * task, which is the new owner.
> > > +			 */
> > > +			else {
> > > +				otdata->pid = pid;
> > > +				otdata->timestamp = timestamp;
> > > +
> > > +				bpf_get_task_stack(bpf_get_current_task_btf(), buf,
> > > +						   max_stack * sizeof(unsigned long), 0);
> >
> > This would be meaningless since it's still in the contention path.
> > The current callstack will be the same as the waiter callstack. You'd
> > better just invalidate the callstack here and let the next waiter
> > update it.
>
> I wonder why this is meaningless. In this situation, the lock owner is
> transferred to the current task, and there is at least one more
> waiter, so the contention is still ongoing. `otdata` is for tracing the
> lock owner, so it should be correctly updated with the new owner,
> which is the current task.

Yep, but I meant it has the same callstack as the waiters and provides
no additional information about the owner.

Thanks,
Namhyung

> >
> > > +				otdata->stack_id = get_owner_stack_id(buf);
> > > +			}
> > > +		}
> > > +	}
> > > +skip_owner_end:
> > > +
> > >  	switch (aggr_mode) {
> > >  	case LOCK_AGGR_CALLER:
> > >  		key.stack_id = pelem->stack_id;
> > > --
> > > 2.48.1.362.g079036d154-goog
> > >
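For illustration, the invalidation suggested above could look roughly
like this (an untested sketch; it leans on the fact that negative stack
ids are already ignored when reporting in usermode):

	/* ctx[1] is 0: the current task acquired the lock and is the new owner. */
	else {
		otdata->pid = pid;
		otdata->timestamp = timestamp;
		/*
		 * Don't record the current (waiter) callstack as the owner
		 * stack; mark it invalid and let the next contention_begin()
		 * capture the real owner callstack.
		 */
		otdata->stack_id = -1;
	}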