Re: [PATCH dwarves v2 3/3] pahole: Use per-thread btf instances to avoid mutex locking.

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 25 Jan 2022 13:49:56 -0800

On Tue, Jan 25, 2022 at 11:38 AM Kui-Feng Lee <kuifeng@xxxxxx> wrote:
>
> On Mon, 2022-01-24 at 12:13 -0800, Andrii Nakryiko wrote:
> > On Mon, Jan 24, 2022 at 11:19 AM Kui-Feng Lee <kuifeng@xxxxxx> wrote:
> > >
> > > Create an instance of btf for each worker thread, and add type info
> > > to
> > > the local btf instance in the steal-function of pahole without mutex
> > > acquiring.  Once finished with all worker threads, merge all
> > > per-thread btf instances to the primary btf instance.
> > >
> > > Signed-off-by: Kui-Feng Lee <kuifeng@xxxxxx>
> > > ---
> >
> > BTW, I've already synced your btf__add_btf() optimization to Github,
> > so you can bump pahole libbpf submodule reference in the next
> > revision. You'll get a bunch of deprecation warnings, so it would be
> > great to fix those at the same time.
>
> Got it!
>
> ...... cut .....
> > >
> > > +static int pahole_thread_exit(struct conf_load *conf, void
> > > *thr_data)
> > > +{
> > > +       struct thread_data *thread = thr_data;
> > > +
> > > +       if (thread == NULL)
> > > +               return 0;
> > > +
> > > +       /*
> > > +        * Here we will call btf__dedup() here once we extend
> > > +        * btf__dedup().
> > > +        */
> > > +
> > > +       return 0;
> > > +}
> > > +
> > >
> > You've ignored my question and suggestion to move btf__add_btf() into
> > pahole_thread_exit. Can you please comment on why it's not a good
> > idea, if you insist on not doing that?
>
> Sorry for not replying for this part.
> I tried it yesterday, and I don't see obvious improvement.  I guess the
> improvement is not big enough to be observed.

I see. It still seems a good move anyways, because those per-thread
BTF instances are owned by each thread, so when the thread is done
it's cleaner to merge it into common BTF and free up local BTF,
instead of keeping it around for longer than necessary.

>
> > > @@ -2819,8 +2886,8 @@ static enum load_steal_kind
> > > pahole_stealer(struct cu *cu,
> > >
> > >         if (btf_encode) {
> > >                 static pthread_mutex_t btf_lock =
> > > PTHREAD_MUTEX_INITIALIZER;
> > > +               struct btf_encoder *encoder;
> > >
> > > -               pthread_mutex_lock(&btf_lock);
> > >                 /*
> > >                  * FIXME:
> > >                  *
> > > @@ -2828,21 +2895,55 @@ static enum load_steal_kind
> > > pahole_stealer(struct cu *cu,
> > >                  * point we'll have cu->elf setup...
> > >                  */
> > >                 if (!btf_encoder) {
> > > -                       btf_encoder = btf_encoder__new(cu,
> > > detached_btf_filename, conf_load->base_btf, skip_encoding_btf_vars,
> > > -
> > > btf_encode_force, btf_gen_floats, global_verbose);
> > > +                       pthread_mutex_lock(&btf_lock);
> > > +                       if (!btf_encoder) {
> >
> > are you trying to minimize lock contention here with !btf_encoder
> > check outside and inside locked area? Why? It just adds more code
> > nesting and this locking can't be a real bottleneck. We are talking
> > about a very few threads being initialized. Please keep it simple,
> > lock, check !btf_encode and goto unlock, if it's already set.
> > Otherwise proceed to initialization.
>
> Got it!
>
> .......................
> > > +                               if (btf_encoder && thr_data) {
> > > +                                       struct thread_data *thread =
> > > (struct thread_data *)thr_data;
> > > +
> > > +                                       thread->encoder =
> > > btf_encoder;
> > > +                                       thread->btf =
> > > btf_encoder__btf(btf_encoder);
> >
> > Can you summarize the relationship between thr_data, btf_encode and
> > thread->encoder? I'm re-reading this code (including `if (thr_data)`
> > piece below) over and over and can't figure out why the initialization
> > pattern is so complicated.
>
> thr_data keeps per-thread information to worker threads. So, every
> worker thread has its instance to keep an encoder. The main thread use
> btf_encoder to encode BTF. thread->encoder is for a worker thread to
> add type info. Once finishing all worker threads, the main thread will
> add the btf instances of thread->encoders to the btf instance of
> btf_encoder. However, I reuse btf_encoder for the first worker thread,
> which reaches this function to create btf_encoder. It avoids copying
> data for the first worker thread.
>
> Does it make sense? If so, I will put the explanation in the code.
>

Oh, I see, thanks, it makes it a bit clearer. Yes, please leave a
comment describing this, thanks!