On Mon, 2022-01-24 at 12:13 -0800, Andrii Nakryiko wrote:
> On Mon, Jan 24, 2022 at 11:19 AM Kui-Feng Lee <kuifeng@xxxxxx> wrote:
> >
> > Create an instance of btf for each worker thread, and add type info
> > to the local btf instance in the steal function of pahole without
> > acquiring the mutex. Once all worker threads have finished, merge
> > all per-thread btf instances into the primary btf instance.
> >
> > Signed-off-by: Kui-Feng Lee <kuifeng@xxxxxx>
> > ---
>
> BTW, I've already synced your btf__add_btf() optimization to Github,
> so you can bump pahole's libbpf submodule reference in the next
> revision. You'll get a bunch of deprecation warnings, so it would be
> great to fix those at the same time.

Got it!

...... cut .....

> > +static int pahole_thread_exit(struct conf_load *conf, void *thr_data)
> > +{
> > +	struct thread_data *thread = thr_data;
> > +
> > +	if (thread == NULL)
> > +		return 0;
> > +
> > +	/*
> > +	 * Here we will call btf__dedup() once we extend
> > +	 * btf__dedup().
> > +	 */
> > +
> > +	return 0;
> > +}
> > +
>
> You've ignored my question and suggestion to move btf__add_btf() into
> pahole_thread_exit. Can you please comment on why it's not a good
> idea, if you insist on not doing that?

Sorry for not replying to this part. I tried it yesterday and did not
see an obvious improvement; I guess the gain is too small to be
observed.

> > @@ -2819,8 +2886,8 @@ static enum load_steal_kind pahole_stealer(struct cu *cu,
> >
> >  	if (btf_encode) {
> >  		static pthread_mutex_t btf_lock = PTHREAD_MUTEX_INITIALIZER;
> > +		struct btf_encoder *encoder;
> >
> > -		pthread_mutex_lock(&btf_lock);
> >  		/*
> >  		 * FIXME:
> >  		 *
> > @@ -2828,21 +2895,55 @@ static enum load_steal_kind pahole_stealer(struct cu *cu,
> >  		 * point we'll have cu->elf setup...
> >  		 */
> >  		if (!btf_encoder) {
> > -			btf_encoder = btf_encoder__new(cu, detached_btf_filename, conf_load->base_btf, skip_encoding_btf_vars,
> > -						       btf_encode_force, btf_gen_floats, global_verbose);
> > +			pthread_mutex_lock(&btf_lock);
> > +			if (!btf_encoder) {
>
> are you trying to minimize lock contention here with the !btf_encoder
> check outside and inside the locked area? Why? It just adds more code
> nesting, and this locking can't be a real bottleneck. We are talking
> about very few threads being initialized. Please keep it simple:
> lock, check !btf_encoder and goto unlock if it's already set;
> otherwise proceed to initialization.

Got it!

.......................

> > +				if (btf_encoder && thr_data) {
> > +					struct thread_data *thread = (struct thread_data *)thr_data;
> > +
> > +					thread->encoder = btf_encoder;
> > +					thread->btf = btf_encoder__btf(btf_encoder);
>
> Can you summarize the relationship between thr_data, btf_encoder and
> thread->encoder? I'm re-reading this code (including the `if (thr_data)`
> piece below) over and over and can't figure out why the
> initialization pattern is so complicated.

thr_data keeps the per-thread information for the worker threads, so
every worker thread has its own instance holding an encoder. The main
thread uses btf_encoder to encode BTF, while thread->encoder is what a
worker thread uses to add type info. Once all worker threads have
finished, the main thread adds the btf instances behind the
thread->encoders to the btf instance of btf_encoder. However, I reuse
btf_encoder itself for the first worker thread that reaches this
function and creates btf_encoder, which avoids copying data for that
thread. Does it make sense? If so, I will put the explanation in the
code; a rough sketch of the comment is below.
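Roughly, the comment and layout I have in mind look like the sketch
below. The two fields follow the patch (thread->btf and
thread->encoder); the comment wording is only a draft, and struct btf
and struct btf_encoder come from libbpf's <bpf/btf.h> and pahole's
btf_encoder.h respectively:

	/*
	 * Per-worker state passed to pahole_stealer() and
	 * pahole_thread_exit() via thr_data.  The first worker that
	 * creates the shared btf_encoder reuses it directly
	 * (thread->encoder == btf_encoder), so its types need no extra
	 * copy.  Every other worker gets its own encoder, and the main
	 * thread merges each worker's BTF into btf_encoder's BTF once
	 * the workers have exited.
	 */
	struct thread_data {
		struct btf *btf;		/* BTF this worker adds types to */
		struct btf_encoder *encoder;	/* worker-local encoder, or the
						 * shared btf_encoder for the
						 * first worker */
	};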
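Also, to make sure I read the locking suggestion correctly, the
initialization would take the shape below. This is a standalone sketch
rather than the actual pahole_stealer() code, and btf_encoder__create()
is only a stand-in for the real btf_encoder__new() call with all its
arguments:

	#include <pthread.h>
	#include <stddef.h>

	struct btf_encoder;			/* opaque, as in pahole */

	static pthread_mutex_t btf_lock = PTHREAD_MUTEX_INITIALIZER;
	static struct btf_encoder *btf_encoder;

	/* Stub only; the real code calls btf_encoder__new(cu, ...) here. */
	static struct btf_encoder *btf_encoder__create(void)
	{
		return NULL;	/* NULL only because this is a stub */
	}

	static struct btf_encoder *get_or_create_encoder(void)
	{
		pthread_mutex_lock(&btf_lock);
		if (btf_encoder)
			goto out;	/* another thread already set it up */

		btf_encoder = btf_encoder__create();
	out:
		pthread_mutex_unlock(&btf_lock);
		return btf_encoder;
	}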
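For the merge done by the main thread after the workers exit, the plan
stays with libbpf's btf__add_btf(), roughly like the helper below.
merge_worker_btf() is only an illustrative name for this sketch, not
code from the patch:

	#include <bpf/btf.h>

	static int merge_worker_btf(struct btf *main_btf, struct btf *worker_btf)
	{
		/*
		 * btf__add_btf() appends every type from worker_btf to
		 * main_btf and remaps type IDs; it returns the new ID of
		 * the first appended type, or a negative error code.
		 */
		int first_id = btf__add_btf(main_btf, worker_btf);

		if (first_id < 0)
			return first_id;

		btf__free(worker_btf);	/* per-thread copy is no longer needed */
		return 0;
	}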