On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Mon, 24 Jun 2024 17:21:36 -0700
> Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> > Anyways, under the exclusive writer lock, we double-check that refcount
> > didn't change and is still zero. If it is, we proceed with destruction,
> > because at that point we have a guarantee that find_active_uprobe()
> > can't successfully look up this uprobe instance, as it's going to be
> > removed in the destructor under the writer lock. If, on the other hand,
> > find_active_uprobe() managed to bump the refcount from zero to one in
> > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> > write_lock(&uprobes_treelock), we'll deterministically detect this with
> > an extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> > pretend like atomic_dec_and_test() never returned true. There is no
> > resource freeing or any other irreversible action taken up till this
> > point, so we just exit early.
> >
> > One tricky part in the above is actually two CPUs racing and dropping
> > the refcnt to zero, and then attempting to free resources. This can
> > happen as follows:
> > - CPU #0 drops refcnt from 1 to 0 and proceeds to grab uprobes_treelock;
> > - before CPU #0 grabs the lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> >   which point it decides that it needs to free the uprobe as well.
> >
> > At this point both CPU #0 and CPU #1 will believe they need to destroy
> > the uprobe, which is obviously wrong. To prevent this situation, we
> > augment the refcount with an epoch counter, which is always incremented
> > by 1 on either a get or a put operation. This allows those two CPUs
> > above to disambiguate who should actually free the uprobe (it's CPU #1,
> > because it has the up-to-date epoch). See comments in the code and note
> > the specific values of the UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT
> > constants. Keep in mind that a single atomic64_t is actually two
> > sort-of-independent 32-bit counters that are incremented/decremented
> > with a single atomic64_add_return() operation. Note also a small and
> > extremely rare (and thus having no effect on performance) need to clear
> > the highest bit of the epoch counter every 2 billion get/put operations
> > to prevent it from "bleeding over" into the refcount counter.
>
> I have a question here.
> Is there any chance for CPU#1 to put the uprobe before CPU#0 gets
> the uprobes_treelock, and to free the uprobe before CPU#0 validates
> uprobe->ref again? e.g.
>
> CPU#0                                   CPU#1
>
> put_uprobe() {
>     atomic64_add_return()
>                                         __get_uprobe();
>                                         put_uprobe() {
>                                             kfree(uprobe)
>                                         }
>     write_lock(&uprobes_treelock);
>     atomic64_read(&uprobe->ref);
> }
>
> I think it is a very rare case, but I could not find any code to prevent
> this scenario.
>

Yes, I think you are right, great catch! I concentrated on preventing a
double kfree() in this situation, and somehow convinced myself that the
eager kfree() is fine. But I think I'll need to delay freeing, probably
with RCU. The problem is that we can't use
rcu_read_lock()/rcu_read_unlock(), because we take locks, so it has to
be a sleepable variant of RCU. I'm thinking of using
rcu_read_lock_trace(), the same variant of RCU we use for sleepable BPF
programs (including sleepable uprobes). SRCU might be too heavy for
this. I'll try a few variants over the next few days and see how the
performance looks.

> Thank you,
>
> --
> Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
>
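To make the scheme we're discussing easier to follow, here's a condensed
sketch of the epoch+refcount idea. It's not the actual patch: the bit
layout (epoch in the lower 32 bits, refcount in the upper 32 bits), the
exact constant values, and the simplified put path below are for
illustration only, and the eager kfree() at the end is exactly the step
your scenario hits.

/*
 * Illustrative sketch, not the actual patch. Assumed layout: lower 32
 * bits hold the epoch (bumped on every get *and* put), upper 32 bits
 * hold the refcount proper. A single atomic64_add_return() updates
 * both halves at once.
 */
#include <linux/atomic.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL)	/* refcount +1, epoch +1 */
#define UPROBE_REFCNT_PUT (1LL - (1LL << 32))	/* refcount -1, epoch +1 */

struct uprobe {
	atomic64_t	ref;	/* upper 32 bits: refcount, lower 32 bits: epoch */
	struct rb_node	rb_node;
	/* ... consumers, register_rwsem, etc. ... */
};

static struct rb_root uprobes_tree = RB_ROOT;
static DEFINE_RWLOCK(uprobes_treelock);

static struct uprobe *__get_uprobe(struct uprobe *uprobe)
{
	s64 v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);

	/*
	 * Roughly every 2 billion operations the epoch's top bit gets set;
	 * clear it so the epoch never carries into the refcount half.
	 */
	if (unlikely((u32)v & 0x80000000))
		atomic64_and(~0x80000000LL, &uprobe->ref);

	return uprobe;
}

static void put_uprobe(struct uprobe *uprobe)
{
	s64 v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);

	if ((u32)(v >> 32))	/* refcount still non-zero, nothing to free */
		return;

	/*
	 * Refcount hit zero in our epoch. Re-check under the writer lock:
	 * if the 64-bit value changed at all since our put (any get or put
	 * bumps the epoch), someone else resurrected the uprobe and is
	 * responsible for freeing it, so we back off.
	 */
	write_lock(&uprobes_treelock);
	if (atomic64_read(&uprobe->ref) != v) {
		write_unlock(&uprobes_treelock);
		return;
	}
	rb_erase(&uprobe->rb_node, &uprobes_tree);
	write_unlock(&uprobes_treelock);

	kfree(uprobe);		/* eager free -- the step the race above hits */
}

The single 64-bit addition is what lets both halves be updated
atomically without needing a wider compare-and-swap loop.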
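And here's roughly the delayed-freeing direction I have in mind, layered
on top of the same sketch (again just an illustration: uprobe_free_rcu()
is a made-up name, and whether it ends up being RCU Tasks Trace, SRCU,
or something else is exactly what I want to benchmark). The point is
that the lookup path and the zero-refcount put path both run inside a
sleepable RCU read-side section, and the final kfree() is deferred past
a grace period, so CPU #0 in your scenario can still safely re-read
uprobe->ref even after CPU #1 has scheduled the uprobe for freeing.

/*
 * Illustrative sketch of delayed freeing with RCU Tasks Trace; builds
 * on the previous sketch (same constants, tree and lock) and replaces
 * its put_uprobe(). uprobe_free_rcu() is a hypothetical helper name.
 */
#include <linux/rcupdate_trace.h>

struct uprobe {
	atomic64_t	ref;
	struct rb_node	rb_node;
	struct rcu_head	rcu;	/* for deferred freeing */
	/* ... */
};

static void uprobe_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct uprobe, rcu));
}

static void put_uprobe(struct uprobe *uprobe)
{
	s64 v;

	/*
	 * Keep the uprobe memory alive across the whole window between
	 * the refcount update and the re-check under uprobes_treelock.
	 */
	rcu_read_lock_trace();

	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
	if ((u32)(v >> 32)) {
		rcu_read_unlock_trace();
		return;
	}

	write_lock(&uprobes_treelock);
	if (atomic64_read(&uprobe->ref) != v) {
		/* somebody else won the race; they will free the uprobe */
		write_unlock(&uprobes_treelock);
		rcu_read_unlock_trace();
		return;
	}
	rb_erase(&uprobe->rb_node, &uprobes_tree);
	write_unlock(&uprobes_treelock);
	rcu_read_unlock_trace();

	/*
	 * The actual kfree() happens only after every CPU that might
	 * still be touching uprobe->ref (like CPU #0 in the scenario
	 * above) has left its rcu_read_lock_trace() section.
	 */
	call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
}

Taking the spinning write_lock() inside the trace-RCU reader is fine,
since trace-RCU read-side sections may block; it's the sleeping locks
taken elsewhere on these paths that rule out plain rcu_read_lock().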