Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

On Thu, Jun 27, 2024 at 9:43 AM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
> >
> > On Mon, 24 Jun 2024 17:21:36 -0700
> > Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > > Anyways, under exclusive writer lock, we double-check that refcount
> > > didn't change and is still zero. If it is, we proceed with destruction,
> > > because at that point we have a guarantee that find_active_uprobe()
> > > can't successfully look up this uprobe instance, as it's going to be
> > > removed in destructor under writer lock. If, on the other hand,
> > > find_active_uprobe() managed to bump refcount from zero to one in
> > > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> > > write_lock(&uprobes_treelock), we'll deterministically detect this with an
> > > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> > > pretend like atomic_dec_and_test() never returned true. There is no
> > > resource freeing or any other irreversible action taken up till this
> > > point, so we just exit early.
> > >
> > > One tricky part in the above is actually two CPUs racing and dropping
> > > refcnt to zero, and then attempting to free resources. This can happen
> > > as follows:
> > >   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
> > >   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> > >     which point it decides that it needs to free uprobe as well.
> > >
> > > At this point both CPU #0 and CPU #1 will believe they need to destroy
> > > uprobe, which is obviously wrong. To prevent this situation, we augment
> > > refcount with epoch counter, which is always incremented by 1 on either
> > > get or put operation. This allows those two CPUs above to disambiguate
> > > who should actually free uprobe (it's the CPU #1, because it has
> > > up-to-date epoch). See comments in the code and note the specific values
> > > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
> > > a single atomic64_t is actually two sort-of-independent 32-bit counters
> > > that are incremented/decremented with a single atomic64_add_return()
> > > operation. Note also a small and extremely rare (and thus having no
> > > effect on performance) need to clear the highest bit every 2 billion
> > > get/put operations to prevent high 32-bit counter from "bleeding over"
> > > into lower 32-bit counter.
> >
> > I have a question here.
> > Is there any chance for CPU#1 to put the uprobe before CPU#0 gets
> > the uprobes_treelock, and free the uprobe before CPU#0 validates uprobe->ref
> > again? e.g.
> >
> > CPU#0                                                   CPU#1
> >
> > put_uprobe() {
> >         atomic64_add_return()
> >                                                         __get_uprobe();
> >                                                         put_uprobe() {
> >                                                                 kfree(uprobe)
> >                                                         }
> >         write_lock(&uprobes_treelock);
> >         atomic64_read(&uprobe->ref);
> > }
> >
> > I think it is a very rare case, but I could not find any code to prevent
> > this scenario.
> >
>
> Yes, I think you are right, great catch!
>
> I concentrated on preventing double kfree() in this situation, and
> somehow convinced myself that eager kfree() is fine. But I think I'll
> need to delay freeing, probably with RCU. The problem is that we can't
> use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has
> to be a sleepable variant of RCU. I'm thinking of using
> rcu_read_lock_trace(), the same variant of RCU we use for sleepable
> BPF programs (including sleepable uprobes). srcu might be too heavy
> for this.
>
> I'll try a few variants over the next few days and see how the
> performance looks.
>

So I think I'm going with the changes below, incorporated into this
patch (nothing else changes). __get_uprobe() doesn't need any added
RCU protection (we know that uprobe is alive). It's only put_uprobe()
that needs to guarantee RCU protection before we drop refcount all the
way until we know whether we are the winning destructor or not.

Good thing is that the changes are pretty minimal in code and also
don't seem to regress performance/scalability. So I'm pretty happy
about that, will send v2 soon.
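
Before the diff itself: to make the packed epoch + refcount scheme from the
commit message a bit easier to follow, here is a tiny standalone userspace
model. It's my own illustration, not kernel code; only the two constants
mirror UPROBE_REFCNT_GET/UPROBE_REFCNT_PUT from the patch, and the comments
describe the double-check under uprobes_treelock as I understand it. It
replays the CPU #0 vs CPU #1 race and shows how the epoch half disambiguates
the winning destructor:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* high 32 bits: epoch, bumped on every get *and* put; low 32 bits: refcount */
#define REFCNT_GET ((int64_t)((1ULL << 32) | 1))  /* epoch += 1, refcount += 1 */
#define REFCNT_PUT ((int64_t)0xffffffff)          /* epoch += 1, refcount -= 1 (via carry) */

static uint32_t refcount(int64_t v) { return (uint32_t)v; }
static uint32_t epoch(int64_t v)    { return (uint32_t)((uint64_t)v >> 32); }

int main(void)
{
	int64_t v = REFCNT_GET;            /* freshly created: epoch=1, refcount=1 */

	/* CPU #0 drops the last ref; its snapshot has refcount==0 at epoch 2 */
	int64_t cpu0_snap = v + REFCNT_PUT;
	assert(refcount(cpu0_snap) == 0 && epoch(cpu0_snap) == 2);

	/* before CPU #0 grabs uprobes_treelock, CPU #1 does get + put:
	 * refcount goes 0 -> 1 -> 0 again, but the epoch keeps advancing */
	int64_t cpu1_snap = cpu0_snap + REFCNT_GET + REFCNT_PUT;
	assert(refcount(cpu1_snap) == 0 && epoch(cpu1_snap) == 4);

	/* both CPUs observed refcount==0, but when each re-checks the counter
	 * under the treelock, only CPU #1's snapshot still matches the current
	 * value (up-to-date epoch), so only CPU #1 proceeds to free */
	printf("cpu0 saw epoch %u, cpu1 saw epoch %u -> CPU #1 wins\n",
	       epoch(cpu0_snap), epoch(cpu1_snap));
	return 0;
}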


diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 07ad8b2e7508..41d9e37633ca 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -56,6 +56,7 @@ struct uprobe {
        atomic64_t              ref;            /* see UPROBE_REFCNT_GET below */
        struct rw_semaphore     register_rwsem;
        struct rw_semaphore     consumer_rwsem;
+       struct rcu_head         rcu;
        struct list_head        pending_list;
        struct uprobe_consumer  *consumers;
        struct inode            *inode;         /* Also hold a ref to inode */
@@ -623,7 +624,7 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 #define UPROBE_REFCNT_GET ((1LL << 32) | 1LL)
 #define UPROBE_REFCNT_PUT (0xffffffffLL)

-/**
+/*
  * Caller has to make sure that:
  *   a) either uprobe's refcnt is positive before this call;
  *   b) or uprobes_treelock is held (doesn't matter if for read or write),
@@ -657,10 +658,26 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
        return !RB_EMPTY_NODE(&uprobe->rb_node);
 }

+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+
+       kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
        s64 v;

+       /*
+        * here uprobe instance is guaranteed to be alive, so we use Tasks
+        * Trace RCU to guarantee that uprobe won't be freed from under us, if
+        * we end up being a losing "destructor" inside uprobes_treelock'ed
+        * section double-checking uprobe->ref value below.
+        * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
+        */
+       rcu_read_lock_trace();
+
        v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);

        if (unlikely((u32)v == 0)) {
@@ -691,6 +708,8 @@ static void put_uprobe(struct uprobe *uprobe)
                        rb_erase(&uprobe->rb_node, &uprobes_tree);
                write_unlock(&uprobes_treelock);

+               rcu_read_unlock_trace();
+
                /* uprobe got resurrected, pretend we never tried to free it */
                if (!destroy)
                        return;
@@ -704,7 +723,7 @@ static void put_uprobe(struct uprobe *uprobe)
                delayed_uprobe_remove(uprobe, NULL);
                mutex_unlock(&delayed_uprobe_lock);

-               kfree(uprobe);
+               call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
                return;
        }

@@ -716,6 +735,8 @@ static void put_uprobe(struct uprobe *uprobe)
         */
        if (unlikely(v < 0))
                (void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
+
+       rcu_read_unlock_trace();
 }

 static __always_inline
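
The v < 0 check plus cmpxchg in the last hunk can be modeled the same way.
Again, this is just my userspace illustration of the "clear the highest bit
every ~2 billion get/put operations" note from the commit message (the
refcount value 5 is an arbitrary example), not the kernel code:

#include <assert.h>
#include <stdint.h>

int main(void)
{
	/* after ~2^31 get/put operations the epoch half sets bit 63 and the
	 * signed 64-bit counter turns negative (refcount here is 5) */
	int64_t v = (int64_t)((0x80000000ULL << 32) | 5);
	assert(v < 0);

	/* put_uprobe() spots v < 0 and clears bit 63 (with a cmpxchg in the
	 * real code, so a concurrent get/put can make it fail harmlessly);
	 * the low 32-bit refcount is untouched and the epoch simply continues
	 * with its top bit cleared */
	int64_t cleared = (int64_t)((uint64_t)v & ~(1ULL << 63));
	assert(cleared >= 0 && (uint32_t)cleared == 5);
	return 0;
}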



> > Thank you,
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
> >




