On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote: > > On Mon, 24 Jun 2024 17:21:36 -0700 > Andrii Nakryiko <andrii@xxxxxxxxxx> wrote: > > > Anyways, under exclusive writer lock, we double-check that refcount > > didn't change and is still zero. If it is, we proceed with destruction, > > because at that point we have a guarantee that find_active_uprobe() > > can't successfully look up this uprobe instance, as it's going to be > > removed in destructor under writer lock. If, on the other hand, > > find_active_uprobe() managed to bump refcount from zero to one in > > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and > > write_lock(&uprobes_treelock), we'll deterministically detect this with > > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we > > pretend like atomic_dec_and_test() never returned true. There is no > > resource freeing or any other irreversible action taken up till this > > point, so we just exit early. > > > > One tricky part in the above is actually two CPUs racing and dropping > > refcnt to zero, and then attempting to free resources. This can happen > > as follows: > > - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock; > > - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at > > which point it decides that it needs to free uprobe as well. > > > > At this point both CPU #0 and CPU #1 will believe they need to destroy > > uprobe, which is obviously wrong. To prevent this situations, we augment > > refcount with epoch counter, which is always incremented by 1 on either > > get or put operation. This allows those two CPUs above to disambiguate > > who should actually free uprobe (it's the CPU #1, because it has > > up-to-date epoch). See comments in the code and note the specific values > > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that > > a single atomi64_t is actually a two sort-of-independent 32-bit counters > > that are incremented/decremented with a single atomic_add_and_return() > > operation. Note also a small and extremely rare (and thus having no > > effect on performance) need to clear the highest bit every 2 billion > > get/put operations to prevent high 32-bit counter from "bleeding over" > > into lower 32-bit counter. > > I have a question here. > Is there any chance to the CPU#1 to put the uprobe before CPU#0 gets > the uprobes_treelock, and free uprobe before CPU#0 validate uprobe->ref > again? e.g. > > CPU#0 CPU#1 > > put_uprobe() { > atomic64_add_return() > __get_uprobe(); > put_uprobe() { > kfree(uprobe) > } > write_lock(&uprobes_treelock); > atomic64_read(&uprobe->ref); > } > > I think it is very rare case, but I could not find any code to prevent > this scenario. > Yes, I think you are right, great catch! I concentrated on preventing double kfree() in this situation, and somehow convinced myself that eager kfree() is fine. But I think I'll need to delay freeing, probably with RCU. The problem is that we can't use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has to be a sleepable variant of RCU. I'm thinking of using rcu_read_lock_trace(), the same variant of RCU we use for sleepable BPF programs (including sleepable uprobes). srcu might be too heavy for this. I'll try a few variants over the next few days and see how the performance looks. > Thank you, > > > -- > Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx> >