On Fri, Sep 27, 2024 at 2:56 AM Liao Chang <liaochang1@xxxxxxxxxx> wrote: > > The uprobe handler allocates xol slot from xol_area and quickly release > it in the single-step handler. The atomic operations on the xol bitmap > and slot_count lead to expensive cache line bouncing between multiple > CPUs. Given the xol slot is on the hot path for uprobe and kretprobe > handling, the profiling results on some Arm64 machine show that nearly > 80% of cycles are spent on this. So optimizing xol slot usage will > become important to scalability after the Andrii's series land. > > This patch address this scalability issues from two perspectives: > > - Allocated xol slot is now saved in the thread associated utask data. > It allows to reuse it throughout the therad's lifetime. This avoid > the frequent atomic operation on the slot bitmap and slot_count of > xol_area data, which is the major negative impact on scalability. > > - A garbage collection routine xol_recycle_insn_slot() is introduced to > reclaim unused xol slots. utask instances that own xol slot but > haven't reclaimed them are linked in a linked list. When xol_area runs > out of slots, the garbage collection routine travel the list to free > one slot. Allocated xol slots is marked as unused in single-step > handler. While the marking relies on the refcount of utask instance, > due to thread can't run on multiple CPUs at same time, therefore, it > is unlikely CPUs take the refcount of same utask, minimizing cache > line bouncing. > > Upon thread exit, the utask is deleted from the linked-list, and the > associated xol slot will be free. > > v2->v1: > ------- > As suggested by Andi Kleen [1], the updates to the garbage collection > list of xol slots is not a common case. This revision replaces the > complex lockless RCU scheme with a simple raw spinlock. > > Here's an explanation of the locking and refcount update scheme in the > patch: > > - area->gc_lock protects the write operations on the area->gc_list. This > includes inserting slot into area->gc_list in xol_get_insn_slot() and > removing slot from area->gc_list in xol_recycle_insn_slot(). > > - utask->slot_ref is used to track the status of uprobe_task instance > associated insn_slot. It has three values, the value of 1 means the > slot is free to use or recycle. The value of 2 means the slot is in > use. The value of 0 means the slot is being recycled. This design > ensure that slots in use aren't recycled from GC list and that slots > being recycled aren't available for uprobe use. For example, > refcount_inc_not_zero() turns the value from 1 to 2 in uprobe BRK > handling, Using refcount_dec() to turn it from 2 to 1 during uprobe > single-step handling. Using refcount_dec_if_one() to turn the value > from 1 to 0 when recycling slot from GC list. > > [1] https://lore.kernel.org/all/ZuwoUmqXrztp-Mzh@tassilo/ > > Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx> > --- > include/linux/uprobes.h | 4 + > kernel/events/uprobes.c | 177 ++++++++++++++++++++++++++++++---------- > 2 files changed, 139 insertions(+), 42 deletions(-) > Liao, Assuming your ARM64 improvements go through, would you still need these changes? XOL case is a slow case and if possible should be avoided at all costs. If all common cases for ARM64 are covered through instruction emulation, would we need to add all this complexity to optimize slow case? [...]