Hello,

I wanted to provide a bit of context about, and tie together, a few
separate work streams (across a few separate kernel trees) all revolving
around uprobe improvements, as there are a bunch of them and I'm sure
it's hard to keep track of all of them. And hopefully I can also get
Peter's and the ARM maintainers' input on some specific questions I ask
below. Thank you in advance!

In short, in the last few months there has been a lot of activity around
fixing and improving uprobes. All this is the result of increased and
more varied use of uprobes/uretprobes in production settings. Uprobe
performance is **very** important, and yes, we do have real use cases
that go to millions of uprobe/uretprobe triggerings per second,
unfortunately. So any small bit of performance and scalability
improvement is helpful. No, this isn't just some nerdy perf optimization
work (I've been asked this a few times, so I thought I'd emphasize this
again).

So, we've already landed a bunch of work, mainly (not an exhaustive
list):

  - various clean ups, API improvements, and bug fixes from Oleg
    Nesterov ([0], [1]). This simplified internal APIs and was a
    prerequisite for the rest of the work;

  - changes to refcounting and RCU-ifying of uprobe lifetime from me
    ([2]). This improved single-threaded performance somewhat, but
    mainly it significantly improved scalability in the presence of
    multiple CPUs triggering lots of uprobes;

  - ARM64-specific optimization of uprobe emulation of the NOP
    instruction by Liao Chang ([3]). This change alone gives a 2x (!)
    speed up for USDT tracing use cases *on ARM64* (we already have
    this optimization on x86-64);

  - there was slightly earlier work by Jiri Olsa ([4]) to add the
    uretprobe() syscall, giving +30% speed ups.

And there are a few more outstanding changes:

  - Jiri Olsa's uprobe "session" support ([5]). This is less
    performance focused, but important functionality by itself.
    But I'm calling this out here because the first two patches are pure
    uprobe-internal changes, and I believe they should go into
    tip/perf/core to avoid conflicts with the rest of the pending uprobe
    changes. Peter, do you mind applying those two and creating a stable
    tag for bpf-next to pull? We'll apply the rest of Jiri's series to
    bpf-next/master.

  - Liao Chang's ARM64-specific STP instruction emulation support ([6]).
    This one will give a 2x (!) improvement for the common case of an
    STP instruction being the first instruction in a traced user
    function (similar to NOP for USDTs). ARM64 maintainers (cc'ed
    Catalin, Will, and Mark), can you guys please take another look?
    This one was a bit more controversial, but hopefully there is a way
    to massage it to be acceptable and not introduce unnecessary
    slowdowns (there were some concerns about memory
    ordering/visibility, which hopefully don't apply to uprobe cases).
    It's an important improvement, and I'd really appreciate it if we
    can make progress here, thank you!

  - my speculative VMA-to-uprobe lookup series ([7]). This makes entry
    uprobes scale linearly with the number of CPUs (the ultimate goal
    of the uprobe scalability work). I think it's ready to go in. It
    has an **implicit** dependency on Christian Brauner's recent change
    for FMODE_BACKING, for which he provided a stable tag. Peter, do
    you have any remaining concerns, or can this also be merged soon?

  - another patch set of mine, switching the uretprobe fast path to
    SRCU (with timeout) ([8]). This makes return uprobes (uretprobes)
    linearly scalable in the common case (again, the ultimate
    scalability goal). I haven't gotten much feedback here and would
    love to get some objective review. This is an important counterpart
    to the speculative VMA-to-uprobe lookup series. Both are needed in
    practice.

  - patch set dropping unnecessary siglock usage in uprobes by Liao
    Chang ([9]).
    This one removes yet another lock, for the less common case (at
    least on x86-64) of a single-stepped uprobe (where the probed
    instruction can't be emulated). This one needs a rebase, but it was
    already acked by Oleg. Liao, please prioritize the rebase and send
    v4 ASAP, so this is not lost.

As you can see, lots of stuff needs to be landed and most of it is in
good shape already. I'd love to hear the thoughts of the relevant people
called out above, thank you!

  [0] https://lore.kernel.org/linux-trace-kernel/20240729134444.GA12293@xxxxxxxxxx/
  [1] https://lore.kernel.org/linux-trace-kernel/20240929144201.GA9429@xxxxxxxxxx/
  [2] https://lore.kernel.org/linux-trace-kernel/20240903174603.3554182-1-andrii@xxxxxxxxxx/
  [3] https://lore.kernel.org/linux-trace-kernel/20240909071114.1150053-1-liaochang1@xxxxxxxxxx/
  [4] https://lore.kernel.org/linux-trace-kernel/20240523121149.575616-1-jolsa@xxxxxxxxxx/
  [5] https://lore.kernel.org/bpf/20241015091050.3731669-1-jolsa@xxxxxxxxxx/
  [6] https://lore.kernel.org/linux-trace-kernel/20240910060407.1427716-1-liaochang1@xxxxxxxxxx/
  [7] https://lore.kernel.org/linux-trace-kernel/20241010205644.3831427-1-andrii@xxxxxxxxxx/
  [8] https://lore.kernel.org/linux-trace-kernel/20241008002556.2332835-1-andrii@xxxxxxxxxx/
  [9] https://lore.kernel.org/linux-trace-kernel/20240815014629.2685155-1-liaochang1@xxxxxxxxxx/

-- Andrii