On Mon, Aug 19, 2024 at 6:41 AM Oleg Nesterov <oleg@xxxxxxxxxx> wrote: > > On 08/12, Andrii Nakryiko wrote: > > > > Avoid taking refcount on uprobe in prepare_uretprobe(), instead take > > uretprobe-specific SRCU lock and keep it active as kernel transfers > > control back to user space. > ... > > include/linux/uprobes.h | 49 ++++++- > > kernel/events/uprobes.c | 294 ++++++++++++++++++++++++++++++++++------ > > 2 files changed, 301 insertions(+), 42 deletions(-) > > Oh. To be honest I don't like this patch. > > I would like to know what other reviewers think, but to me it adds too many > complications that I can't even fully understand... Which parts? The atomic xchg() and cmpxchg() parts? What exactly do you feel like you don't fully understand? > > And how much does it help performance-wise? A lot, as we increase uprobe parallelism. Here's a subset of benchmarks for 1-4, 8, 16, 32, 64, and 80 threads firing uretprobe. With and without this SRCU change, but including all the other changes, including the lockless VMA lookup. It's noticeable already with just two competing CPUs/threads, and it just gets much worse from there. Of course in production you shouldn't come close to such rates of uprobe/uretprobe firing, so this is definitely a microbenchmark emphasizing the sharing between CPUs, but it still adds up. And we do have production use cases that would like to fire uprobes at 100K+ per second rates. WITH SRCU for uretprobes ======================== uretprobe-nop ( 1 cpus): 1.968 ± 0.001M/s ( 1.968M/s/cpu) uretprobe-nop ( 2 cpus): 3.739 ± 0.003M/s ( 1.869M/s/cpu) uretprobe-nop ( 3 cpus): 5.616 ± 0.003M/s ( 1.872M/s/cpu) uretprobe-nop ( 4 cpus): 7.286 ± 0.002M/s ( 1.822M/s/cpu) uretprobe-nop ( 8 cpus): 13.657 ± 0.007M/s ( 1.707M/s/cpu) uretprobe-nop (32 cpus): 45.305 ± 0.066M/s ( 1.416M/s/cpu) uretprobe-nop (64 cpus): 42.390 ± 0.922M/s ( 0.662M/s/cpu) uretprobe-nop (80 cpus): 47.554 ± 2.411M/s ( 0.594M/s/cpu) WITHOUT SRCU for uretprobes =========================== uretprobe-nop ( 1 cpus): 2.197 ± 0.002M/s ( 2.197M/s/cpu) uretprobe-nop ( 2 cpus): 3.325 ± 0.001M/s ( 1.662M/s/cpu) uretprobe-nop ( 3 cpus): 4.129 ± 0.002M/s ( 1.376M/s/cpu) uretprobe-nop ( 4 cpus): 6.180 ± 0.003M/s ( 1.545M/s/cpu) uretprobe-nop ( 8 cpus): 7.323 ± 0.005M/s ( 0.915M/s/cpu) uretprobe-nop (16 cpus): 6.943 ± 0.005M/s ( 0.434M/s/cpu) uretprobe-nop (32 cpus): 5.931 ± 0.014M/s ( 0.185M/s/cpu) uretprobe-nop (64 cpus): 5.145 ± 0.003M/s ( 0.080M/s/cpu) uretprobe-nop (80 cpus): 4.925 ± 0.005M/s ( 0.062M/s/cpu) > > I'll try to take another look, and I'll try to think about other approaches, > not that I have something better in mind... Ok. > > > But lets forgets this patch for the moment. The next one adds even more > complications, and I think it doesn't make sense. > "Even more complications" is a bit of an overstatement. It just applies everything we do for uretprobes in this patch to a very straightforward single-stepped case. > As I have already mentioned in the previous discussions, we can simply kill > utask->active_uprobe. And utask->auprobe. I don't have anything against that, in principle, but let's benchmark and test that thoroughly. I'm a bit uneasy about the possibility that some arch-specific code will do container_of() on this arch_uprobe in order to get to uprobe, we'd need to audit all the code to make sure that can't happen. Also it's a bit unfortunate that we have to assume that struct arch_uprobe is small on all architectures, and there is no code that assumes it can't be moved, etc, etc. (I also don't get why you need memcpy > > So can't we start with the patch below? On top of your 08/13. It doesn't kill > utask->auprobe yet, this needs a bit more trivial changes. > > What do you think? I think that single-stepped case isn't the main use case (typically uprobe/uretprobe will be installed on nop or push %rbp, both emulated). uretprobes, though, are the main use case (along with optimized entry uprobes). So what we do about single-stepped is a bit secondary (for me, looking at production use cases). But we do need to do something with uretprobes first and foremost. > > Oleg. > > ------------------------------------------------------------------------------- > From d7cb674eb6f7bb891408b2b6a5fb872a6c2f0f6c Mon Sep 17 00:00:00 2001 > From: Oleg Nesterov <oleg@xxxxxxxxxx> > Date: Mon, 19 Aug 2024 15:34:55 +0200 > Subject: [RFC PATCH] uprobe: kill uprobe_task->active_uprobe > > Untested, not for inclusion yet, and I need to split it into 2 changes. > It does 2 simple things: > > 1. active_uprobe != NULL is possible if and only if utask->state != 0, > so it turns the active_uprobe checks into the utask->state checks. > > 2. handle_singlestep() doesn't really need ->active_uprobe, it only > needs uprobe->arch which is "const" after prepare_uprobe(). > > So this patch adds the new "arch_uprobe uarch" member into utask > and changes pre_ssout() to do memcpy(&utask->uarch, &uprobe->arch). > --- > include/linux/uprobes.h | 2 +- > kernel/events/uprobes.c | 37 +++++++++++-------------------------- > 2 files changed, 12 insertions(+), 27 deletions(-) [...]