On Tue, Jul 02, 2024 at 09:47:41PM -0700, Andrii Nakryiko wrote: > > As you noted, that percpu-rwsem write side is quite insane. And you're > > creating this batch complexity to mitigate that. > > > Note that batch API is needed regardless of percpu RW semaphore or > not. As I mentioned, once uprobes_treelock is mitigated one way or the > other, the next one is uprobe->register_rwsem. For scalability, we > need to get rid of it and preferably not add any locking at all. So > tentatively I'd like to have lockless RCU-protected iteration over > uprobe->consumers list and call consumer->handler(). This means that > on uprobes_unregister we'd need synchronize_rcu (for whatever RCU > flavor we end up using), to ensure that we don't free uprobe_consumer > memory from under handle_swbp() while it is actually triggering > consumers. > > So, without batched unregistration we'll be back to the same problem > I'm solving here: doing synchronize_rcu() for each attached uprobe one > by one is prohibitively slow. We went through this exercise with > ftrace/kprobes already and fixed it with batched APIs. Doing that for > uprobes seems unavoidable as well. I'm not immediately seeing how you need that terrible refcount stuff for the batching though. If all you need is group a few unregisters together in order to share a sync_rcu() that seems way overkill. You seem to have muddled the order of things, which makes the actual reason for doing things utterly unclear.