On Wed, Jul 3, 2024 at 1:07 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jul 02, 2024 at 09:47:41PM -0700, Andrii Nakryiko wrote:
>
> > > As you noted, that percpu-rwsem write side is quite insane. And you're
> > > creating this batch complexity to mitigate that.
> >
> > Note that the batch API is needed regardless of the percpu RW
> > semaphore. As I mentioned, once uprobes_treelock is mitigated one
> > way or the other, the next one is uprobe->register_rwsem. For
> > scalability, we need to get rid of it and preferably not add any
> > locking at all. So tentatively I'd like to have lockless
> > RCU-protected iteration over the uprobe->consumers list and call
> > consumer->handler(). This means that on uprobes_unregister we'd
> > need synchronize_rcu() (for whatever RCU flavor we end up using)
> > to ensure that we don't free uprobe_consumer memory from under
> > handle_swbp() while it is actually triggering consumers.
> >
> > So, without batched unregistration we'll be back to the same
> > problem I'm solving here: doing synchronize_rcu() for each
> > attached uprobe one by one is prohibitively slow. We went through
> > this exercise with ftrace/kprobes already and fixed it with
> > batched APIs. Doing that for uprobes seems unavoidable as well.
>
> I'm not immediately seeing how you need that terrible refcount stuff for

Which part is terrible? Please be more specific. I can switch to
refcount_inc_not_zero() and leave the performance improvement on the
table, but why is that a good idea?

> the batching though. If all you need is group a few unregisters
> together in order to share a sync_rcu() that seems way overkill.
>
> You seem to have muddled the order of things, which makes the actual
> reason for doing things utterly unclear.

See the -EAGAIN handling in the uprobe_register() code in the current
upstream kernel. We manage to allocate and insert (or update an
existing) uprobe in uprobes_tree, and then, when we try to register,
we can detect post factum that the uprobe was removed from the RB
tree from under us. And we have to retry, allocating/inserting/
updating it again.

This is quite problematic for a batched API, in which I split the
whole attachment into a few independent phases:
  - preallocate uprobe instances (for all consumers/uprobes);
  - insert them or reuse pre-existing ones (again, for all consumers
    in one batch, protected by a single writer lock on
    uprobes_treelock);
  - then register/apply for each VMA (you get it, for all consumers
    in one go).

Having to retry some of the uprobes because of this race is hugely
problematic, so I wanted to make it cleaner and simpler: once I
manage to insert/reuse a uprobe, it's not going away from under me.
Which is why the change to the refcounting schema. And I think it's
a major improvement. We can argue about refcount_inc_not_zero() vs
this custom refcounting schema, but I think the change should be
made.

Now, imagine I had also done all the seqcount and RCU stuff across
the entire uprobe functionality. Wouldn't that be a bit mind-bending
to wrap your head around?
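
To make the above concrete, a few sketches follow. All names, types
and signatures in them are illustrative, not actual kernel code.
First, what I mean by lockless RCU-protected iteration over
uprobe->consumers, assuming the consumers list becomes an
RCU-protected list_head (handler return value handling elided):

#include <linux/rculist.h>
#include <linux/rcupdate.h>

/* Walk consumers under RCU instead of uprobe->register_rwsem. */
static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
{
	struct uprobe_consumer *uc;

	rcu_read_lock();
	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node)
		uc->handler(uc, regs);
	rcu_read_unlock();
}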
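
Which is exactly why unregistration has to be batched: unlink
everything first, then pay for a single grace period. Roughly, with
remove_consumer() and put_uprobe() standing in for the real
unlink/free steps:

void uprobe_unregister_batch(struct uprobe **uprobes,
			     struct uprobe_consumer **ucs, int cnt)
{
	int i;

	/* unlink all consumers from their uprobes first */
	for (i = 0; i < cnt; i++)
		remove_consumer(uprobes[i], ucs[i]);

	/* one grace period for the whole batch, not one per uprobe */
	synchronize_rcu();

	for (i = 0; i < cnt; i++)
		put_uprobe(uprobes[i]);
}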
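
As for the refcounting, this is the race in a nutshell. With plain
refcount_inc_not_zero(), a lookup-then-get can observe a uprobe whose
count already dropped to zero, and the only option is to start over:

#include <linux/refcount.h>

static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
{
	if (refcount_inc_not_zero(&uprobe->ref))
		return uprobe;
	/*
	 * Count already hit zero: a racing unregister won and is
	 * removing this uprobe from uprobes_tree. The caller has to
	 * redo the whole alloc/insert dance, i.e. the -EAGAIN retry.
	 */
	return NULL;
}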
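
And this is roughly the shape of the three attachment phases listed
above; uprobe_desc, alloc_uprobe() and insert_or_reuse() are
stand-ins, and error handling is elided:

int uprobe_register_batch(struct uprobe_desc *descs, int cnt)
{
	int i;

	/* phase 1: preallocate all uprobe instances, no locks held */
	for (i = 0; i < cnt; i++)
		descs[i].uprobe = alloc_uprobe(descs[i].inode,
					       descs[i].offset);

	/* phase 2: insert or reuse, one writer section for everything */
	write_lock(&uprobes_treelock);
	for (i = 0; i < cnt; i++)
		descs[i].uprobe = insert_or_reuse(descs[i].uprobe);
	write_unlock(&uprobes_treelock);

	/* phase 3: register/apply for each VMA, all in one go */
	for (i = 0; i < cnt; i++)
		register_for_each_vma(descs[i].uprobe, descs[i].uc);

	return 0;
}

It's phase 2 that the -EAGAIN race would keep kicking us out of,
which is why I want the guarantee that an inserted/reused uprobe
stays put.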