On Mon, Aug 26, 2024 at 11:17:29AM -0400, Kent Overstreet wrote:
> On Mon, Aug 26, 2024 at 07:44:12AM GMT, Paul E. McKenney wrote:
> > I have gotten requests for a variant of SRCU that dispenses with the
> > read-side memory barriers, which would allow you to dispense with RCU.
> > Maybe an srcu_read_lock_lite() and srcu_read_unlock_lite(), similar to
> > the _nmisafe() variants. There is always a price, and in this case the
> > price would be that srcu_read_lock_lite() and srcu_read_unlock_lite()
> > be restricted to code regions where RCU is watching. But from what I
> > know, fs code must already abide by that restriction.
> >
> > Would that help?
>
> I don't think that helps here, but that is something I'd be interested
> in.
>
> What I would like for this is:
> - a single #define for RCU_NR_OLDSTATES for both RCU and SRCU
>
> - a way of getting the current RCU gp sequence number without a
>   function call, since we hit that on every single enqueue(). One of my
>   development versions added a function to RCU and SRCU for getting a
>   pointer to the internal sequence number, which we'd call at init
>   time; this lets us skip the function call and a branch.
>
> Another thing that might make the code a bit easier to reason about is
> if rcu_read_lock() also worked as an srcu_read_lock() - is that
> something the SRCU variant you were talking about previously would
> provide?

Yes, srcu_read_lock_lite() would return a value that you later pass to
the corresponding srcu_read_unlock_lite(), just as srcu_read_lock() and
srcu_read_unlock() do now. And they would have the same grace-period
algorithm, and thus the same value for NUM_ACTIVE_SRCU_POLL_OLDSTATE,
as requested above. But get_state_synchronize_srcu() is still a
function call. That said, the offer of a function that checks multiple
grace-period-state cookies in one call still stands.

> In my current version of the code, we call __get_state_synchronize_rcu()
> after we've disabled interrupts; we know the rcu gp seq isn't going to
> tick while we're in the critical path. But this doesn't apply if it's
> for SRCU, and I don't want to add if (src) srcu_read_lock() branches to
> it.

Actually, disabling interrupts does *not* prevent RCU's grace-period
sequence number from changing. For example, the following really can
happen:

o   RCU notices that the current grace period can end.

o   A CPU disables interrupts here.

o   A call to get_state_synchronize_rcu() returns a cookie
    corresponding to the end of the grace period following the
    current one.

o   RCU ends the current grace period, therefore updating the
    grace-period sequence number.

o   RCU starts a new grace period, therefore updating the
    grace-period sequence number once again.

o   RCU cannot complete this new grace period until that CPU
    re-enables interrupts, but it has already updated its grace-period
    sequence number not once, but twice.

So if your code knows that RCU's grace-period sequence number cannot
change while a given CPU has interrupts disabled, that code has a bug.
A low-probability bug, perhaps, but if your code is running on enough
systems, it will make its presence known.

> Not at all essential, the races that result from this are harmless, but
> if we e.g. decide it's worthwhile to only kick off a gp if it hasn't
> ticked (i.e. only kick rcu if we're still on the seq of the object
> being enqueued) it'd be nice.

Why not call get_state_synchronize_rcu(), and ask for a new grace period
if the value returned differs from that of the previous call?
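Something like the following rough sketch, using the existing polled
grace-period API. (The last_cookie variable and the maybe_kick_gp()
name are made up for illustration; your rcu_pending structure
presumably already has a home for the previously observed cookie, and
a real caller would need the appropriate locking around it.)

        /* Cookie observed on the previous call; illustrative only. */
        static unsigned long last_cookie;

        static void maybe_kick_gp(void)
        {
                unsigned long cookie = get_state_synchronize_rcu();

                /* Still on the same gp sequence, nothing to do. */
                if (cookie == last_cookie)
                        return;

                last_cookie = cookie;

                /* The sequence moved, so make sure a grace period is underway. */
                (void)start_poll_synchronize_rcu();
        }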
> > > On the enqueue side, which is the fastpath, this uses a cursor - we're
> > > not walking the radix tree every time.
> > >
> > > On the processing side, where we're kfree_bulk()ing entire radix tree
> > > nodes, the radix tree is going to have better cache locality than a
> > > list of pages.
> >
> > Interesting. What testing or analysis did you do in the course of
> > arriving at this conclusion? In the meantime I will assume that the
> > analysis involves your radix-tree nodes being more than one page in
> > size.
>
> No, the radix tree nodes are 512 bytes; that's been sufficient in my
> testing.
>
> (also, please don't refer to PAGE_SIZE outside of mm/ code without a
> _good_ reason; that's something we've been trying to clean up.)

Moving kfree_rcu() into mm/ will fix that problem. Plus we did discuss
this with the mm folks back in the day, especially about the GFP flags,
so apparently there was a sufficiently good reason. Maybe still is.

> I'll try to post some performance numbers when I have some time.

I look forward to seeing what you come up with.

> > It might or might not be reasonable, depending on what is going on in
> > the rest of the system. The current kfree_rcu() code can run the
> > system out of full pages, but it won't impede other code allocating
> > smaller blocks of memory. We could easily change it to allocate
> > individual rcu_head structures, but doing so would not necessarily be
> > a win in OOM conditions, again, depending on what else is going on.
>
> As long as the thread calling kvfree_rcu_mightsleep() can block without
> blocking memory reclaim it'll be safe. We might want to tweak the GFP
> flags so as to tell the allocator "don't try too hard, bail out so we
> can check if the gp has expired".

There can be quite a difference between "safe" and "does well in actual
workloads".

> > I took a quick look at __rcu_pending_enqueue() and my eyes are still
> > bleeding. Concatenating linked lists of pages is way simpler, way
> > faster, and way more robust.
>
> Funny, I had the same thoughts trying to read your code... :)

Amazing how much easier it is to generate new code than to understand
existing code, isn't it? ;-)

> But most of __rcu_pending_enqueue() is slowpaths; the fastpath is just
>
>       objs = get_object_radix(p, seq); /* lookup seq in p->objs */
>
>       *objs->cursor++ = ptr ?: head;
>       /* zero cursor if we hit the end of a radix tree node: */
>       if (!(((ulong) objs->cursor) & (GENRADIX_NODE_SIZE - 1)))
>               objs->cursor = NULL;
>       start_gp = !objs->nr;
>       objs->nr++;
>
> So I think the performance concerns are moot, and for robustness -
> memory allocation failure always turns into "use the linked lists of
> objects", which works similarly to the old code.

Bugs live on slowpaths as well as on fastpaths. In fact, they tend to
accumulate there.

> > > But yes, using it for call_rcu() would need more performance
> > > justification. I haven't seen workloads where call_rcu() performance
> > > particularly matters, so it's not something I'm pushing for - I
> > > included that because it's minimal code and other people might know
> > > of workloads where we do want it.
> >
> > Plus, unlike kfree_rcu() post-grace-period handling, call_rcu()
> > callback functions usually access the memory block passed to them,
> > which means that they are incurring that per-element cache miss in
> > any case.
>
> True. But this would allow us to prefetch those objects (several
> iterations in advance).
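Presumably something like this rough sketch, assuming the finished
objects for a given sequence number end up in a flat array. (The
objs[]/nr parameters, PREFETCH_DIST, and invoke_callbacks() are made
up for illustration; prefetch() is the helper from <linux/prefetch.h>.)

        #define PREFETCH_DIST 4 /* illustrative distance, would need tuning */

        static void invoke_callbacks(struct rcu_head **objs, size_t nr)
        {
                size_t i;

                for (i = 0; i < nr; i++) {
                        /* Warm up the cache a few objects ahead of the current one. */
                        if (i + PREFETCH_DIST < nr)
                                prefetch(objs[i + PREFETCH_DIST]);

                        /* By now, objs[i] has hopefully arrived in cache. */
                        objs[i]->func(objs[i]);
                }
        }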
I need to see a CPU on which this actually makes a significant
difference before adding this sort of complexity.

> > > > > Ideally we would be getting a callback every time a grace period
> > > > > completes for which we have objects, but that would require
> > > > > multiple rcu_heads in flight, and since the number of gp sequence
> > > > > numbers with uncompleted callbacks is not bounded, we can't do
> > > > > that yet.
> > > >
> > > > Doesn't the call from __rcu_pending_enqueue() to
> > > > process_finished_items() serve this purpose? After all, the case
> > > > that causes trouble is the one where lots of things are being very
> > > > frequently queued.
> > >
> > > No, because that's unpredictable, and we don't process pending items
> > > in enqueue() unless we know we can sleep (i.e., we don't do it if the
> > > caller is latency sensitive).
> >
> > It is possible to do this in an extremely lightweight fashion in the
> > common case.
>
> Just processing a few items? Hmm, would we want to though, when
> otherwise we'd be calling kfree_bulk()? I think icache locality means
> we'd generally prefer not to.

You might not want to yet, but you eventually would want this.

                                                        Thanx, Paul