On Thu, Jun 13, 2024 at 08:22:19PM -0700, Paul E. McKenney wrote: > On Thu, Jun 13, 2024 at 02:32:27PM -0400, Kent Overstreet wrote: > > On Wed, Jun 12, 2024 at 08:48:40PM -0700, Paul E. McKenney wrote: > > > The number two is the number of distinct values, which will not be > > > consecutive numbers. As you noticed, they currently are always multiples > > > of four, but that is an accident of implementation that is subject to > > > change. > > > > I take it you're stuffing something else into the low bits and masking > > it off, and you can't shift that off because that would break > > comparisons. That's a tiny bit unfortunate; it would be nicer if we > > could do array indexing based on the sequence number. > > That was my initial thought as well, and so there was in fact a time > when RCU grace periods were numbered consecutively. The problem with > that was that it was necessary to maintain a separate indication of > whether or not a grace period was in progress. It is often necessary > to atomically determine both the grace-period number and whether that > grace period is still in progress, which required acquiring a lock, > which necessitated a number of contention-reduction workarounds which > eventually became problematic. > > When led to the next phase, which reserved the bottom bit to indicate > whether a grace period was in progress and left the remaining bits to > count the grace periods. This greatly simplified the code and likely > also improve performance and scalability. > > Then I implemented Tree SRCU, which requires multiple grace-period phases, > hence the current state with the two bottom bits indicating grace=-period > phase and the remaining bits counting the grace periods. Hence your > seeing get_state_synchronize_srcu() always returning a multiple of > four ("grace period N has not yet started"). I've been wondering about your tree algorithm, ever since I realized there's an elegant algorithm for ticking a counter when every counter in a tree of sub counters has ticked. > > What if we added something like this? > > > > #define RCU_SEQ_SKIP_SHIFT 2 > > Right now, get_state_synchronize_srcu() returns the grace-period sequence > number from the root of the tree. This has been working fine thus far, > but it is quite possible that memory-contention issues will force me to > get the value from the leaves of this tree. If that happens, the value > of NUM_ACTIVE_SRCU_POLL_OLDSTATE will change from 2 to 3. > > In addition, vanilla RCU users are looking for reduced synchronize_rcu() > latency, which in the past resulted in the addition of expedited > grace periods, which have their own counter. Similar pressure on SRCU > could result in overlapping computed grace periods, which would use > additional bits and make the cookies only partially ordered (as the RCU > rcu_gp_oldstate structure variant of cookie already is). > > And those are just a couple of changes that I can foresee. There might > be others. > > So, annoying thought it might be, we really need for this to remain an > opaque cookie. Ok, I can live with that. > > > > I could use a doubly linked list for tracking all uncompleted objects, > > > > but I'm hoping use a darray per sequence number. Is it a max of two > > > > sequence numbers that had objects/callbacks/were requested pulling, > > > > or...? > > > > > > Agreed, an array should work. Each array element would need a field > > > recording its cookie. Right, so you likely need a few more simple > > > functions: > > > > > > same_state_synchronize_srcu(): Equality comparison of the > > > pair of cookies passed in. > > > get_completed_synchronize_srcu(): Return a special cookie > > > that, when passed to poll_state_synchronize_srcu(), > > > always results in a @true return value. > > > > > > Or maybe not, you tell me. > > > > I don't think so, but I'm not sure what your intended use is. > > For same_state_synchronize_srcu(), places where you likely planned to > instead use C-language comparisons. For get_completed_synchronize_srcu(), > as a way of marking an unused entry, for which it looks like you are > currently using ->rcu_armed. I'm definitely not following how you want that done then :) > > dynamic array, basically a c++ vector - fs/bcachefs/darray.h > > So the darray is to hold the callbacks themselves as opposed to > the per-unexpired-cookie-value groups of callbacks? For my intended use case, a struct pending_rcu_items is only for objects of the same type that all have the same callback (otherwise, allocating from pending objects would have to search for an object of a given type). But if we want to use this for the call_rcu()/call_srcu() mechanism, then yes - the darray would have elements of type struct rcu_pending_cb { void (*func)(struct rcu_head *); struct rcu_head *obj; }; which would save us a dependent load on processing callbacks. > > Think this would be worth factoring out and turning into something > > generic? If RCU wanted to use this internally for tracking pending > > callbacks, we could get rid of linked list overhead there... > > For the kfree_rcu() path, we have a pages-of-pointers approach. For the > call_rcu() path, the callbacks almost always access the memory, so it > is not clear that the linked-list overhead is an issue. You get that > cold-cache miss either way. So kfree_rcu() is basically implementing the same thing? yes, add_ptr_to_bulk_krc_lock(). The pages-of-pointers does avoid the realloc/memcpy() overhead - which implies I'll want to switch mine from a darray to a generic radix tree. I think call_rcu() would benefit from this approach as well - it's true that we can't avoid cache misses, but linked lists are completely serialized, so those cache misses hurt; using a vector or radix tree means we can prefetch.