On Sun, Nov 15, 2020 at 03:35:51PM -0500, Kent Overstreet wrote:
> On Sun, Nov 15, 2020 at 12:20:17PM -0800, Paul E. McKenney wrote:
> > On Sun, Nov 15, 2020 at 12:11:24PM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 12, 2020 at 03:15:47PM -0500, Kent Overstreet wrote:
> > > > Hi to Paul & the rcu mailing list,
> > > > 
> > > > I've got a problem I'm trying to figure out if I can adapt SRCU for.
> > > > 
> > > > Within bcachefs, currently struct btree objects, and now also btree_cached_key,
> > > > aren't freed until the filesystem is torn down, because btree iterators contain
> > > > pointers to them and will drop and retake locks (iff a sequence number hasn't
> > > > changed) without holding a ref. With the btree key cache code, this is now
> > > > something I need to fix.
> > > > 
> > > > What I plan on doing is having struct btree_trans (container for btree
> > > > iterators) hold an srcu read lock while it's alive. But I don't want to just
> > > > use call_srcu() to free the btree/btree_cached_key objects, because btree_trans
> > > > objects can at times be fairly long lived and the existing code can reuse these
> > > > objects for other btree nodes/btree cached keys immediately. Freeing them with
> > > > call_srcu() would break that; I could have my callback function check if the
> > > > object has been reused before freeing, but I'd still have a problem when the
> > > > object gets freed a second time before the first call_srcu() has finished.
> > > > 
> > > > What I'm wondering is if the SRCU code has a well defined notion of a clock that
> > > > I could make use of. What I would like to do is, instead of doing call_srcu() to
> > > > free the object, just mark that object with the current SRCU context time, and
> > > > then when my shrinkers run they can free objects that haven't been reused and
> > > > are old enough according to the current SRCU time.
> > > > 
> > > > Thoughts?
> > > 
> > > An early prototype is available on -rcu [1]. The Tree SRCU version
> > > seems to be reasonably reliable, but I do not yet trust the Tiny SRCU
> > > implementation. So please avoid !SMP&&!PREEMPT if you would like to
> > > give it a try. Unless you -really- like helping me find bugs, in which
> > > case full speed ahead!!!
> > > 
> > > Here is the API:
> > > 
> > > 	unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp)
> > > 
> > > 		Returns a "cookie" that can be thought of as a snapshot of your
> > > 		"clock" above. (SRCU calls it a "grace-period sequence number".)
> > > 		Also ensures that enough future grace periods happen to eventually
> > > 		make the grace-period sequence number reach the cookie.
> > > 
> > > 	bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie)
> > > 
> > > 		Given a cookie from start_poll_synchronize_srcu(), returns true if
> > > 		at least one full SRCU grace period has elapsed in the meantime.
> > > 		Given finite SRCU readers in a well-behaved kernel, the following
> > > 		code will complete in finite time:
> > > 
> > > 			cookie = start_poll_synchronize_srcu(&my_srcu);
> > > 			while (!poll_state_synchronize_srcu(&my_srcu, cookie))
> > > 				schedule_timeout_uninterruptible(1);
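A rough sketch of how those two calls could back the "mark with the current
SRCU time, free from the shrinker" scheme Kent describes above; the structure,
helpers, and field names below are invented for illustration, and locking is
omitted:

	#include <linux/list.h>
	#include <linux/slab.h>
	#include <linux/srcu.h>

	/* Invented for illustration; not bcachefs's actual data structures. */
	struct cached_obj {
		struct list_head	lru;
		unsigned long		free_cookie;	/* SRCU "clock" at retirement */
		bool			retired;
	};

	/* Object may still be visible to SRCU readers: stamp it, don't free it. */
	static void obj_retire(struct srcu_struct *ssp, struct cached_obj *obj)
	{
		obj->free_cookie = start_poll_synchronize_srcu(ssp);
		obj->retired = true;
	}

	/* Reuse cancels the pending free; no grace period is ever waited on. */
	static void obj_reuse(struct cached_obj *obj)
	{
		obj->retired = false;
	}

	/* Shrinker walk: free only what has aged past a full SRCU grace period. */
	static unsigned long obj_shrink(struct srcu_struct *ssp, struct list_head *list)
	{
		struct cached_obj *obj, *tmp;
		unsigned long freed = 0;

		list_for_each_entry_safe(obj, tmp, list, lru) {
			if (!obj->retired ||
			    !poll_state_synchronize_srcu(ssp, obj->free_cookie))
				continue;
			list_del(&obj->lru);
			kfree(obj);
			freed++;
		}
		return freed;
	}

Because reuse only clears the flag and retirement restamps the cookie, a hot
object never waits out a grace period, which is the property call_srcu() could
not provide.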
> > > 	unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp)
> > > 
> > > 		Like start_poll_synchronize_srcu(), except that it does not start
> > > 		any grace periods. This means that the following code is -not-
> > > 		guaranteed to complete:
> > > 
> > > 			cookie = get_state_synchronize_srcu(&my_srcu);
> > > 			while (!poll_state_synchronize_srcu(&my_srcu, cookie))
> > > 				schedule_timeout_uninterruptible(1);
> > > 
> > > 		Use this if you know that something else will be starting the
> > > 		needed SRCU grace periods. This might also be useful if you
> > > 		had items that were likely to be reused before the SRCU grace
> > > 		period elapsed, so that you avoid burning CPU on SRCU grace
> > > 		periods that prove to be unnecessary. Or if you don't want
> > > 		to have more than (say) 100 SRCU grace periods per second,
> > > 		in which case you might use a timer to start the grace periods.
> > > 		Or maybe you don't bother starting the SRCU grace period until
> > > 		some sort of emergency situation has arisen. Or...
> > > 
> > > 		OK, maybe you don't need it, but I do need it for rcutorture,
> > > 		so here it is anyway.
> > > 
> > > All of these can be invoked anywhere that call_srcu() can be invoked.
> > > 
> > > Does this look like it will work for you?
> 
> Yeah. My one quibble is, instead of having to call poll_state_synchronize_srcu()
> on every cookie, would it be possible to get a function that returns a cookie we
> can compare against (i.e. with time_after())? I'm going to be using this in the
> shrinker, where we have to walk and check potentially tens of thousands of
> objects.

If the cookies compare equal, poll_state_synchronize_srcu() will treat
them the same. If that does not help, could you please show me a code
snippet illustrating what you would like to do?

(Yes, even if equality comparison works, I probably need to give you an
API member just in case the nature of grace-period sequence numbers
changes in the future.)

> > Oh, and due to historical inertia, Tiny SRCU's grace-period sequence
> > number is only 16 bits. I can change this easily, but I need to know
> > that it is a real problem for you before I can do so.
> > 
> > The potential problem for you is that if you let a given cookie lie
> > dormant for 16384 grace periods, it will take another 16385 grace
> > periods for poll_state_synchronize_srcu() to say that a grace period
> > has elapsed.
> > 
> > In contrast, Tree SRCU's grace-period sequence number is either 32 bits
> > or 64 bits, depending on the size of unsigned long.
> 
> It's not something I'd lose sleep over, but I think it could be. If there isn't
> memory pressure, then the shrinker won't be running and we won't be freeing the
> objects with the oldest cookies, but freeing them internally will be creating
> new grace periods - but making sure we're reusing objects in LIFO order would
> also work against the shrinker actually being able to free any objects, so I'm
> not sure I want to do that...

OK, I will leave it, at least until you tell me otherwise. I probably need
to add a warning to the header comment, though...

							Thanx, Paul
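One way to address the "tens of thousands of objects" concern without a
time_after()-style comparison, given Paul's note that equal cookies are treated
the same by poll_state_synchronize_srcu(), is to cache the verdict for the most
recently checked cookie value during the walk. A minimal sketch, with all names
invented for illustration:

	#include <linux/srcu.h>

	/*
	 * Memoize poll_state_synchronize_srcu() across a scan: objects retired
	 * close together in time share a cookie value, and equal cookies always
	 * yield the same answer, so runs of such objects share a single call.
	 * A cached "not yet elapsed" result can go stale within a scan, which
	 * only delays freeing those objects to the next scan.
	 */
	struct cookie_cache {
		unsigned long	cookie;
		bool		expired;
		bool		valid;
	};

	static bool cookie_expired(struct srcu_struct *ssp, unsigned long cookie,
				   struct cookie_cache *cache)
	{
		if (!cache->valid || cache->cookie != cookie) {
			cache->cookie  = cookie;
			cache->expired = poll_state_synchronize_srcu(ssp, cookie);
			cache->valid   = true;
		}
		return cache->expired;
	}

The shrinker loop would declare a zero-initialized struct cookie_cache and call
cookie_expired() in place of poll_state_synchronize_srcu() for each candidate
object.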