On Wed, Sep 26, 2018 at 04:02:20PM -0600, Jason Gunthorpe wrote: > On Wed, Sep 26, 2018 at 02:08:15PM -0700, Matthew Wilcox wrote: > > On Wed, Sep 26, 2018 at 02:44:46PM -0600, Jason Gunthorpe wrote: > > > This usage of the radix tree is not using RCU. The tree is created at > > > the start of time and is used read-only until the final kref is put, > > > then it is destroyed. > > > > > > The race above cannot exist because the kref on file->device will not > > > go to zero while CPU A is running that code above. Specifically the > > > file's kref is held by a struct file * and the fs core will hold the > > > kref on struct file during execution of the ioctl file_operation. > > > > Ah, OK. That makes sense, but good luck explaining it to lockdep ;-) > > I haven't noticed anything weird with lockdep, where do you see it > will have trouble? What I mean is, you're defeating lockdep by passing "true" as the second argument. It can't validate that your assumptions are correct. I suppose instead of "true", you could write "kref > 0" or whatever. > > > I think the mistake here is the misleading use of srcu_dereference(): > > > > > > srcu_dereference( > > > *slot, > > > &pbundle->bundle.ufile->device->disassociate_srcu); > > > > > > That should just be rcu_dereference_protected(*slot, true), I think? > > > > aka rcu_dereference_raw()? Yes, that's probably better. > > I've been following the docs in Documentation/RCU/lockdep.txt: > > rcu_dereference_raw(p): > Don't check. (Use sparingly, if at all.) > rcu_dereference_protected(p, c): > Use explicit check expression "c", and omit all barriers > and compiler constraints. This is useful when the data > structure cannot change, for example, in code that is > invoked only by updaters. > > In this case 'the data structure cannot change' as we are protected > from that by kref. > > Do you know if the docs are wrong now and _raw() is now the right > choice? This is further down the rcu rabbithole than I'm comfortable with. I see _raw as "I know what I'm doing" ... I don't think anyone foresaw people using 'true' as the condition to _protected. I think they're equivalent and '_protected(x, true)' is just a more verbose way of spelling '_raw(x)'. > > > > I'm _really_ not a fan of the way you pull various things out of the > > > > radix_tree_iter and keep references to them in the pbundle. Can we > > > > figure out a better way to do this? I don't have a clear picture of > > > > your requirements at this stage. > > > > > > Yes, I figured that would come up with your xarray work. > > > > > > All that is needed here is a 'lookup with hint' API. Something like: > > > > > > struct iter last_position; > > > > > > lookup(0x120, &last_position); > > > lookup_hint(0x123, &last_position); > > > lookup_hint(0x124, &last_position); > > > > > > Where lookup_hint is fast if last_position already points at the > > > correct chunk. Otherwise it will search again and update > > > last_position. > > > > > > The last time I looked at the xarray code I thought this would be > > > straightforward to add as an API, and I was thinking of proposing > > > something like the above when xarray is finally merged. > > > > I can work with that. I'll send a suggested patch tomorrow. > > Probably something like void *xas_move(struct xa_state *, unsigned > > long index); I already have xas_next() and xas_prev() which do the > > moral equivalent of p = xa[++i] and p = xa[--i], so having an > > xas_move() would fit in. > > So we could write it as: > > XA_STATE(iter, radix, 0x120); > > value = xas_load(&iter); > value = xas_move(0x124, &iter); /* ie + 4 */ > value = xas_move(0x123, &iter); /* ie - 1 */ > > ? > > I guess it is worth pointing out the only reason it is like this is > performance on a system call path. The radix tree version is one > compare and an add in the typical case, it would be nice to stay > within a similar envelope. > > ie update of the state's offset and index on every lookup is not > something that is really needed for this use case. > > Also, currently the info is in the 'pbundle' thing.. Does this need > some rework to keep the XA_STATE separated on the stack? Or something > else? > > Are you looking at this because you plan to convert this code in the > next merge window? Do you have a way we can merge that conversion > through the RDMA git tree? I am willing to help with that.. Otherwise > I fear we will have monster conflicts for Linus. > > I was expecting to do the conversion in the cycle after xarray > lands... I wasn't expecting current users of the radix tree to help ;-) And I was getting a head start on some of the harder conversions so I could send them to maintainers early. But yes, my plan is to send the conversions to maintainers for merging after the base XArray code lands, not try and mass-convert radix trees to xarrays through my own tree. Meanwhile I've come up with another problem which is that the plan is to make xarray nodes movable. That means that *even if the xarray node is pinned* by your own code, the xarray may decide to relocate the entries if you're not holding the RCU read lock or the xarray lock. Do you really need the hint? For a three level raidx tree, that's only 6 additional cachelines. Prmtr optmstn s th rt f ll vl.