On Fri, Sep 28, 2018 at 04:19:11PM -0600, Jason Gunthorpe wrote:
> On Fri, Sep 28, 2018 at 02:24:40PM -0700, Matthew Wilcox wrote:
> They are not the same..
>
> _raw has barriers, while _protected does not.
>
> ie _raw is the same as rcu_dereference except that it doesn't lockdep
> check the caller is in a RCU critical region.
>
> It looks like _protected is supposed to be used if the caller has
> excluded writers (ie this case), and since we are excluding by control
> flow not locking the 'true' is appropriate. 'git grep' shows it is
> pretty common to pass true here, maybe even the most common usage..

Yes, there do seem to be a lot of callers with 'true' as the second
parameter.

> > I wasn't expecting current users of the radix tree to help ;-) And I was
> > getting a head start on some of the harder conversions so I could send
> > them to maintainers early. But yes, my plan is to send the conversions
> > to maintainers for merging after the base XArray code lands, not try and
> > mass-convert radix trees to xarrays through my own tree.
>
> Sure, let me know, I can help you out with rdma.

I appreciate it! And I appreciate the discussion here ...

> > Meanwhile I've come up with another problem, which is that the plan is to
> > make xarray nodes movable. That means that *even if the xarray node is
> > pinned* by your own code, the xarray may decide to relocate the entries
> > if you're not holding the RCU read lock or the xarray lock.
>
> This is Christoph's work?

Yes, Christoph and I have been working on it for almost two years ...
while getting distracted with other projects.

> Ah, I see, xarray has to hold a RCU critical section all over the
> place and radix tree did not.. Hum, I think uverbs is OK with that
> except for this one case.

Right, the radix tree said "the locking is up to the callers" whereas the
XArray defines what its locking is.

> > Do you really need the hint? For a three level radix tree, that's only
> > 6 additional cachelines.
Premature optimization is the root of all evil.

> Sounds like it is some SMP barriers now that xarray requires RCU as
> well as some function calls and atomics? Looped 10-20 times per system
> call :( already hot cache lines seem like the least worry here.

Umm, I'm not sure I see the atomics in the xa_load path.

void *xa_load(struct xarray *xa, unsigned long index)
{
	XA_STATE(xas, xa, index);
	void *entry;

	rcu_read_lock();
	do {
		entry = xas_load(&xas);
	} while (xas_retry(&xas, entry));
	rcu_read_unlock();

	return entry;
}

Other than debugging code, there are no atomics in rcu_read_lock().

void *xas_load(struct xa_state *xas)
{
	void *entry = xas_start(xas);

	while (xa_is_node(entry)) {
		struct xa_node *node = xa_to_node(entry);

		if (xas->xa_shift > node->shift)
			break;
		entry = xas_descend(xas, node);
	}
	return entry;
}

The only thing that looks even remotely barrier-like or atomic in here is
xa_entry(), where it calls:

	return rcu_dereference_check(node->slots[offset],
					lockdep_is_held(&xa->xa_lock));

which is only going to insert barriers on Alpha. This is better than what
the radix tree does, which uses rcu_dereference_raw() all over the place
with an occasional rcu_dereference().

> The system call path is one of the things we consider a fast path in
> the subsystem, so care has been taken to keep things out of the
> way. This code got a nice 6% bump when it was moved from the
> open-coded tree scheme to the radix tree, in part because we get this
> nice fast 3-level lookup, compared to the deeper and very branch-heavy
> version that came before.
>
> However, this code is still being renovated and we haven't yet put the
> performance-critical stuff into ioctl. So if the xarray can't handle
> this efficiently then please just replace it with a full lookup.
>
> We can benchmark again and if it is bad we can burn some memory to
> pre-compute equivalent linear lookup tables under uverbs's control so
> the usage doesn't trouble the xarray.

That might be better ...
just use the xarray to look up the base of your own private data structure and put a pointer to that into your bundle_priv.