Re: uverbs radix tree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Sep 26, 2018 at 02:08:15PM -0700, Matthew Wilcox wrote:
> On Wed, Sep 26, 2018 at 02:44:46PM -0600, Jason Gunthorpe wrote:
> > On Wed, Sep 26, 2018 at 01:23:41PM -0700, Matthew Wilcox wrote:
> > > CPU A:
> > >
> > >         srcu_key = srcu_read_lock(&file->device->disassociate_srcu);
> > >         err = ib_uverbs_cmd_verbs(file, &hdr, user_hdr->attrs);
> > > ->
> > >         slot = radix_tree_iter_lookup(
> > >                 &uapi->radix, &attrs_iter,
> > >                 uapi_key_obj(hdr->object_id) |
> > >                         uapi_key_ioctl_method(hdr->method_id));
> > > 
> > > CPU B:
> > >         radix_tree_for_each_slot (slot, &uapi->radix, &iter, 0) {
> > >                 kfree(rcu_dereference_protected(*slot, true));
> > >                 radix_tree_iter_delete(&uapi->radix, &iter, slot);
> > >         }
> > > 
> > > radix_tree_iter_delete() ends up calling radix_tree_node_free()
> > > which does a call_rcu(&node->rcu_head, radix_tree_node_rcu_free)
> > 
> > > You're holding an srcu_read_lock(), not an rcu_read_lock(), so
> > > as far as RCU is concerned, you're not within a grace period, and
> > > it doesn't need to wait for you.
> > 
> > Hmm....
> > 
> > This usage of the radix tree is not using RCU. The tree is created at
> > the start of time and is used read-only until the final kref is put,
> > then it is destroyed.
> > 
> > The race above cannot exist because the kref on file->device will not
> > go to zero while CPU A is running that code above. Specifically the
> > file's kref is held by a struct file * and the fs core will hold the
> > kref on struct file during execution of the ioctl file_operation.
> 
> Ah, OK.  That makes sense, but good luck explaining it to lockdep ;-)

I haven't noticed anything weird with lockdep, where do you see it
will have trouble?

> > I think the mistake here is the misleading use of srcu_dereference():
> > 
> >    srcu_dereference(
> > 		*slot,
> > 		&pbundle->bundle.ufile->device->disassociate_srcu);
> > 
> > That should just be rcu_dereference_protected(*slot, true), I think?
> 
> aka rcu_dereference_raw()?  Yes, that's probably better.

I've been following the docs in Documentation/RCU/lockdep.txt:

        rcu_dereference_raw(p):
                Don't check.  (Use sparingly, if at all.)
        rcu_dereference_protected(p, c):
                Use explicit check expression "c", and omit all barriers
                and compiler constraints.  This is useful when the data
                structure cannot change, for example, in code that is
                invoked only by updaters.

In this case 'the data structure cannot change' as we are protected
from that by kref.

Do you know if the docs are wrong now and _raw() is now the right
choice?

> > > I'm _really_ not a fan of the way you pull various things out of the
> > > radix_tree_iter and keep references to them in the pbundle.  Can we
> > > figure out a better way to do this?  I don't have a clear picture of
> > > your requirements at this stage.
> > 
> > Yes, I figured that would come up with your xarray work.
> > 
> > All that is needed here is a 'lookup with hint' API. Something like:
> > 
> >  struct iter last_position;
> > 
> >  lookup(0x120, &last_position);
> >  lookup_hint(0x123, &last_position);
> >  lookup_hint(0x124, &last_position);
> > 
> > Where lookup_hint is fast if last_position already points at the
> > correct chunk. Otherwise it will search again and update
> > last_position.
> > 
> > The last time I looked at the xarray code I thought this would be
> > straightforward to add as an API, and I was thinking of proposing
> > something like the above when xarray is finally merged.
> 
> I can work with that.  I'll send a suggested patch tomorrow.
> Probably something like void *xas_move(struct xa_state *, unsigned
> long index); I already have xas_next() and xas_prev() which do the
> moral equivalent of p = xa[++i] and p = xa[--i], so having an
> xas_move() would fit in.

So we could write it as:

 XA_STATE(iter, radix, 0x120);

 value = xas_load(&iter);
 value = xas_move(0x124, &iter); /* ie + 4 */
 value = xas_move(0x123, &iter); /* ie - 1 */

?

I guess it is worth pointing out the only reason it is like this is
performance on a system call path. The radix tree version is one
compare and an add in the typical case, it would be nice to stay
within a similar envelope.

ie update of the state's offset and index on every lookup is not
something that is really needed for this use case.

Also, currently the info is in the 'pbundle' thing.. Does this need
some rework to keep the XA_STATE separated on the stack? Or something
else?

Are you looking at this because you plan to convert this code in the
next merge window? Do you have a way we can merge that conversion
through the RDMA git tree? I am willing to help with that.. Otherwise
I fear we will have monster conflicts for Linus.

I was expecting to do the conversion in the cycle after xarray
lands...

Thanks,
Jason



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux