Re: [PATCH v2 19/20] IB/rdmavt, IB/qib, IB/hfi1: Make percpu refcount optional for user MRs

Leon Romanovsky <leon@xxxxxxx> · Sun, 9 Apr 2017 09:26:02 +0300

On Fri, Apr 07, 2017 at 04:06:18PM -0600, Jason Gunthorpe wrote:
> On Fri, Apr 07, 2017 at 09:12:34PM +0000, Marciniszyn, Mike wrote:
> > > Umm.. This doesn't look like a refcount, it is a rwlock - why aren't you using
> > > the optimized percpu_rwsem?
> > >
> >
> > The refcount with a completion has been in qib and rdmavt for years
> > without issue.
>
> Doesn't change the fact this isn't a refcount behavior, it is a rwsem
> with write lock on destroy. A proper refcount would destroy the object,
> not call a completion.
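
For readers following along, the distinction drawn above can be sketched in userspace C. This is only an illustration, not rdmavt or qib code: `demo_mr`, `demo_mr_get` and `demo_mr_put` are hypothetical names, and C11 atomics stand in for the kernel's reference-counting helpers. The point it shows is that the final put itself releases the object, so nobody has to sleep on a completion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a memory region object. */
struct demo_mr {
    atomic_int refs;                      /* number of live references */
    void (*release)(struct demo_mr *mr);  /* runs once, on the final put */
};

static int demo_released;  /* counts release calls, for demonstration only */

static void demo_mr_release(struct demo_mr *mr)
{
    demo_released++;
    free(mr);
}

/* Allocate an object that starts with one reference held by the caller. */
static struct demo_mr *demo_mr_alloc(void)
{
    struct demo_mr *mr = malloc(sizeof(*mr));

    if (!mr)
        return NULL;
    atomic_init(&mr->refs, 1);
    mr->release = demo_mr_release;
    return mr;
}

static void demo_mr_get(struct demo_mr *mr)
{
    atomic_fetch_add(&mr->refs, 1);
}

/* Returns 1 if this put dropped the last reference and released the object. */
static int demo_mr_put(struct demo_mr *mr)
{
    int old = atomic_fetch_sub(&mr->refs, 1);

    assert(old > 0);  /* a put without a matching get is a bug */
    if (old == 1) {
        mr->release(mr);
        return 1;
    }
    return 0;
}
```

With this shape, "destroy" is just the owner dropping its initial reference; whichever user drops the last one pays for the release.
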
>
> Doing things properly using the common primitives makes stuff work
> better, eg percpu_rwsem has sane lockdep.
>
> > All this being said, we have encountered a use case where the MR is
> > short lived and supports just one transaction.
>
> Well, yes, that is a pretty common idiom in kernel workloads too..
>
> > I have a prototype patch to pass a hint (no module parameter) to the
> > user MR registration via the access flags.
>
> Okay, so you'd have an IBV_MR_MULTI_THREADED to enable the RCU
> optimization?

The flag is not needed for kernel paths (RCU optimization).
There is a get_nr_threads(struct task_struct *tsk) call to get the number of
threads. However, I don't know whether it is appropriate to use that function
in driver code.

If the goal is to optimize the user space drivers, the flag will indeed be needed.

>
> That seems sort of consistent with some of the other flags we've had
> in the past (eg single threaded CQ polling optimization)
>
> > I don't think a two order of magnitude improvement is a micro optimization.
>
> The micro optimization was trying to optimize rwlock with percpu and
> RCU. The two order of magnitude penalty on the destroy and the new
> need for tuning knobs is the penalty for that.
>
> I doubt the percpu optimization was two orders of magnitude..
>
> > So the RCU grace period is problematic in this context as well.
>
> Of course, RCU is not designed to have these kinds of performance
> characteristics. If you define destroy to be a hot path then you can't
> use RCU here, the worst case RCU grace period times are potentially
> quite big..
>
> This is why you shouldn't have the RCU optimization on by default at
> all.
>
> Usually RCU grace period latency is solved by deferring the write side
> to an async rcu grace period callback - why not do that instead of
> adding a flag? It feels like destroy is a reasonable candidate to do
> that kind of trick.
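
The deferral idea can be modelled in a few lines of plain C. This is a single-threaded sketch under loud assumptions: there is no real RCU here, `demo_grace_period_elapsed()` merely pretends a grace period has passed, and every `demo_*` name is hypothetical. In the kernel the corresponding primitive would be call_rcu(); the sketch only shows that the destroy call returns at once and the actual teardown runs later from a callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical deferred-callback node, loosely modelled on the kernel's rcu_head. */
struct demo_cb {
    struct demo_cb *next;
    void (*func)(struct demo_cb *cb);
};

static struct demo_cb *demo_pending;  /* callbacks waiting for the "grace period" */

/* Queue the callback and return immediately; the caller never waits. */
static void demo_call_deferred(struct demo_cb *cb, void (*func)(struct demo_cb *))
{
    cb->func = func;
    cb->next = demo_pending;
    demo_pending = cb;
}

/* Stand-in for "a grace period has elapsed": run everything queued so far.
 * Returns the number of callbacks that ran. */
static int demo_grace_period_elapsed(void)
{
    struct demo_cb *list = demo_pending;
    int ran = 0;

    demo_pending = NULL;
    while (list) {
        struct demo_cb *next = list->next;  /* read before func may reuse the node */

        list->func(list);
        list = next;
        ran++;
    }
    return ran;
}

/* Hypothetical object whose teardown is deferred. */
struct demo_region {
    int destroyed;
    struct demo_cb cb;
};

static void demo_region_free(struct demo_cb *cb)
{
    /* Recover the enclosing object from its embedded callback node. */
    struct demo_region *r =
        (struct demo_region *)((char *)cb - offsetof(struct demo_region, cb));

    r->destroyed = 1;
}

/* Destroy path: only queues the teardown, so it does not block. */
static void demo_region_destroy(struct demo_region *r)
{
    demo_call_deferred(&r->cb, demo_region_free);
}
```

The cost moves off the caller of destroy, but the object's memory stays pinned until the callback runs, so this trades latency for a short-lived increase in footprint.
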
>
> Perhaps some kind of enhancement to percpu_rwsem such that it would
> asynchronously call a function with the write side lock held? Looks
> not too hard..
>
> Jason