On Fri, Apr 07, 2017 at 04:06:18PM -0600, Jason Gunthorpe wrote:
> On Fri, Apr 07, 2017 at 09:12:34PM +0000, Marciniszyn, Mike wrote:
> > > Umm.. This doesn't look like a refcount, it is a rwlock - why aren't you using
> > > the optimized percpu_rwsem?
> > >
> >
> > The refcount with a completion has been in qib and rdmavt for years
> > without issue.
>
> Doesn't change the fact this isn't a refcount behavior, it is a rwsem
> with write lock on destroy. A proper refcount would destroy the object
> not call a completion.
>
> Doing things properly using the common primitives makes stuff work
> better, eg percpu_rwsem has sane lockdep.
>
> > All this being said, we have encountered a use case where the MR is
> > short lived and supports just one transaction.
>
> Well, yes, that is a pretty common idiom in kernel workloads too..
>
> > I have a prototype patch to pass a hint (no module parameter) to the
> > user MR registration via the access flags.
>
> Okay, so you'd have a IBV_MR_MULTI_THREADED to enable the RCU
> optimization?

It (the RCU optimization) is not needed for kernel paths. There is a
get_nr_threads(struct task_struct *tsk) call to get the number of
threads. However, I don't know if it is appropriate to use that function
in driver code. If the goal is to optimize the user space drivers, then
the flag will indeed be needed.

>
> That seems sort of consistent with some of the other flags we've had
> in the past (eg single threaded CQ polling optimization)
>
> > I don't think a two order of magnitude improvement is a micro optimization.
>
> The micro optimization was trying to optimize rwlock with percpu and
> RCU. The two order of magnitude penalty on the destroy and the new
> need for tuning knobs is the penalty for that.
>
> I doubt the percpu optimization was two orders of magnitude..
>
> > So the RCU grace period is problematic in this context as well.
>
> Of course, RCU is not designed to have these kinds of performance
> characteristics. If you define destroy to be a hot path then you can't
> use RCU here, the worst case RCU grace period times are potentially
> quite big..
>
> This is why you shouldn't have the RCU optimization on by default at
> all.
>
> Usually RCU grace period latency is solved by deferring the write side
> to an async rcu grace period callback - why not do that instead of
> adding a flag? It feels like destroy is a reasonable candidate to do
> that kind of trick.
>
> Perhaps some kind of enhancement to percpu_rwsem such that it would
> asynchronously call a function with the write side lock held? Looks
> not too hard..
>
> Jason
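
For illustration only, here is a minimal userspace sketch of the
distinction drawn in the quoted text: a "proper" refcount where whoever
drops the last reference runs the release, so destroy never blocks
waiting on readers. Everything here is hypothetical (struct mr, mr_get,
mr_put, mr_destroy, count_release); it is not the qib/rdmavt code, and it
uses C11 atomics as a stand-in for the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an MR object; not the rdmavt structure. */
struct mr {
    atomic_int refs;                /* starts at 1: the owner's reference */
    void (*release)(struct mr *mr); /* runs exactly once, on the last put */
};

void mr_init(struct mr *mr, void (*release)(struct mr *mr))
{
    atomic_init(&mr->refs, 1);
    mr->release = release;
}

/* Take a reference unless the count has already dropped to zero. */
bool mr_get(struct mr *mr)
{
    int old = atomic_load(&mr->refs);

    while (old != 0) {
        /* On failure, old is reloaded and the zero check is repeated. */
        if (atomic_compare_exchange_weak(&mr->refs, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference. Returns true if this call released the object. */
bool mr_put(struct mr *mr)
{
    int old = atomic_fetch_sub(&mr->refs, 1);

    assert(old > 0); /* catch an unbalanced put */
    if (old == 1) {
        mr->release(mr);
        return true;
    }
    return false;
}

/*
 * Destroy only drops the owner's reference and returns immediately.
 * It does not wait for readers, unlike a refcount paired with a completion.
 */
bool mr_destroy(struct mr *mr)
{
    return mr_put(mr);
}

/* Demo release callback that only counts how often it ran. */
int released_count;

void count_release(struct mr *mr)
{
    (void)mr;
    released_count++;
}
```

With one reader holding a reference, mr_destroy() returns false without
blocking and the reader's final mr_put() runs the release. A refcount
paired with a completion would instead make destroy sleep until that
last put, which is the write-lock-like behavior discussed above.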