RE: [PATCH v2 19/20] IB/rdmavt, IB/qib, IB/hfi1: Make percpu refcount optional for user MRs

> Umm.. This doesn't look like a refcount, it is a rwlock - why aren't you using
> the optimized percpu_rwsem?
> 

The refcount with a completion has been in qib and rdmavt for years without issue.

It is the best way to express having an MR's underlying pages held until all of the "users" have finished.

When the count gets to zero, the completion is triggered and allows a deregistration to complete.
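For illustration only, the pattern can be modeled in userspace C. This is a sketch, not the qib/rdmavt code: struct mr_sketch and the mr_* helpers are hypothetical names, and a pthread condition variable stands in for the kernel completion.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an MR; not the rdmavt structure. */
struct mr_sketch {
    atomic_int refcount;     /* starts at 1: the registration's own reference */
    pthread_mutex_t lock;    /* protects 'done' */
    pthread_cond_t done_cv;  /* signalled when the last reference is dropped */
    bool done;               /* plays the role of the kernel completion */
};

static void mr_init(struct mr_sketch *mr)
{
    atomic_init(&mr->refcount, 1);
    pthread_mutex_init(&mr->lock, NULL);
    pthread_cond_init(&mr->done_cv, NULL);
    mr->done = false;
}

/* A data-path user takes a reference while it uses the MR's pages. */
static void mr_get(struct mr_sketch *mr)
{
    int old = atomic_fetch_add(&mr->refcount, 1);
    assert(old > 0);  /* taking a reference after the count hit zero is a bug */
}

/* Dropping the last reference triggers the "completion". */
static void mr_put(struct mr_sketch *mr)
{
    if (atomic_fetch_sub(&mr->refcount, 1) == 1) {
        pthread_mutex_lock(&mr->lock);
        mr->done = true;
        pthread_cond_signal(&mr->done_cv);
        pthread_mutex_unlock(&mr->lock);
    }
}

/* Deregistration drops the initial reference, then waits for all users. */
static void mr_dereg(struct mr_sketch *mr)
{
    mr_put(mr);
    pthread_mutex_lock(&mr->lock);
    while (!mr->done)
        pthread_cond_wait(&mr->done_cv, &mr->lock);
    pthread_mutex_unlock(&mr->lock);
}
```

Every mr_get()/mr_put() pair hits the same shared counter, which is the source of the cache bouncing described next.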

What we have found is that as lkey and rkey validation calls scale out to many cores/threads, the atomic operation causes excessive cache bouncing.

That is the motivation for using the percpu reference counting: to avoid the bouncing in scaled out data path operations.
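The idea behind a percpu count can be sketched in userspace as a sharded counter. This is a simplified model, not the percpu_ref implementation: the sharded_* names, NSLOTS, and the 64-byte cache line are assumptions, and the sketch leaves out the switch to a single shared counter that the real code performs when the reference is killed.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS 4        /* hypothetical number of "CPUs" */
#define CACHE_LINE 64   /* assumed cache-line size */

/* One counter per slot, each on its own cache line so that
 * updates from different CPUs do not bounce a shared line. */
struct slot {
    _Alignas(CACHE_LINE) atomic_long count;
};

struct sharded_ref {
    struct slot slots[NSLOTS];
};

static void sharded_ref_init(struct sharded_ref *r)
{
    for (size_t i = 0; i < NSLOTS; i++)
        atomic_init(&r->slots[i].count, 0);
}

/* Fast path: touch only the caller's own slot. */
static void sharded_get(struct sharded_ref *r, unsigned cpu)
{
    atomic_fetch_add_explicit(&r->slots[cpu % NSLOTS].count, 1,
                              memory_order_relaxed);
}

/* A put may land on a different slot than the matching get,
 * so an individual slot can go negative; only the sum matters. */
static void sharded_put(struct sharded_ref *r, unsigned cpu)
{
    atomic_fetch_sub_explicit(&r->slots[cpu % NSLOTS].count, 1,
                              memory_order_relaxed);
}

/* Slow path: the total is only meaningful once no new gets can arrive. */
static long sharded_sum(struct sharded_ref *r)
{
    long sum = 0;
    for (size_t i = 0; i < NSLOTS; i++)
        sum += atomic_load_explicit(&r->slots[i].count, memory_order_relaxed);
    return sum;
}
```

The cost moves from the get/put path to the teardown path: before the sum can be trusted, every CPU must be known to have stopped using its own slot, and in the kernel that is where the RCU grace period comes in.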

The API is documented in include/linux/percpu-refcount.h and lib/percpu-refcount.c. Our use is consistent with other callers, and it was pretty much a drop-in replacement for our older atomic operations.

From the above header file:
 * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
 * atomic_dec_and_test() - but percpu.

The percpu rwsem seems more like a drop-in replacement for the older rwlock_t stuff.

> ... micro-optimize it is fantastically ugly/bad
> taste.
> 

All this being said, we have encountered a use case where the MR is short lived and supports just one transaction.

In that case, the RCU quiescence during deregistration IS the performance bottleneck. As cores scale out, the RCU grace period can cause large delays.

I have a prototype patch to pass a hint (no module parameter) to the user MR registration via the access flags.

Before (no hint):
    -- Alloc memory: 4 us
    -- Zero memory: 1933 us
    -- Register: 112 us
    -- Unregister: 6086 us <------
    -- Free memory: 89 us

After (with hint):
    -- Alloc memory: 7 us
    -- Zero memory: 1929 us
    -- Register: 111 us
    -- Unregister: 49 us   <------
    -- Free memory: 85 us

I don't think a two orders of magnitude improvement (6086 us down to 49 us for unregister) is a micro-optimization.

Note that in percpu-rw-semaphore.txt:
   Locking for reading is very fast, it uses RCU and it avoids any atomic
   instruction in the lock and unlock path. On the other hand, locking for
   writing is very expensive, it calls synchronize_rcu() that can take
   hundreds of milliseconds.

So the RCU grace period is problematic in this context as well.

Mike


  

