On 11/27/2018 5:23 PM, Chuck Lever wrote:
> On Nov 27, 2018, at 4:30 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
>> On 11/27/2018 4:21 PM, Chuck Lever wrote:
>>> On Nov 27, 2018, at 4:16 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
>>>> On 11/27/2018 11:11 AM, Chuck Lever wrote:
>>>>> o Select the R_key to invalidate while the CPU cache still contains
>>>>>   the received RPC Call transport header, rather than waiting until
>>>>>   we're about to send the RPC Reply.
>>>>> o Choose Send With Invalidate if there is exactly one distinct R_key
>>>>>   in the received transport header. If there's more than one, the
>>>>>   client will have to perform local invalidation after it has
>>>>>   already waited for remote invalidation.
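For concreteness, the selection rule in the second bullet can be sketched as a small helper. This is an illustrative stand-alone function, not the actual xprtrdma code:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative helper, not the actual xprtrdma code: scan the R_keys
 * found in a received RPC Call's transport header and decide whether
 * Send With Invalidate is worthwhile. Returns the single distinct
 * R_key if there is exactly one, or 0 to signal "invalidate locally".
 */
static uint32_t choose_invalidate_rkey(const uint32_t *rkeys, size_t count)
{
	uint32_t candidate;
	size_t i;

	if (count == 0)
		return 0;
	candidate = rkeys[0];
	for (i = 1; i < count; i++) {
		/* A second distinct R_key means the client must post a
		 * LOCAL INV anyway, so remote invalidation buys nothing. */
		if (rkeys[i] != candidate)
			return 0;
	}
	return candidate;
}
```

With one distinct R_key the function returns it for use in the Send With Invalidate; otherwise the caller falls back to purely local invalidation.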
>>>> What's the reason for remote-invalidating only if exactly one
>>>> region is targeted? It seems valuable to save the client the work,
>>>> no matter how many regions are used.
>>> Because remote invalidation delays the Receive completion.
>> Well yes, but the invalidations have to happen before the reply is
>> processed, and remote invalidation saves a local work request plus
>> its completion.
> That is true only if remote invalidation can knock down all the
> R_keys for that RPC. If there's more than one R_key for that RPC,
> a local invalidation is needed anyway, so there are no savings;
> instead there is the cost of the extra latency of waiting twice.
> A couple of details to note:
> - remote invalidation is only available with FRWR, which
>   invalidates asynchronously
> - a smart FRWR client implementation will post a chain of LOCAL
>   INV WRs, then wait for the last one to signal completion. That's
>   just one doorbell, one interrupt, and one context switch no
>   matter how many LOCAL INV WRs are needed.
> So if the client still has to do even one local invalidation, it's
> not worth the trouble to remotely invalidate.
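The chained-LOCAL-INV pattern described above can be sketched with a simplified mock of a send work request. The struct and field names below are illustrative, loosely modeled on the verbs ib_send_wr/ibv_send_wr structures rather than copied from them:

```c
#include <stdint.h>
#include <stddef.h>

enum { SEND_SIGNALED = 1 };

/* Simplified mock of a send work request; field names are
 * illustrative, not the real verbs API. */
struct inv_wr {
	struct inv_wr *next;
	uint32_t invalidate_rkey;
	int send_flags;
};

/* Chain n LOCAL INV requests together and request a completion only
 * for the last one: one doorbell, one interrupt, and one context
 * switch regardless of how many R_keys must be knocked down. */
static void chain_local_invalidates(struct inv_wr *wrs,
				    const uint32_t *rkeys, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		wrs[i].invalidate_rkey = rkeys[i];
		wrs[i].send_flags = (i == n - 1) ? SEND_SIGNALED : 0;
		wrs[i].next = (i == n - 1) ? NULL : &wrs[i + 1];
	}
}
```

The whole chain is then handed to the provider in a single post; only the final WR's completion needs to be reaped before the MRs can be reused.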
I still don't agree about "not worth" it, but it's a choice.
Just a couple of other notes:
>> Have you measured the difference?
> Yes, as reported in the patch description. Perhaps I can include
> some interesting iozone results.
I didn't see anything about this in the patch description, but I was
not arguing for including this kind of detail, just asking whether you
had actually measured it. I'm interested in that, btw.
> This behavior seems to be typical of most recent hardware. I suspect
> there is some locking of the hardware Send queue to handle remote
> invalidation that contends with actual posted WRs from the host.
That would be really bad. Have you reported this to the vendors?
> With cards that have a shallow FR depth, multiple MRs/R_keys are
> required to register a single 1MB NFS READ or WRITE. Here's where
> squelching remote invalidation really pays off.
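For a sense of scale, the number of MRs (and thus R_keys) needed per payload can be estimated as follows; the 4KB page size and the FR depth values used in the test are illustrative assumptions:

```c
#include <stddef.h>

/* Back-of-envelope estimate: how many FRWR MRs (hence R_keys) are
 * needed to register one NFS READ/WRITE payload, assuming 4KB pages
 * and one page per FR page-list entry. */
static unsigned int mrs_needed(size_t payload_bytes, unsigned int fr_depth)
{
	size_t pages = (payload_bytes + 4095) / 4096;

	return (unsigned int)((pages + fr_depth - 1) / fr_depth);
}
```

A 1MB payload is 256 pages, so a card with an FR depth of 256 covers it with one MR, while a shallow depth of 16 requires 16 MRs, and remote invalidation could knock down only one of them.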
Sure, but such a bandwidth-dominated workload isn't very interesting
performance-wise. With 1MB ops I would expect you to be wire-limited,
right?
>> I think it would be best to capture some or all of this
>> explanation in the commit message, in any case.
> You mean you want my patch description to explain _why_ ? ;-)
Sort of. My belief is that this decision represents a micro-optimization
and is unlikely to be forever true. More significantly, it's not a
bug fix or correctness issue. So, capturing the reasoning behind
it is useful for the future, in case someone thinks to unwind it.
Tom.