> On Mar 10, 2016, at 10:04 AM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
>>>> Moving the QP into error state right after with rdma_disconnect
>>>> you are not sure that none of the subset of the invalidations
>>>> that _were_ posted completed and you get the corresponding MRs
>>>> in a bogus state...
>>>
>>> Moving the QP to error state and then draining the CQs means
>>> that all LOCAL_INV WRs that managed to get posted will get
>>> completed or flushed. That's already handled today.
>>>
>>> It's the WRs that didn't get posted that I'm worried about
>>> in this patch.
>>>
>>> Are there RDMA consumers in the kernel that use that third
>>> argument to recover when LOCAL_INV WRs cannot be posted?
>>
>> None :)
>>
>>>>> I suppose I could reset these MRs instead (that is,
>>>>> pass them to ib_dereg_mr).
>>>>
>>>> Or, just wait for a completion for those that were posted
>>>> and then all the MRs are in a consistent state.
>>>
>>> When a LOCAL_INV completes with IB_WC_SUCCESS, the associated
>>> MR is in a known state (ie, invalid).
>>>
>>> The WRs that flush mean the associated MRs are not in a known
>>> state. Sometimes the MR state is different than the hardware
>>> state, for example. Trying to do anything with one of these
>>> inconsistent MRs results in IB_WC_BIND_MW_ERR until the thing
>>> is deregistered.
>>
>> Correct.
>>
>
> It is legal to invalidate an MR that is not in the valid state. So you don't
> have to deregister it, you can assume it is valid and post another LINV WR.

I've tried that. Once the MR is inconsistent, even LOCAL_INV does not
work. There's no way to tell whether the MR is consistent or not after
a connection loss, so the only recourse is to deregister (and
reregister) the MR when LOCAL_INV is flushed.

>
>>> The xprtrdma completion handlers mark the MR associated with
>>> a flushed LOCAL_INV WR "stale". They all have to be reset with
>>> ib_dereg_mr to guarantee they are usable again. Have a look at
>>> __frwr_recovery_worker().
>>
>> Yes, I'm aware of that.
>>
>>> And, xprtrdma waits for only the last LOCAL_INV in the chain to
>>> complete. If that one isn't posted, then fr_done is never woken
>>> up. In that case, frwr_op_unmap_sync() would wait forever.
>>
>> Ah.. so the (missing) completions is the problem, now I get
>> it.
>>
>>> If I understand you I think the correct solution is for
>>> frwr_op_unmap_sync() to regroup and reset the MRs associated
>>> with the LOCAL_INV WRs that were never posted, using the same
>>> mechanism as __frwr_recovery_worker().
>>
>> Yea, I'd recycle all the MRs instead of having non-trivial logic
>> to try and figure out MR states...
>>
>>> It's already 4.5-rc7, a little late for a significant rework
>>> of this patch, so maybe I should drop it?
>>
>> Perhaps... Although you can make it incremental because the current
>> patch doesn't seem to break anything, just not solving the complete
>> problem...

--
Chuck Lever
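
[Editor's note] The recovery discussed above boils down to recycling any FRWR MR
whose LOCAL_INV was flushed or never posted: deregister it and allocate a
replacement, rather than trying to reason about its state. Below is a minimal
sketch of that idea using the kernel verbs API (ib_alloc_mr()/ib_dereg_mr());
the wrapper struct and function names (my_frwr, recycle_frwr_mr) are
hypothetical illustrations, not the actual xprtrdma __frwr_recovery_worker()
code.

/*
 * Sketch: recycle an FRWR MR whose LOCAL_INV was flushed (or never
 * posted) after connection loss. Because the MR may be in an
 * inconsistent state, the safe recovery is to deregister it and
 * allocate a fresh one. Names below are illustrative only.
 */
#include <linux/err.h>
#include <linux/printk.h>
#include <rdma/ib_verbs.h>

struct my_frwr {
	struct ib_mr	*mr;
	struct ib_pd	*pd;
	unsigned int	depth;		/* max scatter/gather entries */
};

static int recycle_frwr_mr(struct my_frwr *f)
{
	struct ib_mr *new_mr;
	int rc;

	/* Allocate the replacement first, so a failure here leaves
	 * the old (stale but still accounted-for) MR in place. */
	new_mr = ib_alloc_mr(f->pd, IB_MR_TYPE_MEM_REG, f->depth);
	if (IS_ERR(new_mr))
		return PTR_ERR(new_mr);

	/* Release the stale MR; its state is unknown after the flush. */
	rc = ib_dereg_mr(f->mr);
	if (rc)
		pr_warn("frwr recycle: ib_dereg_mr failed (%d)\n", rc);

	f->mr = new_mr;
	return 0;
}

Allocating the replacement before deregistering the stale MR keeps the MR
count stable if ib_alloc_mr() fails under memory pressure; the caller can
simply retry the recycle later.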