Re: NFSoRDMA Fails for max_sge Less Than 18

> On Jan 11, 2017, at 2:53 PM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
>> Hi Steve-
>> 
>> 
>>> On Jan 11, 2017, at 12:04 PM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx>
>> wrote:
>>> 
>>> Hey Chuck,
>>> 
>>>>> 
>>>>> Browsing the code of other drivers it can be seen that this ability
>>>>> is either hardcoded or is learnt by the driver from the device.
>>>> 
>>>> In the latter case, there's no way for me to know what that
>>>> capability is by looking at kernel code. There's also no way
>>>> for me to know about out-of-tree drivers or pre-release devices.
>>> 
>>> But shouldn't NFS always limit its sge depths based on
>>> ib_device_attr->max_sge?
>>> I don't think it is reasonable to assume any minimum value supported by
>>> all devices...
>> 
>> The previous "minimum" requirement assumed 2 SGEs. RPC-over-RDMA just
>> doesn't work with fewer than that.
>> 
>> So you would rather have RPC-over-RDMA automatically reduce the inline
>> threshold setting for each connection if the current system setting
>> cannot be supported by the device? I'll consider that.
>> 
> 
> That seems reasonable.
> 
>> Keeping logic in the ULPs to handle devices with tiny capabilities
>> duplicates a lot of complexity and impedes the introduction of new
>> features like larger inline thresholds.
>> 
>> The one abstract feature ULPs might really want is the ability to
>> send medium-sized data payloads in place. More than a handful of
>> SGEs is needed for that capability (in the kernel, where such payloads
>> are typically in the page cache).
>> 
>> It might be cool to have an API similar to rdma_rw that allows ULPs
>> to use a scatterlist for Send and Receive operations. It could hide
>> the driver and device maximum SGE values.
>> 
> 
> I'm not sure what you mean by "in place"?  (Sorry for not being up to speed
> on this whole issue.)  But perhaps some API like this could be added to
> rdma_rw...

"in place" == the SGE array would point to struct pages
containing parts of the message payload. That's basically what
this "support large inline threshold" patch is doing.

If the device supports only 4 SGEs, then the largest
message size that can be sent this way is just one or two
pages.
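
The arithmetic works out roughly like this. A hedged sketch only, assuming
4 KB pages and that the transport header, the xdr_buf head, and an optional
tail each consume one SGE; the helper name and the fixed SGE counts are mine
for illustration, not actual xprtrdma code:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumed 4 KB pages for illustration */

/* Sketch: a Send typically needs one SGE for the RPC-over-RDMA
 * transport header and one for the xdr_buf head; a tail, when
 * present, takes a third. Each remaining SGE can point at one
 * page of the payload "in place". */
static size_t max_in_place_payload(unsigned int max_sge, int has_tail)
{
	unsigned int fixed = 2 + (has_tail ? 1 : 0);

	if (max_sge <= fixed)
		return 0;
	return (size_t)(max_sge - fixed) * PAGE_SIZE;
}
```

With max_sge == 4 this yields one page (tail present) or two pages
(no tail), which is where "one or two pages" comes from.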

Some would prefer to send much larger payloads this way.

I guess what I'm asking is whether 4 SGEs is going to be typical
of HCAs going forward, or whether there is a definite trend for
adding more in new device designs.


>>>> It's not feasible for me to stock my lab with more than a
>>>> couple of devices anyway.
>>>> 
>>>> For all these reasons, I rely on HCA vendors for smoke testing
>>>> NFS/RDMA with their devices.
>>>> 
>>>> [1] was posted for review on public mailing lists for weeks. I
>>>> received no review comments or reports of testing successes or
>>>> failures from any vendor, until Broadcom's report in late
>>>> December, three months after [1] appeared in a kernel release
>>>> candidate.
>>>> 
>>>> This may sound like sour grapes, but this is a review and
>>>> testing gap, and I think the community should have the ability
>>>> to address it.
>>>> 
>>>> HCA vendors, especially, have to focus on kernel release
>>>> candidate testing if functional ULPs are a critical release
>>>> criterion for them.
>>>> 
>>> 
>>> You're absolutely right.  I'm querying Chelsio to see how this might have
>>> slipped through the cracks.  Did this initial change land in linux-4.9?
>> 
>> I believe so.
>> 
>> 
>>> I have one nit though, your patch series are always very long and thus,
>>> to me, tedious to review.  It would be nice to see 5-8 patches submitted
>>> for review vs 15+.
>> 
>> I cap my patch series around 20 for just this reason. That
>> seemed to be the average number being posted for other ULPs.
>> 
>> The flip side is that sometimes it takes several quarters to
>> get a full set of changes upstream. Splitting features across
>> kernel releases means the feature can't be reviewed together,
>> and is sometimes more difficult for distribution backports.
>> 
>> I could also go with more, smaller patches, where each patch is
>> easier to review, or cap the series at 8 patches, each of which
>> is more complex.
>> 
>> It's also OK to suggest series reorganization whenever you
>> feel the ennui. ;-)
>> 
>> 
>>>>> If I'm not mistaken, this issue affects nes and
>>>>> cxgb3/4 drivers, and perhaps others.
>>>> 
>>>> ocrdma and Oracle's HCA.
>>>> 
>>>> 
>>>>> E.g., for cxgb4:
>>>>> 
>>>>>      #define T4_MAX_RECV_SGE 4
>>>> 
>>>> Yet, without hard-coded max_sge values in kernel drivers, it's
>>>> difficult to say whether 4 is truly the lower bound.
>>>> 
>>>> 
>>>>>      static int c4iw_query_device(struct ib_device *ibdev,
>>>>>                                   struct ib_device_attr *props,
>>>>>                                   struct ib_udata *uhw)
>>>>>      {
>>>>>              ...
>>>>>              props->max_sge = T4_MAX_RECV_SGE;
>>>>> 
>>>>> ***
>>> 
>>> FYI:  cxgb4 supports 4 max for recv WRs, and 17 max for send WRs.
>>> Perhaps 17 avoided any problems for cxgb4 with the original code?
>> 
>> The original code needed only two SGEs for sending, and one for
>> receiving.
>> 
>> IIRC the RPC-over-RDMA receive path still needs just one SGE.
>> 
> 
> No, I mean the code that bumps it up to 18.  Would that cause an immediate
> failure if cxgb4 supported 17 and only enforced it at post_send() time?

"mount" would fail immediately if the driver reported max_sge == 17.
The check that Ram mentioned happens at mount time, before anything
has been sent.
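
The alternative I floated earlier (reducing the inline threshold instead of
failing the mount) might look roughly like this. A sketch only, with made-up
names and a simplified one-SGE-for-the-header accounting, not the actual
xprtrdma logic:

```c
#include <assert.h>

#define PAGE_SIZE 4096  /* assumed 4 KB pages for illustration */

/* Sketch: rather than refusing the connection when the device
 * cannot back the configured inline threshold, clamp the
 * threshold to what the device's max_sge can support. One SGE
 * is reserved for the transport header; the rest each carry one
 * payload page. */
static unsigned int clamped_inline_threshold(unsigned int configured,
					     unsigned int dev_max_sge)
{
	unsigned int supported;

	if (dev_max_sge < 2)	/* header plus at least one payload SGE */
		return 0;
	supported = (dev_max_sge - 1) * PAGE_SIZE;
	return configured < supported ? configured : supported;
}
```

Under this accounting a device advertising max_sge == 17 would simply get a
smaller threshold instead of an immediate mount failure.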


> (I haven't looked in detail at your patches... sorry.)  Our QA ran testing
> on 4.9 and didn't see this issue, which is why I'm asking.  They have not
> yet run NFS/RDMA testing on 4.9-rc.  I've asked them to do a quick
> regression test asap.

That's curious!


>>> Note: the ib_device_attr only has a max_sge that pertains to both send
>>> and recv, so cxgb4 sets it to the min value.  We should probably add a
>>> max_recv_sge and max_send_sge to ib_device_attr...
>> 
>> I could go for that too.
>> 
> 
> I'm swamped right now to add this, but the changes should be trivial...

Maybe I could get to it, but no promises.
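
For what it's worth, a minimal sketch of what that split might look like.
The struct and field names are modeled on ib_device_attr but hypothetical,
and cxgb4's 17-send/4-recv limits are used purely for illustration:

```c
#include <assert.h>

/* Sketch of the proposed attribute split: separate send and receive
 * SGE limits, so a device with asymmetric capabilities need not
 * advertise the minimum of the two. Not the in-tree API. */
struct sge_limits {
	unsigned int max_send_sge;
	unsigned int max_recv_sge;
};

/* A driver still filling the legacy combined max_sge field would
 * report the smaller of the two, which is what cxgb4 does today. */
static unsigned int legacy_max_sge(const struct sge_limits *lim)
{
	return lim->max_send_sge < lim->max_recv_sge ?
	       lim->max_send_sge : lim->max_recv_sge;
}
```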


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux