RE: NFSoRDMA Fails for max_sge Less Than 18


 



> Hi Steve-
> 
> 
> > On Jan 11, 2017, at 12:04 PM, Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Hey Chuck,
> >
> >>>
> >>> Browsing the code of other drivers it can be seen that this ability is
> > either
> >> hardcoded or is
> >>> learnt by the driver from the device.
> >>
> >> In the latter case, there's no way for me to know what that
> >> capability is by looking at kernel code. There's also no way
> >> for me to know about out-of-tree drivers or pre-release devices.
> >
> > But shouldn't NFS always limit its sge depths based on
> > ib_device_attr->max_sge?
> > I don't think it is reasonable to assume any minimum value supported by all
> > devices...
> 
> The previous "minimum" requirement assumed 2 SGEs. RPC-over-RDMA just
> doesn't work with fewer than that.
> 
> So you would rather have RPC-over-RDMA automatically reduce the inline
> threshold setting for each connection if the current system setting
> cannot be supported by the device? I'll consider that.
> 

That seems reasonable.

> Keeping logic in the ULPs to handle devices with tiny capabilities
> duplicates a lot of complexity and impedes the introduction of new
> features like larger inline thresholds.
> 
> The one abstract feature ULPs might really want is the ability to
> send medium-sized data payloads in place. More than a handful of
> SGEs is needed for that capability (in the kernel, where such payloads
> are typically in the page cache).
> 
> It might be cool to have an API similar to rdma_rw that allows ULPs
> to use a scatterlist for Send and Receive operations. It could hide
> the driver and device maximum SGE values.
>

I'm not sure what you mean by "in place" (sorry for not being up to speed on
this whole issue).  But perhaps an API like this could be added to rdma_rw...


> 
> >> It's not feasible for me to stock my lab with more than a
> >> couple of devices anyway.
> >>
> >> For all these reasons, I rely on HCA vendors for smoke testing
> >> NFS/RDMA with their devices.
> >>
> >> [1] was posted for review on public mailing lists for weeks. I
> >> received no review comments or reports of testing successes or
> >> failures from any vendor, until Broadcom's report in late
> >> December, three months after [1] appeared in a kernel release
> >> candidate.
> >>
> >> This may sound like sour grapes, but this is a review and
> >> testing gap, and I think the community should have the ability
> >> to address it.
> >>
> >> HCA vendors, especially, have to focus on kernel release
> >> candidate testing if functional ULPs are a critical release
> >> criterion for them.
> >>
> >
> > You're absolutely right.  I'm querying Chelsio to see how this might have
> > slipped through the cracks.  Did this initial change land in linux-4.9?
> 
> I believe so.
> 
> 
> > I have one nit though, your patch series are always very long and thus, to
> > me, tedious to review.  It would be nice to see 5-8 patches submitted for
> > review vs 15+.
> 
> I cap my patch series around 20 for just this reason. That
> seemed to be the average number being posted for other ULPs.
> 
> The flip side is that sometimes it takes several quarters to
> get a full set of changes upstream. Splitting features across
> kernel releases means the feature can't be reviewed together,
> and is sometimes more difficult for distribution backports.
> 
> I could also go with more, smaller patches, where each patch is
> easier to review, or cap a series at 8 patches where each patch
> is more complex.
> 
> It's also OK to suggest series reorganization whenever you
> feel the ennui. ;-)
> 
> 
> >>> If I'm not mistaken, this issue affects nes and
> >>> cxgb3/4 drivers, and perhaps others.
> >>
> >> ocrdma and Oracle's HCA.
> >>
> >>
> >>> E.g., for cxgb4:
> >>>
> >>>       #define T4_MAX_RECV_SGE 4
> >>
> >> Yet, without hard-coded max_sge values in kernel drivers, it's
> >> difficult to say whether 4 is truly the lower bound.
> >>
> >>
> >>>       static int c4iw_query_device(struct ib_device *ibdev,
> >>>                                    struct ib_device_attr *props,
> >>>                                    struct ib_udata *uhw)
> >>>       {
> >>>               ...
> >>>               props->max_sge = T4_MAX_RECV_SGE;
> >>>
> >>> ***
> >
> > FYI: cxgb4 supports a max of 4 SGEs for recv WRs, and 17 for send WRs.
> > Perhaps 17 avoided any problems for cxgb4 with the original code?
> 
> The original code needed only two SGEs for sending, and one for
> receiving.
> 
> IIRC the RPC-over-RDMA receive path still needs just one SGE.
> 

No, I mean the code that bumps it up to 18.  Would that cause an immediate
failure if cxgb4 supports only 17 and enforces the limit only at post_send()
time?  (I haven't looked in detail at your patches... sorry.)  Our QA ran
testing on 4.9 and didn't see this issue, so that's why I'm asking.  They have
not yet run NFS/RDMA testing on 4.9-rc.  I've asked them to do a quick
regression test asap.


> 
> > Note: ib_device_attr only has a max_sge that pertains to both send and
> > recv, so cxgb4 sets it to the min value.  We should probably add a
> > max_recv_sge and max_send_sge to ib_device_attr...
> 
> I could go for that too.
>

I'm too swamped right now to add this myself, but the changes should be trivial...





