Re: problem on nfsd doing RDMA write

Chuck Lever <chuck.lever@xxxxxxxxxx> · Wed, 31 Jan 2018 16:52:31 -0500

> On Jan 31, 2018, at 4:42 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> 
> On Wed, Jan 31, 2018 at 4:13 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>> 
>> 
>>> On Jan 31, 2018, at 2:51 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>> 
>>> 
>>> 
>>>> On Jan 31, 2018, at 1:52 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>> 
>>>> On Wed, Jan 31, 2018 at 1:29 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>> 
>>>>> 
>>>>>> On Jan 30, 2018, at 12:02 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Mon, Jan 29, 2018 at 12:24 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Bruce/Chuck,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There is a problem with nfsd doing a WRITE of size 4093,4094,4095. To
>>>>>>>>>>>>>> reproduce, mount with RDMA and do a simple dd "dd if=/dev/zero
>>>>>>>>>>>>>> of=/mnt/testfile bs=4093 count=1". What happens is that nfsd fails to
>>>>>>>>>>>>>> parse GETATTR after the WRITE in the compound. It fails the operation
>>>>>>>>>>>>>> with ERR_OP_ILLEGAL.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The problem happens for the values where XDR round up ends up rounding
>>>>>>>>>>>>>> up to the page size. I don't know if my fix is the appropriate way to
>>>>>>>>>>>>>> fix this but with it I don't get the error:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>>> index 2c61c6b..a8489c3 100644
>>>>>>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compounda
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> len = XDR_QUADLEN(write->wr_buflen) << 2;
>>>>>>>>>>>>>> if (len >= avail) {
>>>>>>>>>>>>>> -               int pages;
>>>>>>>>>>>>>> +               int pages = 0;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>         len -= avail;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>>>>>>>>> +               if (write->wr_buflen >= PAGE_SIZE)
>>>>>>>>>>>>>> +                       pages = len >> PAGE_SHIFT;
>>>>>>>>>>>>>>         argp->pagelist += pages;
>>>>>>>>>>>>>>         argp->pagelen -= pages * PAGE_SIZE;
>>>>>>>>>>>>>>         len -= pages * PAGE_SIZE;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So the problem is the using "len" instead of "write->wr_buflen" leads
>>>>>>>>>>>>>> for the values 4093,4094,4095 that are rounded up to 4096, it makes
>>>>>>>>>>>>>> pages=1 and the argp->pagelen ends up being a negative value leading
>>>>>>>>>>>>>> to problems of parsing the GETATTR.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Would this also be a problem near any page boundary, say, a
>>>>>>>>>>>>> write length of 8191 bytes?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Instead of using the rounded up "len", why not try this:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>>>>>>>> +               pages = write->wr_buflen >> PAGE_SHIFT;
>>>>>>>>>>>> 
>>>>>>>>>>>> You are right. It needs to be that. Otherwise 8191 fails the same way.
>>>>>>>>>>>> 
>>>>>>>>>>>>> And be sure to test with TCP as well.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why.
>>>>>>>>>>> 
>>>>>>>>>>> OK.
>>>>>>>>>>> 
>>>>>>>>>>> Remember that a Read chunk's length does not have to be
>>>>>>>>>>> rounded up. Maybe the transport needs to round up the
>>>>>>>>>>> length of the unmarshaled data content on behalf of the
>>>>>>>>>>> NFSv4 write decoder.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The problem of simply taking write->wr_buflen was that len before that
>>>>>>>>>> could have been adjusted by avail value in then non-RDAM mounts.
>>>>>>>>>> 
>>>>>>>>>> Again, I'm not sure if this is the right fix. But this one works for
>>>>>>>>>> both non-RDMA and RDMA mounts.
>>>>>>>>>> 
>>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>>>> index 2c61c6b..3178997 100644
>>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>>>> @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct
>>>>>>>>>> nfsd4_compoundargs *argp, struct xdr_ne
>>>>>>>>>> 
>>>>>>>>>>           len -= avail;
>>>>>>>>>> 
>>>>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>>>>> +               if (!avail)
>>>>>>>>>> +                       pages = write->wr_buflen >> PAGE_SHIFT;
>>>>>>>>>> +               else
>>>>>>>>>> +                       pages = len >> PAGE_SHIFT;
>>>>>>>>>>           argp->pagelist += pages;
>>>>>>>>>>           argp->pagelen -= pages * PAGE_SIZE;
>>>>>>>>>>           len -= pages * PAGE_SIZE;
>>>>>>>>> 
>>>>>>>>> This code has been around since 2012. Has it always been
>>>>>>>>> broken for RDMA? I suspect the root problem is in the
>>>>>>>>> transport, not here in the decoder. Can you bisect to
>>>>>>>>> find out where the problem started to occur?
>>>>>>>> 
>>>>>>>> To see if I could go back to 2012, I first want to see if I could go
>>>>>>>> back to before mlx5 driver was added which was in 3.11-rc1. But I'm
>>>>>>>> having issues trying to boot into a kernel based on 3.10 or 3.9. I'm
>>>>>>>> trying to verify if my CX-5 card would even work with mlx4 driver
>>>>>>>> which I'm having doubts it would.
>>>>>>> 
>>>>>>> The CX-5 works only with mlx5.
>>>>>>> 
>>>>>>> 
>>>>>>>> These old kernels fail to boot in
>>>>>>>> XFS saying that it's a wrong version. Some googling makes me think
>>>>>>>> that since my XFS partition was created with a newer kernel it's not
>>>>>>>> compatible with an older kernel...
>>>>>>> 
>>>>>>> That's correct, the XFS filesystem has new features like
>>>>>>> metadata checksum that don't work on older kernels.
>>>>>>> 
>>>>>>> To start with, you could try some very late 3.x or early
>>>>>>> 4.x kernels. My intuition is that you will find an svcrdma
>>>>>>> change that is more recent than 2012 that is causing the
>>>>>>> underlying issue.
>>>>>> 
>>>>>> It works in 4.8 and does not work in 4.9. It's this commit that breaks it:
>>>>>> 
>>>>>> commit cc9d83408b52265ddab2874cf19d1611da4ca7ee
>>>>>> 
>>>>>> svcrdma: Server-side support for rpcrdma_connect_private
>>>>>> 
>>>>>> Prepare to receive an RDMA-CM private message when handling a new
>>>>>> connection attempt, and send a similar message as part of connection
>>>>>> acceptance.
>>>>>> 
>>>>>> Both sides can communicate their various implementation limits.
>>>>>> Implementations that don't support this sideband protocol ignore it.
>>>>>> 
>>>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>>> Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
>>>>>> Signed-off-by: J. Bruce Fields <bfields@xxxxxxxxxx>
>>>>>> 
>>>>>> So far I don't understand how can causes problems...
>>>>> 
>>>>> More likely it's:
>>>>> 
>>>>> 
>>>>> commit eae03e2ac80a3476f0652cb0ee451d7b06d30564
>>>>> Author:     Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>> AuthorDate: Fri Aug 18 11:12:27 2017 -0400
>>>>> Commit:     J. Bruce Fields <bfields@xxxxxxxxxx>
>>>>> CommitDate: Tue Sep 5 15:15:29 2017 -0400
>>>>> 
>>>>>  nfsd: Incoming xdr_bufs may have content in tail buffer
>>>>> 
>>>>> 
>>>>> 
>>>>> This is a 3 op compound. The third op is GETATTR.
>>>>> The reply shows that the WRITE op succeeds, which
>>>>> is why no error is reported to the application.
>>>>> The GETATTR fails with OP_ILLEGAL, and the client
>>>>> I guess is trained to ignore that kind of failure.
>>>>> 
>>>>> After nfsd4_decode_write pushes into the "next" page
>>>>> in the page list, the next time READ_BUF is called
>>>>> (to handle the next op in the compound), it is
>>>>> supposed to see that the XDR page list is now
>>>>> exhausted, and switch to the XDR tail buf.
>>>>> 
>>>>> TCP doesn't use the tail buf; the GETATTR in that
>>>>> case is at the end of the page list, following the
>>>>> WRITE payload. That would explain why this issue
>>>>> cannot be reproduced there.
>>>>> 
>>>>> RDMA does use the tail buf in this case. For some
>>>>> reason the logic added in eae03e2ac8 is not working,
>>>>> and the server is looking at uninitialized memory
>>>>> to find that third op.
>>>>> 
>>>>> I run cthon04 all the time, so I wonder why I never
>>>>> hit this case.
>>>> 
>>>> While this commit seems like a better explanation,  I can 100% hit the
>>>> issue running 4.8-rc6 based kernel (this is from your nfsd-for-4.9
>>>> branch and rollback back to the cc9d83408b) . What matters is the
>>>> assignment of the "private_data" in the conn_param. If I set it to
>>>> NULL and 0 for the length. The problem goes away.
>>> 
>>> There's no obvious connection between the NFSv4 WRITE
>>> misbehavior and CM private data being present.
>> 
>> You observed that this logic in nfsd4_decode_write is
>> making argp->pagelen go negative:
>> 
>> 1304                 pages = len >> PAGE_SHIFT;
>> 1305                 argp->pagelist += pages;
>> 1306                 argp->pagelen -= pages * PAGE_SIZE;
>> 1307                 len -= pages * PAGE_SIZE;
>> 
>> This is because the client does not insert XDR round-up
>> in the Read chunk. Since pagelen is not zero, the code
>> in read_buf that is looking for that case (added by
>> eae03e2ac80a) fails.
>> 
>> I suspect what's going on here is:
>> 
>> When you disable the private_data logic, the server is
>> not "receiving" the CM private data, and thus it assumes
>> the client is not Linux. In response, the server does not
>> provide CM private data in it's "connection accept"
>> response.
>> 
>> The client then assumes the _server_ is not Linux, and
>> disables an optimization that removes the XDR round-up
>> from Read chunks.
>> 
>> So the client adds that round-up padding to the end of
>> the Read chunks, and that prevents the bug. pagelen
>> never goes below zero.
>> 
>> Basically nfsd4_decode_write() assumes that the XDR
>> round-up for the WRITE payload is always going to be
>> present. The round-up is optional for Read chunks,
>> according to RFC 8166.
>> 
>> 193bcb7b37 adds that round-up in the tail iovec. It looks
>> like that won't be adequate in the case where the payload
>> approaches a page boundary.
>> 
>> Does this seem plausible?
> 
> Yes your explanation regarding the presence of CM private data that
> toggles what client sends is exactly what I'm seeing.
> 
> I don't understand the last paragraph. But it sounds like server needs
> to check that if client is linux then it needs to do a round up (and
> also actually add padding which would have been done on the client)
> because of the nfsd4_decode_write()'s assumptions.

The server-side transport can _always_ round-up the length
of a Read chunk. If the client included the round-up padding,
the length won't change. If the client didn't include padding,
then the length will be bumped up properly. The XDR decoders
should then be happy either way. I hope.

What I'm trying now is having svc_rdma_build_normal_read_chunk
round-up rq_arg->pagelen instead of rounding up the length of
the XDR tail iovec based on the length of the Read chunk.
Seems obvious, and I'm wondering why I didn't do it this way
in the first place. The comment there suggests the NFSv3 decoder
might be finicky about the XDR lengths.

>>>> Are you hitting the issue now?
>>> 
>>> I can reproduce it 100% of the time using your dd example
>>> from a Linux NFS client.
>>> 
>>> I've also done a 8009 byte write on TCP, which forces the
>>> same page boundary edge condition in nfsd4_decode_write,
>>> and the failure does not occur.
>>> 
>>> 
>>>> Are you 100% certain that you haven't
>>>> hit it before. Do you always inspect the network traces? Without
>>>> inspecting the network trace you wouldn't have found it. cthon never
>>>> fails so from the user perspective nothing failed.
>>> 
>>> Fair enough.
>>> 
>>> 
>>>> I will check eae03e2ac80a3476f0652cb0ee451d7b06d30564 commit to see
>>>> what different do I see in my testing.
>>> 
>>> My theory is it's that one or this related one:
>>> 
>>> commit 193bcb7b3719a315814dce40fc8dcd1af786ea64
>>> Author:     Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> AuthorDate: Fri Aug 18 11:12:35 2017 -0400
>>> Commit:     J. Bruce Fields <bfields@xxxxxxxxxx>
>>> CommitDate: Tue Sep 5 15:15:29 2017 -0400
>>> 
>>>   svcrdma: Populate tail iovec when receiving
>>> 
>>> 
>>>>>>>>>>>>>> If this looks OK, I can send a patch.
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Chuck Lever
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Chuck Lever
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>> 
>>>>>>> --
>>>>>>> Chuck Lever
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> 
>>>>> --
>>>>> Chuck Lever
>>> 
>>> --
>>> Chuck Lever
>>> 
>>> 
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --
>> Chuck Lever

--
Chuck Lever

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html