Re: problem on nfsd doing RDMA write

Chuck Lever <chuck.lever@xxxxxxxxxx> · Wed, 31 Jan 2018 14:51:27 -0500

> On Jan 31, 2018, at 1:52 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> 
> On Wed, Jan 31, 2018 at 1:29 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>> 
>> 
>>> On Jan 30, 2018, at 12:02 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>> 
>>> On Mon, Jan 29, 2018 at 12:24 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>> 
>>>> 
>>>>> On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>> 
>>>>> On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Bruce/Chuck,
>>>>>>>>>>> 
>>>>>>>>>>> There is a problem with nfsd doing a WRITE of size 4093,4094,4095. To
>>>>>>>>>>> reproduce, mount with RDMA and do a simple dd "dd if=/dev/zero
>>>>>>>>>>> of=/mnt/testfile bs=4093 count=1". What happens is that nfsd fails to
>>>>>>>>>>> parse GETATTR after the WRITE in the compound. It fails the operation
>>>>>>>>>>> with ERR_OP_ILLEGAL.
>>>>>>>>>>> 
>>>>>>>>>>> The problem happens for the values where XDR round up ends up rounding
>>>>>>>>>>> up to the page size. I don't know if my fix is the appropriate way to
>>>>>>>>>>> fix this but with it I don't get the error:
>>>>>>>>>>> 
>>>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>> index 2c61c6b..a8489c3 100644
>>>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compounda
>>>>>>>>>>> 
>>>>>>>>>>>   len = XDR_QUADLEN(write->wr_buflen) << 2;
>>>>>>>>>>>   if (len >= avail) {
>>>>>>>>>>> -               int pages;
>>>>>>>>>>> +               int pages = 0;
>>>>>>>>>>> 
>>>>>>>>>>>           len -= avail;
>>>>>>>>>>> 
>>>>>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>>>>>> +               if (write->wr_buflen >= PAGE_SIZE)
>>>>>>>>>>> +                       pages = len >> PAGE_SHIFT;
>>>>>>>>>>>           argp->pagelist += pages;
>>>>>>>>>>>           argp->pagelen -= pages * PAGE_SIZE;
>>>>>>>>>>>           len -= pages * PAGE_SIZE;
>>>>>>>>>>> 
>>>>>>>>>>> So the problem is the using "len" instead of "write->wr_buflen" leads
>>>>>>>>>>> for the values 4093,4094,4095 that are rounded up to 4096, it makes
>>>>>>>>>>> pages=1 and the argp->pagelen ends up being a negative value leading
>>>>>>>>>>> to problems of parsing the GETATTR.
>>>>>>>>>> 
>>>>>>>>>> Would this also be a problem near any page boundary, say, a
>>>>>>>>>> write length of 8191 bytes?
>>>>>>>>>> 
>>>>>>>>>> Instead of using the rounded up "len", why not try this:
>>>>>>>>>> 
>>>>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>>>>> +               pages = write->wr_buflen >> PAGE_SHIFT;
>>>>>>>>> 
>>>>>>>>> You are right. It needs to be that. Otherwise 8191 fails the same way.
>>>>>>>>> 
>>>>>>>>>> And be sure to test with TCP as well.
>>>>>>>>> 
>>>>>>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why.
>>>>>>>> 
>>>>>>>> OK.
>>>>>>>> 
>>>>>>>> Remember that a Read chunk's length does not have to be
>>>>>>>> rounded up. Maybe the transport needs to round up the
>>>>>>>> length of the unmarshaled data content on behalf of the
>>>>>>>> NFSv4 write decoder.
>>>>>>>> 
>>>>>>> 
>>>>>>> The problem of simply taking write->wr_buflen was that len before that
>>>>>>> could have been adjusted by avail value in then non-RDAM mounts.
>>>>>>> 
>>>>>>> Again, I'm not sure if this is the right fix. But this one works for
>>>>>>> both non-RDMA and RDMA mounts.
>>>>>>> 
>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>> index 2c61c6b..3178997 100644
>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>> @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct
>>>>>>> nfsd4_compoundargs *argp, struct xdr_ne
>>>>>>> 
>>>>>>>             len -= avail;
>>>>>>> 
>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>> +               if (!avail)
>>>>>>> +                       pages = write->wr_buflen >> PAGE_SHIFT;
>>>>>>> +               else
>>>>>>> +                       pages = len >> PAGE_SHIFT;
>>>>>>>             argp->pagelist += pages;
>>>>>>>             argp->pagelen -= pages * PAGE_SIZE;
>>>>>>>             len -= pages * PAGE_SIZE;
>>>>>> 
>>>>>> This code has been around since 2012. Has it always been
>>>>>> broken for RDMA? I suspect the root problem is in the
>>>>>> transport, not here in the decoder. Can you bisect to
>>>>>> find out where the problem started to occur?
>>>>> 
>>>>> To see if I could go back to 2012, I first want to see if I could go
>>>>> back to before mlx5 driver was added which was in 3.11-rc1. But I'm
>>>>> having issues trying to boot into a kernel based on 3.10 or 3.9. I'm
>>>>> trying to verify if my CX-5 card would even work with mlx4 driver
>>>>> which I'm having doubts it would.
>>>> 
>>>> The CX-5 works only with mlx5.
>>>> 
>>>> 
>>>>> These old kernels fail to boot in
>>>>> XFS saying that it's a wrong version. Some googling makes me think
>>>>> that since my XFS partition was created with a newer kernel it's not
>>>>> compatible with an older kernel...
>>>> 
>>>> That's correct, the XFS filesystem has new features like
>>>> metadata checksum that don't work on older kernels.
>>>> 
>>>> To start with, you could try some very late 3.x or early
>>>> 4.x kernels. My intuition is that you will find an svcrdma
>>>> change that is more recent than 2012 that is causing the
>>>> underlying issue.
>>> 
>>> It works in 4.8 and does not work in 4.9. It's this commit that breaks it:
>>> 
>>> commit cc9d83408b52265ddab2874cf19d1611da4ca7ee
>>> 
>>> svcrdma: Server-side support for rpcrdma_connect_private
>>> 
>>> Prepare to receive an RDMA-CM private message when handling a new
>>> connection attempt, and send a similar message as part of connection
>>> acceptance.
>>> 
>>> Both sides can communicate their various implementation limits.
>>> Implementations that don't support this sideband protocol ignore it.
>>> 
>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>> Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
>>> Signed-off-by: J. Bruce Fields <bfields@xxxxxxxxxx>
>>> 
>>> So far I don't understand how can causes problems...
>> 
>> More likely it's:
>> 
>> 
>> commit eae03e2ac80a3476f0652cb0ee451d7b06d30564
>> Author:     Chuck Lever <chuck.lever@xxxxxxxxxx>
>> AuthorDate: Fri Aug 18 11:12:27 2017 -0400
>> Commit:     J. Bruce Fields <bfields@xxxxxxxxxx>
>> CommitDate: Tue Sep 5 15:15:29 2017 -0400
>> 
>>    nfsd: Incoming xdr_bufs may have content in tail buffer
>> 
>> 
>> 
>> This is a 3 op compound. The third op is GETATTR.
>> The reply shows that the WRITE op succeeds, which
>> is why no error is reported to the application.
>> The GETATTR fails with OP_ILLEGAL, and the client
>> I guess is trained to ignore that kind of failure.
>> 
>> After nfsd4_decode_write pushes into the "next" page
>> in the page list, the next time READ_BUF is called
>> (to handle the next op in the compound), it is
>> supposed to see that the XDR page list is now
>> exhausted, and switch to the XDR tail buf.
>> 
>> TCP doesn't use the tail buf; the GETATTR in that
>> case is at the end of the page list, following the
>> WRITE payload. That would explain why this issue
>> cannot be reproduced there.
>> 
>> RDMA does use the tail buf in this case. For some
>> reason the logic added in eae03e2ac8 is not working,
>> and the server is looking at uninitialized memory
>> to find that third op.
>> 
>> I run cthon04 all the time, so I wonder why I never
>> hit this case.
> 
> While this commit seems like a better explanation,  I can 100% hit the
> issue running 4.8-rc6 based kernel (this is from your nfsd-for-4.9
> branch and rollback back to the cc9d83408b) . What matters is the
> assignment of the "private_data" in the conn_param. If I set it to
> NULL and 0 for the length. The problem goes away.

There's no obvious connection between the NFSv4 WRITE
misbehavior and CM private data being present.

> Are you hitting the issue now?

I can reproduce it 100% of the time using your dd example
from a Linux NFS client.

I've also done a 8009 byte write on TCP, which forces the
same page boundary edge condition in nfsd4_decode_write,
and the failure does not occur.

> Are you 100% certain that you haven't
> hit it before. Do you always inspect the network traces? Without
> inspecting the network trace you wouldn't have found it. cthon never
> fails so from the user perspective nothing failed.

Fair enough.

> I will check eae03e2ac80a3476f0652cb0ee451d7b06d30564 commit to see
> what different do I see in my testing.

My theory is it's that one or this related one:

commit 193bcb7b3719a315814dce40fc8dcd1af786ea64
Author:     Chuck Lever <chuck.lever@xxxxxxxxxx>
AuthorDate: Fri Aug 18 11:12:35 2017 -0400
Commit:     J. Bruce Fields <bfields@xxxxxxxxxx>
CommitDate: Tue Sep 5 15:15:29 2017 -0400

    svcrdma: Populate tail iovec when receiving

>>>>>>>>>>> If this looks OK, I can send a patch.
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Chuck Lever
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Chuck Lever
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> 
>>>>>> --
>>>>>> Chuck Lever
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> --
>>>> Chuck Lever
>>>> 
>>>> 
>>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --
>> Chuck Lever

--
Chuck Lever

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html