Re: problem on nfsd doing RDMA write

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jan 31, 2018 at 4:13 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>
>> On Jan 31, 2018, at 2:51 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>
>>
>>
>>> On Jan 31, 2018, at 1:52 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>
>>> On Wed, Jan 31, 2018 at 1:29 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>
>>>>
>>>>> On Jan 30, 2018, at 12:02 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>
>>>>> On Mon, Jan 29, 2018 at 12:24 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>
>>>>>>> On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Bruce/Chuck,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is a problem with nfsd doing a WRITE of size 4093,4094,4095. To
>>>>>>>>>>>>> reproduce, mount with RDMA and do a simple dd "dd if=/dev/zero
>>>>>>>>>>>>> of=/mnt/testfile bs=4093 count=1". What happens is that nfsd fails to
>>>>>>>>>>>>> parse GETATTR after the WRITE in the compound. It fails the operation
>>>>>>>>>>>>> with ERR_OP_ILLEGAL.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem happens for the values where XDR round up ends up rounding
>>>>>>>>>>>>> up to the page size. I don't know if my fix is the appropriate way to
>>>>>>>>>>>>> fix this but with it I don't get the error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>> index 2c61c6b..a8489c3 100644
>>>>>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compounda
>>>>>>>>>>>>>
>>>>>>>>>>>>>  len = XDR_QUADLEN(write->wr_buflen) << 2;
>>>>>>>>>>>>>  if (len >= avail) {
>>>>>>>>>>>>> -               int pages;
>>>>>>>>>>>>> +               int pages = 0;
>>>>>>>>>>>>>
>>>>>>>>>>>>>          len -= avail;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>>>>>>>> +               if (write->wr_buflen >= PAGE_SIZE)
>>>>>>>>>>>>> +                       pages = len >> PAGE_SHIFT;
>>>>>>>>>>>>>          argp->pagelist += pages;
>>>>>>>>>>>>>          argp->pagelen -= pages * PAGE_SIZE;
>>>>>>>>>>>>>          len -= pages * PAGE_SIZE;
>>>>>>>>>>>>>
>>>>>>>>>>>>> So the problem is the using "len" instead of "write->wr_buflen" leads
>>>>>>>>>>>>> for the values 4093,4094,4095 that are rounded up to 4096, it makes
>>>>>>>>>>>>> pages=1 and the argp->pagelen ends up being a negative value leading
>>>>>>>>>>>>> to problems of parsing the GETATTR.
>>>>>>>>>>>>
>>>>>>>>>>>> Would this also be a problem near any page boundary, say, a
>>>>>>>>>>>> write length of 8191 bytes?
>>>>>>>>>>>>
>>>>>>>>>>>> Instead of using the rounded up "len", why not try this:
>>>>>>>>>>>>
>>>>>>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>>>>>>> +               pages = write->wr_buflen >> PAGE_SHIFT;
>>>>>>>>>>>
>>>>>>>>>>> You are right. It needs to be that. Otherwise 8191 fails the same way.
>>>>>>>>>>>
>>>>>>>>>>>> And be sure to test with TCP as well.
>>>>>>>>>>>
>>>>>>>>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why.
>>>>>>>>>>
>>>>>>>>>> OK.
>>>>>>>>>>
>>>>>>>>>> Remember that a Read chunk's length does not have to be
>>>>>>>>>> rounded up. Maybe the transport needs to round up the
>>>>>>>>>> length of the unmarshaled data content on behalf of the
>>>>>>>>>> NFSv4 write decoder.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The problem of simply taking write->wr_buflen was that len before that
>>>>>>>>> could have been adjusted by avail value in then non-RDAM mounts.
>>>>>>>>>
>>>>>>>>> Again, I'm not sure if this is the right fix. But this one works for
>>>>>>>>> both non-RDMA and RDMA mounts.
>>>>>>>>>
>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>>> index 2c61c6b..3178997 100644
>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>>> @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct
>>>>>>>>> nfsd4_compoundargs *argp, struct xdr_ne
>>>>>>>>>
>>>>>>>>>            len -= avail;
>>>>>>>>>
>>>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>>>> +               if (!avail)
>>>>>>>>> +                       pages = write->wr_buflen >> PAGE_SHIFT;
>>>>>>>>> +               else
>>>>>>>>> +                       pages = len >> PAGE_SHIFT;
>>>>>>>>>            argp->pagelist += pages;
>>>>>>>>>            argp->pagelen -= pages * PAGE_SIZE;
>>>>>>>>>            len -= pages * PAGE_SIZE;
>>>>>>>>
>>>>>>>> This code has been around since 2012. Has it always been
>>>>>>>> broken for RDMA? I suspect the root problem is in the
>>>>>>>> transport, not here in the decoder. Can you bisect to
>>>>>>>> find out where the problem started to occur?
>>>>>>>
>>>>>>> To see if I could go back to 2012, I first want to see if I could go
>>>>>>> back to before mlx5 driver was added which was in 3.11-rc1. But I'm
>>>>>>> having issues trying to boot into a kernel based on 3.10 or 3.9. I'm
>>>>>>> trying to verify if my CX-5 card would even work with mlx4 driver
>>>>>>> which I'm having doubts it would.
>>>>>>
>>>>>> The CX-5 works only with mlx5.
>>>>>>
>>>>>>
>>>>>>> These old kernels fail to boot in
>>>>>>> XFS saying that it's a wrong version. Some googling makes me think
>>>>>>> that since my XFS partition was created with a newer kernel it's not
>>>>>>> compatible with an older kernel...
>>>>>>
>>>>>> That's correct, the XFS filesystem has new features like
>>>>>> metadata checksum that don't work on older kernels.
>>>>>>
>>>>>> To start with, you could try some very late 3.x or early
>>>>>> 4.x kernels. My intuition is that you will find an svcrdma
>>>>>> change that is more recent than 2012 that is causing the
>>>>>> underlying issue.
>>>>>
>>>>> It works in 4.8 and does not work in 4.9. It's this commit that breaks it:
>>>>>
>>>>> commit cc9d83408b52265ddab2874cf19d1611da4ca7ee
>>>>>
>>>>> svcrdma: Server-side support for rpcrdma_connect_private
>>>>>
>>>>> Prepare to receive an RDMA-CM private message when handling a new
>>>>> connection attempt, and send a similar message as part of connection
>>>>> acceptance.
>>>>>
>>>>> Both sides can communicate their various implementation limits.
>>>>> Implementations that don't support this sideband protocol ignore it.
>>>>>
>>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>>> Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
>>>>> Signed-off-by: J. Bruce Fields <bfields@xxxxxxxxxx>
>>>>>
>>>>> So far I don't understand how can causes problems...
>>>>
>>>> More likely it's:
>>>>
>>>>
>>>> commit eae03e2ac80a3476f0652cb0ee451d7b06d30564
>>>> Author:     Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> AuthorDate: Fri Aug 18 11:12:27 2017 -0400
>>>> Commit:     J. Bruce Fields <bfields@xxxxxxxxxx>
>>>> CommitDate: Tue Sep 5 15:15:29 2017 -0400
>>>>
>>>>   nfsd: Incoming xdr_bufs may have content in tail buffer
>>>>
>>>>
>>>>
>>>> This is a 3 op compound. The third op is GETATTR.
>>>> The reply shows that the WRITE op succeeds, which
>>>> is why no error is reported to the application.
>>>> The GETATTR fails with OP_ILLEGAL, and the client
>>>> I guess is trained to ignore that kind of failure.
>>>>
>>>> After nfsd4_decode_write pushes into the "next" page
>>>> in the page list, the next time READ_BUF is called
>>>> (to handle the next op in the compound), it is
>>>> supposed to see that the XDR page list is now
>>>> exhausted, and switch to the XDR tail buf.
>>>>
>>>> TCP doesn't use the tail buf; the GETATTR in that
>>>> case is at the end of the page list, following the
>>>> WRITE payload. That would explain why this issue
>>>> cannot be reproduced there.
>>>>
>>>> RDMA does use the tail buf in this case. For some
>>>> reason the logic added in eae03e2ac8 is not working,
>>>> and the server is looking at uninitialized memory
>>>> to find that third op.
>>>>
>>>> I run cthon04 all the time, so I wonder why I never
>>>> hit this case.
>>>
>>> While this commit seems like a better explanation,  I can 100% hit the
>>> issue running 4.8-rc6 based kernel (this is from your nfsd-for-4.9
>>> branch and rollback back to the cc9d83408b) . What matters is the
>>> assignment of the "private_data" in the conn_param. If I set it to
>>> NULL and 0 for the length. The problem goes away.
>>
>> There's no obvious connection between the NFSv4 WRITE
>> misbehavior and CM private data being present.
>
> You observed that this logic in nfsd4_decode_write is
> making argp->pagelen go negative:
>
> 1304                 pages = len >> PAGE_SHIFT;
> 1305                 argp->pagelist += pages;
> 1306                 argp->pagelen -= pages * PAGE_SIZE;
> 1307                 len -= pages * PAGE_SIZE;
>
> This is because the client does not insert XDR round-up
> in the Read chunk. Since pagelen is not zero, the code
> in read_buf that is looking for that case (added by
> eae03e2ac80a) fails.
>
> I suspect what's going on here is:
>
> When you disable the private_data logic, the server is
> not "receiving" the CM private data, and thus it assumes
> the client is not Linux. In response, the server does not
> provide CM private data in it's "connection accept"
> response.
>
> The client then assumes the _server_ is not Linux, and
> disables an optimization that removes the XDR round-up
> from Read chunks.
>
> So the client adds that round-up padding to the end of
> the Read chunks, and that prevents the bug. pagelen
> never goes below zero.
>
> Basically nfsd4_decode_write() assumes that the XDR
> round-up for the WRITE payload is always going to be
> present. The round-up is optional for Read chunks,
> according to RFC 8166.
>
> 193bcb7b37 adds that round-up in the tail iovec. It looks
> like that won't be adequate in the case where the payload
> approaches a page boundary.
>
> Does this seem plausible?

Yes your explanation regarding the presence of CM private data that
toggles what client sends is exactly what I'm seeing.

I don't understand the last paragraph. But it sounds like server needs
to check that if client is linux then it needs to do a round up (and
also actually add padding which would have been done on the client)
because of the nfsd4_decode_write()'s assumptions.

>
>
>>> Are you hitting the issue now?
>>
>> I can reproduce it 100% of the time using your dd example
>> from a Linux NFS client.
>>
>> I've also done a 8009 byte write on TCP, which forces the
>> same page boundary edge condition in nfsd4_decode_write,
>> and the failure does not occur.
>>
>>
>>> Are you 100% certain that you haven't
>>> hit it before. Do you always inspect the network traces? Without
>>> inspecting the network trace you wouldn't have found it. cthon never
>>> fails so from the user perspective nothing failed.
>>
>> Fair enough.
>>
>>
>>> I will check eae03e2ac80a3476f0652cb0ee451d7b06d30564 commit to see
>>> what different do I see in my testing.
>>
>> My theory is it's that one or this related one:
>>
>> commit 193bcb7b3719a315814dce40fc8dcd1af786ea64
>> Author:     Chuck Lever <chuck.lever@xxxxxxxxxx>
>> AuthorDate: Fri Aug 18 11:12:35 2017 -0400
>> Commit:     J. Bruce Fields <bfields@xxxxxxxxxx>
>> CommitDate: Tue Sep 5 15:15:29 2017 -0400
>>
>>    svcrdma: Populate tail iovec when receiving
>>
>>
>>>>>>>>>>>>> If this looks OK, I can send a patch.
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Chuck Lever
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>> --
>>>>>>>> Chuck Lever
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>> --
>>>>>> Chuck Lever
>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>> --
>>>> Chuck Lever
>>
>> --
>> Chuck Lever
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> Chuck Lever
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux