Re: problem on nfsd doing RDMA write

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jan 29, 2018 at 12:24 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>
>> On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>
>> On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>
>>>
>>>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>
>>>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Hi Bruce/Chuck,
>>>>>>>>
>>>>>>>> There is a problem with nfsd doing a WRITE of size 4093,4094,4095. To
>>>>>>>> reproduce, mount with RDMA and do a simple dd "dd if=/dev/zero
>>>>>>>> of=/mnt/testfile bs=4093 count=1". What happens is that nfsd fails to
>>>>>>>> parse GETATTR after the WRITE in the compound. It fails the operation
>>>>>>>> with ERR_OP_ILLEGAL.
>>>>>>>>
>>>>>>>> The problem happens for the values where XDR round up ends up rounding
>>>>>>>> up to the page size. I don't know if my fix is the appropriate way to
>>>>>>>> fix this but with it I don't get the error:
>>>>>>>>
>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>> index 2c61c6b..a8489c3 100644
>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compounda
>>>>>>>>
>>>>>>>>     len = XDR_QUADLEN(write->wr_buflen) << 2;
>>>>>>>>     if (len >= avail) {
>>>>>>>> -               int pages;
>>>>>>>> +               int pages = 0;
>>>>>>>>
>>>>>>>>             len -= avail;
>>>>>>>>
>>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>>> +               if (write->wr_buflen >= PAGE_SIZE)
>>>>>>>> +                       pages = len >> PAGE_SHIFT;
>>>>>>>>             argp->pagelist += pages;
>>>>>>>>             argp->pagelen -= pages * PAGE_SIZE;
>>>>>>>>             len -= pages * PAGE_SIZE;
>>>>>>>>
>>>>>>>> So the problem is the using "len" instead of "write->wr_buflen" leads
>>>>>>>> for the values 4093,4094,4095 that are rounded up to 4096, it makes
>>>>>>>> pages=1 and the argp->pagelen ends up being a negative value leading
>>>>>>>> to problems of parsing the GETATTR.
>>>>>>>
>>>>>>> Would this also be a problem near any page boundary, say, a
>>>>>>> write length of 8191 bytes?
>>>>>>>
>>>>>>> Instead of using the rounded up "len", why not try this:
>>>>>>>
>>>>>>> -               pages = len >> PAGE_SHIFT;
>>>>>>> +               pages = write->wr_buflen >> PAGE_SHIFT;
>>>>>>
>>>>>> You are right. It needs to be that. Otherwise 8191 fails the same way.
>>>>>>
>>>>>>> And be sure to test with TCP as well.
>>>>>>
>>>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why.
>>>>>
>>>>> OK.
>>>>>
>>>>> Remember that a Read chunk's length does not have to be
>>>>> rounded up. Maybe the transport needs to round up the
>>>>> length of the unmarshaled data content on behalf of the
>>>>> NFSv4 write decoder.
>>>>>
>>>>
>>>> The problem of simply taking write->wr_buflen was that len before that
>>>> could have been adjusted by avail value in then non-RDAM mounts.
>>>>
>>>> Again, I'm not sure if this is the right fix. But this one works for
>>>> both non-RDMA and RDMA mounts.
>>>>
>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>> index 2c61c6b..3178997 100644
>>>> --- a/fs/nfsd/nfs4xdr.c
>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>> @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct
>>>> nfsd4_compoundargs *argp, struct xdr_ne
>>>>
>>>>               len -= avail;
>>>>
>>>> -               pages = len >> PAGE_SHIFT;
>>>> +               if (!avail)
>>>> +                       pages = write->wr_buflen >> PAGE_SHIFT;
>>>> +               else
>>>> +                       pages = len >> PAGE_SHIFT;
>>>>               argp->pagelist += pages;
>>>>               argp->pagelen -= pages * PAGE_SIZE;
>>>>               len -= pages * PAGE_SIZE;
>>>
>>> This code has been around since 2012. Has it always been
>>> broken for RDMA? I suspect the root problem is in the
>>> transport, not here in the decoder. Can you bisect to
>>> find out where the problem started to occur?
>>
>> To see if I could go back to 2012, I first want to see if I could go
>> back to before mlx5 driver was added which was in 3.11-rc1. But I'm
>> having issues trying to boot into a kernel based on 3.10 or 3.9. I'm
>> trying to verify if my CX-5 card would even work with mlx4 driver
>> which I'm having doubts it would.
>
> The CX-5 works only with mlx5.
>
>
>> These old kernels fail to boot in
>> XFS saying that it's a wrong version. Some googling makes me think
>> that since my XFS partition was created with a newer kernel it's not
>> compatible with an older kernel...
>
> That's correct, the XFS filesystem has new features like
> metadata checksum that don't work on older kernels.
>
> To start with, you could try some very late 3.x or early
> 4.x kernels. My intuition is that you will find an svcrdma
> change that is more recent than 2012 that is causing the
> underlying issue.

It works in 4.8 and does not work in 4.9. It's this commit that breaks it:

commit cc9d83408b52265ddab2874cf19d1611da4ca7ee

svcrdma: Server-side support for rpcrdma_connect_private

Prepare to receive an RDMA-CM private message when handling a new
connection attempt, and send a similar message as part of connection
acceptance.

Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.

Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
Reviewed-by: Sagi Grimberg <sagi@xxxxxxxxxxx>
Signed-off-by: J. Bruce Fields <bfields@xxxxxxxxxx>

So far I don't understand how can causes problems...

>>>>>>>> If this looks OK, I can send a patch.
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>> --
>>>>>>> Chuck Lever
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>> --
>>>>> Chuck Lever
>>>>>
>>>>>
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> --
>>> Chuck Lever
>>>
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> Chuck Lever
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux