Re: Spurious instability with NFSoRDMA under moderate load

> On Oct 29, 2021, at 2:17 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
> 
> On 29.10.2021 17:14, Chuck Lever III wrote:
>> Hi Timo-
>>> On Oct 29, 2021, at 7:47 AM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>>> 
>>> On 20/08/2021 17:12, Chuck Lever III wrote:
>>>> OK, I think the issue with this reproducer was resolved
>>>> completely with 6820bf77864d.
>>>> I went back and reviewed the traces from when the client got
>>>> stuck after a long uptime. This looks very different from
>>>> what we're seeing with 6820bf77864d. It involves CB_PATH_DOWN
>>>> and BIND_CONN_TO_SESSION, which is a different scenario. Long
>>>> story short, I don't think we're getting any more value by
>>>> leaving 6820bf77864d reverted.
>>>> Can you re-apply that commit on your server, and then when
>>>> the client hangs again, please capture with:
>>>> # trace-cmd record -e nfsd -e sunrpc -e rpcrdma
>>>> I'd like to see why the client's BIND_CONN_TO_SESSION fails
>>>> to repair the backchannel session.
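>>>> (For reference: after the hang reproduces, the capture can be
>>>> stopped with Ctrl-C and the resulting trace.dat turned into a
>>>> readable log with something like
>>>> # trace-cmd report > nfs-trace.txt
>>>> where the output file name is only illustrative.)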
>>> 
>>> Happened again today, after a long time of no issues.
>>> Still on 5.12.19, since the system hasn't had a chance for a bigger maintenance window yet.
>>> 
>>> Attached are traces from both client and server, while the client is trying to do the usual xfs_io copy_range.
>>> The system also has a bunch of other users and nodes working on it at the moment, so there's a good chance of unrelated noise in the traces.
>>> 
>>> The affected client is 10.110.10.251.
>>> Other clients are working just fine; it's only this one client that's affected.
>>> 
>>> There was also quite a bit of heavy I/O work going on on the cluster, which I think coincided with the last couple of times this happened as well.
>>> <nfstrace.tar.xz>
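>>> For reference, the reproducer is essentially xfs_io driving
>>> copy_range on the NFS mount, along the lines of the following
>>> (both paths are placeholders):
>>> # xfs_io -fc "copy_range /mnt/nfs/src.dat" /mnt/nfs/dst.dat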
>> Thanks for the report. We believe this issue has been addressed in v5.15-rc:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=02579b2ff8b0becfb51d85a975908ac4ab15fba8
> 
> 5.15 is a little too bleeding-edge for my comfort to roll out on a production system.
> But the patch applies cleanly on top of 5.12.19, so I pulled it and am now running the resulting kernel on all clients and the server(s).

Yup, that's the best we can do for now. Thanks for testing!
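
For anyone else on an older stable kernel: one way to pick up just that
commit is to apply cgit's patch view on top of your tree, along these
lines (run inside the kernel source checkout):

# curl -s 'https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=02579b2ff8b0becfb51d85a975908ac4ab15fba8' | git am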


> Hopefully we won't see this happen again, thanks!
> 
> 
> Timo

--
Chuck Lever
