Re: [RFC PATCH 0/4] NFS: Fix another 'check_flush_dependency' splat

> On Jun 3, 2024, at 12:54 PM, Zhu Yanjun <zyjzyj2000@xxxxxxxxx> wrote:
> 
> On Mon, Jun 3, 2024 at 5:59 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>> 
>> 
>> 
>>> On Jun 2, 2024, at 2:14 PM, Zhu Yanjun <zyjzyj2000@xxxxxxxxx> wrote:
>>> 
>>> On Sun, Jun 2, 2024 at 5:40 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>> 
>>>> 
>>>>> On Apr 30, 2024, at 10:45 AM, Zhu Yanjun <zyjzyj2000@xxxxxxxxx> wrote:
>>>>> 
>>>>> On 30.04.24 16:13, Chuck Lever III wrote:
>>>>>> It is possible to add rxe as a second option in kdevops,
>>>>>> but siw has worked for our purposes so far, and the NFS
>>>>>> test matrix is already enormous.
>>>>> 
>>>>> Thanks. If rxe can be added as a second option in kdevops, I will
>>>>> use kdevops to test whether rxe works well in future kernel versions.
>>>> 
>>>> As per our recent discussion, I have added rxe as a second
>>>> software RDMA option in kdevops. Proof of concept:
>>> 
>>> Thanks a lot. I am very glad to know that rxe is now a second
>>> software RDMA option in kdevops.
>>> I also checked the commit related to this feature. It is very
>>> large and complicated.
>> 
>> I split this into four smaller patches, HTH.
>> 
>> 
>>> I hope rxe can work well in kdevops, so that I can also use
>>> kdevops to verify the rxe and RDMA subsystems. Thanks a lot for
>>> your efforts.
>>> 
>>>> 
>>>> https://github.com/chucklever/kdevops/tree/add-rxe-support
>>>> 
>>>> But basic rping testing is not working (with 6.10-rc1 kernels)
>>>> in this set-up. It's missing something...
>>> 
>>> Just now I ran tests with the latest rdma-core (rping is included in
>>> rdma-core) and 6.10-rc1 kernels. rping works well.
>>> 
>>> Normally rping serves as a basic tool to verify whether rxe works
>>> or not. If rping does not work, I normally do the following:
>>> 1. rping -s -a 127.0.0.1
>>>   rping -c -a 127.0.0.1 -C 3 -d -v
>>>   This will verify whether rxe is configured correctly or not.
>> 
>> I don't have rxe set up on loopback, so I substituted the host's
>> configured Ethernet IP.
>> 
>> The tests work on the NFS server, but the rping client hangs
>> on the NFS client (both running v6.10-rc1).
>> 
>> I rebooted into the Fedora 39 stock kernel, and the rping tests
>> pass.
>> 
>> However, when I try to run fstests with NFS/RDMA using rxe, the
>> client kernel reports a soft CPU lock-up, and top shows this:
>> 
>>    115 root      20   0       0      0      0 R  99.3   0.0   1:03.50 kworker/u8:5+rxe_wq
> 
> rxe_wq was introduced in commit 9b4b7c1f9f54 ("RDMA/rxe: Add
> workqueue support for rxe tasks").
> That commit was merged into the kernel at v6.4-rc2-1-g9b4b7c1f9f54.
> 
> The Fedora 39 stock kernel is kernel 6.5, so perhaps some commit
> between 6.5 and 6.10 introduced this problem.

I couldn't get 6.10-rc1 working at all. This failure occurred
with the stock Fedora 39 kernel and fstests with NFS v4.2 on
RDMA.


>> So I think this is enough to show that the Ansible parts of this
>> change are working as expected. I can push this to kdevops now
>> if there are no objections, and someone (maybe you, maybe me) can
>> sort out the rxe specific issues later.
> 
> Thanks. Once I can reproduce this problem on my local host, I will be
> glad to dig into it. It may take me a long time, though,
> since I do not have a good host on which to deploy kdevops.

kdevops works on laptops too. The limiting factor seems to be
memory for libvirt guests. Only two guests are needed for this
test.


> To be honest, perhaps "git bisect" can find the commit that introduced
> this problem. If you can find the commit, we can fix this problem very
> quickly ^_^

Since this is the first time I've ever used rxe, I don't have a
"good" commit to start from.


> Thanks,
> Zhu Yanjun
> 
>> 
>> 
>>> 2. ping -c 3 server_ip on the client host.
>>>   This will verify whether the client host can connect to the server
>>>   host or not.
>>> 3. rping -s -a server_ip
>>>   rping -c -a server_ip -C 3 -d -v
>>>   1) shut down the firewall
>>>   2) tcpdump -ni xxxx to capture UDP packets
>>> Normally the above steps will reveal the errors in the rxe client/server.
>>> I hope the above helps to find the errors.
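
For reference, the client-side checks in steps 2 and 3 above could be
collected into a small script. This is only a sketch under stated
assumptions: the function names, the DRY_RUN switch, and the interface
argument are illustrative and do not come from this thread or from
kdevops. The rping and ping options are the ones quoted above. The
capture filter assumes rxe's RoCEv2 encapsulation, which uses UDP
destination port 4791. Shutting down the firewall is left out because
the command depends on the distribution.

```shell
#!/bin/sh
# Sketch of the client-side rxe checks (steps 2 and 3 above).
# Assumptions (not from the thread): function names, the DRY_RUN
# switch, and the interface argument are illustrative only.

# With DRY_RUN=1 (the default) each command is printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: rxe_client_check SERVER_IP
# The server side is expected to be running: rping -s -a SERVER_IP
rxe_client_check() {
    server_ip=$1
    # Step 2: plain IP reachability from the client to the server.
    run ping -c 3 "$server_ip" || return 1
    # Step 3: RDMA ping-pong against the rping server.
    run rping -c -a "$server_ip" -C 3 -d -v || return 1
}

# Usage: rxe_capture IFACE
# Capture only RoCEv2 traffic (UDP destination port 4791).
rxe_capture() {
    run tcpdump -ni "$1" udp port 4791
}
```

With DRY_RUN left at its default, calling rxe_client_check with the
server's IP only prints the commands it would run. Setting DRY_RUN=0
before the call executes them, stopping at the first step that fails.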
>>> 
>>> Zhu Yanjun
>>> 
>>>> 
>>>> --
>>>> Chuck Lever
>>>> 
>>>> 
>> 
>> --
>> Chuck Lever


--
Chuck Lever





