On Mon, Jun 3, 2024 at 5:59 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>
>
> > On Jun 2, 2024, at 2:14 PM, Zhu Yanjun <zyjzyj2000@xxxxxxxxx> wrote:
> >
> > On Sun, Jun 2, 2024 at 5:40 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> >>
> >>
> >>> On Apr 30, 2024, at 10:45 AM, Zhu Yanjun <zyjzyj2000@xxxxxxxxx> wrote:
> >>>
> >>> On 30.04.24 16:13, Chuck Lever III wrote:
> >>>> It is possible to add rxe as a second option in kdevops,
> >>>> but siw has worked for our purposes so far, and the NFS
> >>>> test matrix is already enormous.
> >>>
> >>> Thanks. If rxe can be as a second option in kdevops, I will make tests with kdevops to check rxe work well or not in the future kernel version.
> >>
> >> As per our recent discussion, I have added rxe as a second
> >> software RDMA option in kdevops. Proof of concept:
> >
> > Thanks a lot. I am very glad to know that rxe is treated as a second
> > software RDMA option in kdeops.
> > And I also checked the commit related with this feature. It is very
> > complicated and huge.
>
> I split this into four smaller patches, HTH.
>
>
> > I hope rxe can work well in kdeops.
> > So I can also use kdeops to verify rxe and rdma subsystems. Thanks a
> > lot your efforts.
> >
> >>
> >> https://github.com/chucklever/kdevops/tree/add-rxe-support
> >>
> >> But basic rping testing is not working (with 6.10-rc1 kernels)
> >> in this set-up. It's missing something...
> >
> > Just now I made tests with the latest rdma-core (rping is included in
> > rdma-core) and 6.10-rc1 kernels. rping can work well.
> >
> > Normally rping works as a basic tool to verify if rxe works well or
> > not. If rping can not work well, normally I will do the followings:
> > 1. rping -s -a 127.0.0.1
> >    rping -c -a 127.0.0.1 -C 3 -d -v
> >    This will verify whether rxe is configured correctly or not.
>
> I don't have rxe set up on loopback, so I substituted the host's
> configured Ethernet IP.
>
> The tests works on the NFS server, but the rping client hangs
> on the NFS client (both running v6.10-rc1).
>
> I rebooted in to the Fedora 39 stock kernel, and the rping tests
> pass.
>
> However, when I try to run fstests with NFS/RDMA using rxe, the
> client kernel reports a soft CPU lock-up, and top shows this:
>
> 115 root 20 0 0 0 0 R 99.3 0.0 1:03.50 kworker/u8:5+rxe_wq

rxe_wq was introduced in commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue
support for rxe tasks"). This commit was merged at
v6.4-rc2-1-g9b4b7c1f9f54, and the Fedora 39 stock kernel is 6.5.
So maybe some commit between 6.5 and 6.10 introduced this problem.

>
> So I think this is enough to show that the Ansible parts of this
> change are working as expected. I can push this to kdevops now
> if there are no objections, and someone (maybe you, maybe me) can
> sort out the rxe specific issues later.

Thanks. Once I can reproduce this problem on my local host, I will be
glad to delve into it. It may take me a long time, since I do not
have a good host on which to deploy kdevops.

To be honest, perhaps "git bisect" can find the commit that introduced
this problem. If you can find the commit, we can fix this problem very
quickly ^_^

Thanks,
Zhu Yanjun

>
>
> > 2. ping -c 3 server_ip on client host.
> >    This will verify whether the client host can connect to the server
> >    host or not.
> > 3. rping -s -a server_ip
> >    rping -c -a server_ip -C 3 -d -v
> >    1) shutdown firewall
> >    2) tcpdump -ni xxxx to capture udp packets
> > Normally the above steps can find out the errors in rxe client/server.
> > Hope the above can help to find out the errors.
> >
> > Zhu Yanjun
> >
> >>
> >> --
> >> Chuck Lever
> >>
> >>
>
> --
> Chuck Lever
>
>
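
The bisection suggested in the reply could be driven roughly as follows. This is only a sketch under assumptions the thread does not confirm: that v6.5 can be treated as a known-good kernel and v6.10-rc1 as a known-bad one, that the hanging rping client is the failure signal, and that the commands run inside a mainline Linux git tree. The function names `start_bisect` and `rping_verdict`, the `SERVER_IP` variable, and the 60-second timeout are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names, endpoints, and timeout are assumptions,
# not taken from the thread.

# Mark the assumed endpoints; run this inside a mainline Linux git tree.
start_bisect() {
    git bisect start
    git bisect bad v6.10-rc1   # rping client was reported to hang here
    git bisect good v6.5       # assumed good, based on the Fedora 39 result
}

# Map the exit status of the rping client run to a bisect verdict.
# GNU timeout exits with 124 when it has to kill the command, so a hung
# rping client counts as "bad" like any other nonzero status.
rping_verdict() {
    if [ "$1" -eq 0 ]; then
        echo good
    else
        echo bad
    fi
}
```

After building and booting each kernel that git bisect checks out, and with `rping -s -a "$SERVER_IP"` running on the server, the client side might run `timeout 60 rping -c -a "$SERVER_IP" -C 3 -d -v`, save `$?`, and pass the output of `rping_verdict` to `git bisect` in the kernel tree. `git bisect reset` returns the tree to its starting point once the first bad commit has been reported.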