On Mon, Dec 23, 2024 2:25 AM Joe Klein <joe.klein812@xxxxxxxxx> wrote:
> We have tested this patchset and had a lot of problems, even without using
> the ODP option in SoftRoCE. I don't know if others have done similar tests.
> If we have to merge this patchset into upstream, is it possible to add a
> kernel option to enable/disable this patchset?

Hi Joe,

Can you clarify the test and the problems you observed?
I wonder if you tried the test with the latest tree WITHOUT my patches.

As far as I know, something is wrong with upstream right now.
It does not complete the rdma-core testcases, and a segmentation fault
occurs in the middle of the full test run, which did not happen before
October 2024. Here are the details of the issue:

===== test log =====
ubuntu@rdma-dev:~$ sudo rdma link add rxe_ens3 type rxe netdev ens3
ubuntu@rdma-dev:~$ cd rdma-core
ubuntu@rdma-dev:~/rdma-core$ uname -r
6.13.0-rc1+
ubuntu@rdma-dev:~/rdma-core$ pwd
/home/ubuntu/rdma-core
ubuntu@rdma-dev:~/rdma-core$ ./build/bin/run_tests.py
..........ss.../usr/lib/python3.12/_weakrefset.py:39: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmpe7nsitov' mode='rb+' closefd=True>
  def _remove(item, selfref=ref(self)):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/lib/python3.12/_weakrefset.py:39: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmpid85cbou' mode='rb+' closefd=True>
  def _remove(item, selfref=ref(self)):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.......ssssss/usr/lib/python3.12/contextlib.py:141: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmp9pgb7zo8' mode='rb+' closefd=True>
  def __exit__(self, typ, value, traceback):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
ssss..............ssssssssssssssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss........ssssssssssssssssss/usr/lib/python3.12/_weakrefset.py:39: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmpate1loci' mode='rb+' closefd=True>
  def _remove(item, selfref=ref(self)):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
Traceback (most recent call last):
  File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
ssssTraceback (most recent call last):
  File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
Traceback (most recent call last):
  File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Traceback (most recent call last):
  File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
Traceback (most recent call last):
  File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Traceback (most recent call last):
  File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
Traceback (most recent call last):
  File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
s....ssSegmentation fault (core dumped)
===========

=====dmesg=====
[ 147.464243] rxe_ens3: qp#21 make_send_cqe: non-flush error status = 4
[ 147.473843] rxe_ens3: qp#23 make_send_cqe: non-flush error status = 10
[ 147.484540] rxe_ens3: qp#25 make_send_cqe: non-flush error status = 9
[ 147.494541] rxe_ens3: qp#27 make_send_cqe: non-flush error status = 10
[ 147.524080] rxe_ens3: rxe_create_cq: returned err = -22
[ 147.574197] rxe_ens3: cq#26 rxe_resize_cq: returned err = -22
[ 147.605719] rxe_ens3: rxe_create_cq: returned err = -95
[ 147.606454] rxe_ens3: rxe_create_cq: returned err = -22
[ 148.803131] rxe_ens3: qp#51 make_send_cqe: non-flush error status = 10
[ 148.831587] rxe_ens3: qp#57 make_send_cqe: non-flush error status = 10
[ 148.841627] rxe_ens3: qp#59 make_send_cqe: non-flush error status = 10
[ 148.851719] rxe_ens3: qp#61 make_send_cqe: non-flush error status = 10
[ 149.104223] python3[1702]: segfault at d0 ip 00007ff95ced16c7 sp 00007fff5e775de0 error 4 in libibverbs.so.1.14.56.0[e6c7,7ff95ceca000+14000] likely on CPU 2 (core 0, socket 2)
[ 149.104235] Code: 00 00 c1 e0 04 8b bf 08 01 00 00 48 8d 53 20 48 c7 43 28 00 00 00 00 83 c0 18 c7 43 34 00 00 00 00 be 01 1b 18 c0 66 89 43 20 <49> 8b 80 d0 00 00 00 8b 40 10 89 43 30 31 c0 e8 05 99 ff ff 41 89
=====

If you encounter any problems that clearly come from my ODP patches,
please let me know what symptoms you are seeing. I would also appreciate
any help you can offer in fixing the upstream issue.

Thanks,
Daisuke