On Fri, Aug 13, 2021 at 04:53:56PM -0500, Bob Pearson wrote:
> On 8/4/21 4:05 AM, Zhu Yanjun wrote:
> > On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> >>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@xxxxxxxxx> wrote:
> >>>>
> >>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Can you please help me to understand the RXE status in the upstream?
> >>>>>
> >>>>> Does we still have crashes/interop issues/e.t.c?
> >>>>
> >>>> I made some developments with the RXE in the upstream, from my usage
> >>>> with latest RXE,
> >>>> I found the following:
> >>>>
> >>>> 1. rdma-core can not work well with latest RDMA git;
> >>>
> >>> The latest RDMA git is
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> >>
> >> "Latest" is a relative term, what SHA did you test?
> >> Let's focus on fixing RXE before we will continue with new features.
> >
> > Thanks a lot. I agree with you.
> >
> > rdma-core:
> > 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull
> > request #1038 from selvintxavier/master
> > 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
> > 327d45e0 tests: Add missing MAC element to args list
> > 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
> > 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
> > be4d8abf bnxt_re/lib: add a function to initialize software queue
> >
> > kernel rdma:
> > 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD)
> > RDMA/qedr: Improve error logs for rdma_alloc_tid error return
> > 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
> > 991c4274dc17 RDMA/hfi1: Fix typo in comments
> > 8d7e415d5561 docs: Fix infiniband uverbs minor number
> > bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
> > that requests are valid
> > bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
> > e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
> > a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
> > hfi1_devdata->user_refcount
> >
> > with the above kernel and rdma-core, the following messages will appear.
> > "
> > [ 54.214608] rdma_rxe: loaded
> > [ 54.217089] infiniband rxe0: set active
> > [ 54.217101] infiniband rxe0: added enp0s8
> > [ 167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [ 167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
> > [ 169.074796] rdma_rxe: qp#27 moved to error state
> > [ 169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
> > [ 169.138889] rdma_rxe: qp#30 moved to error state
> > [ 169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
> > [ 169.160601] rdma_rxe: qp#31 moved to error state
> > [ 169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
> > [ 169.182170] rdma_rxe: qp#32 moved to error state
> > [ 169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
> > [ 169.667850] rdma_rxe: qp#39 moved to error state
> > [ 198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [ 198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
> > [ 200.332086] rdma_rxe: qp#58 moved to error state
> > [ 200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
> > [ 200.396514] rdma_rxe: qp#61 moved to error state
> > [ 200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
> > [ 200.417956] rdma_rxe: qp#62 moved to error state
> > [ 200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
> > [ 200.439654] rdma_rxe: qp#63 moved to error state
> > [ 200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
> > [ 200.933153] rdma_rxe: qp#70 moved to error state
> > [ 206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [ 206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
> > [ 208.360028] rdma_rxe: qp#89 moved to error state
> > [ 208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
> > [ 208.425675] rdma_rxe: qp#92 moved to error state
> > [ 208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
> > [ 208.447370] rdma_rxe: qp#93 moved to error state
> > [ 208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
> > [ 208.469550] rdma_rxe: qp#94 moved to error state
> > [ 208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
> > [ 208.956731] rdma_rxe: qp#100 moved to error state
> > [ 216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [ 216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
> > [ 218.363808] rdma_rxe: qp#119 moved to error state
> > [ 218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
> > [ 218.429513] rdma_rxe: qp#122 moved to error state
> > [ 218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
> > [ 218.451481] rdma_rxe: qp#123 moved to error state
> > [ 218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
> > [ 218.473908] rdma_rxe: qp#124 moved to error state
> > [ 218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
> > [ 218.963641] rdma_rxe: qp#130 moved to error state
> > [ 233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [ 233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [ 235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
> > [ 235.305319] rdma_rxe: qp#149 moved to error state
> > [ 235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
> > [ 235.368838] rdma_rxe: qp#152 moved to error state
> > [ 235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
> > [ 235.390192] rdma_rxe: qp#153 moved to error state
> > [ 235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
> > [ 235.411374] rdma_rxe: qp#154 moved to error state
> > [ 235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
> > [ 235.895828] rdma_rxe: qp#161 moved to error state
> > "
> > Not sure if they are problems.
> > IMO, we should make further investigations.
> >
> > Thanks
> > Zhu Yanjun
> >>
> >> Thanks
> >
>
> All of the messages are from the rxe driver caused by the python tests intentionally causing
> errors. Here is a test run with messages. No errors occurred. This is run on current rdma_core and
> for_next. Does not answer the question about rping. That needs more testing.
> (so ru is short for "./build/bin/run_tests.py --dev rxe_1")
>
> Bob
>
> rpearson:rdma-core$ sudo dmesg -C
> rpearson:rdma-core$ so ru
> .............sssssssss.............sssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss........sssssssssssssssssss....ssss........s...s.s..s..........ssssssssss..ss
> ----------------------------------------------------------------------
> Ran 199 tests in 0.418s
>
> OK (skipped=134)
> rpearson:rdma-core$ sudo dmesg
> [ 9396.038090] rdma_rxe: cqe(32768) > max_cqe(32767)
> [ 9396.042414] rdma_rxe: cqe(1) < current # elements in queue (6)
> [ 9396.056685] rdma_rxe: cqe(32768) > max_cqe(32767)
> [ 9396.273114] rdma_rxe: check_rkey: no MW matches rkey 0x1000256
> [ 9396.273120] rdma_rxe: qp#27 moved to error state
> [ 9396.283112] rdma_rxe: check_rkey: no MW matches rkey 0x10005be
> [ 9396.283116] rdma_rxe: qp#30 moved to error state
> [ 9396.286497] rdma_rxe: check_rkey: no MW matches rkey 0x100063d
> [ 9396.286501] rdma_rxe: qp#31 moved to error state
> [ 9396.289917] rdma_rxe: check_rkey: no MW matches rkey 0x10007a6
> [ 9396.289922] rdma_rxe: qp#32 moved to error state
> [ 9396.364850] rdma_rxe: check_rkey: no MR matches rkey 0x1868
> [ 9396.364854] rdma_rxe: qp#37 moved to error state

You shouldn't print these errors by default, they need be *_dbg() level,

Thanks
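
For anyone triaging dmesg output like the logs quoted in this thread, here is a minimal sketch in Python that groups rdma_rxe messages by kind, so a repeating pattern from an intentional-error test run is easy to tell apart from something new. The function name, the category labels, and the pattern table are hypothetical. They are derived only from the message text quoted in this thread and are not part of rdma-core or the kernel.

```python
import re
from collections import Counter

# Hypothetical categories, built only from the messages quoted in this thread.
PATTERNS = {
    "cqe_over_max": re.compile(r"rdma_rxe: cqe\(\d+\) > max_cqe\(\d+\)"),
    "cqe_under_count": re.compile(r"rdma_rxe: cqe\(\d+\) < current # elements in queue"),
    "no_mw_for_rkey": re.compile(r"rdma_rxe: check_rkey: no MW matches rkey"),
    "no_mr_for_rkey": re.compile(r"rdma_rxe: check_rkey: no MR matches rkey"),
    "qp_error_state": re.compile(r"rdma_rxe: qp#\d+ moved to error state"),
}


def summarize_rxe_messages(dmesg_text):
    """Count rdma_rxe lines per category; unmatched rdma_rxe lines go to 'other'."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        if "rdma_rxe:" not in line:
            continue  # ignore lines from other drivers
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[kind] += 1
                break
        else:
            counts["other"] += 1
    return counts


# A few lines copied from the test run quoted above.
SAMPLE = """\
[ 9396.038090] rdma_rxe: cqe(32768) > max_cqe(32767)
[ 9396.042414] rdma_rxe: cqe(1) < current # elements in queue (6)
[ 9396.056685] rdma_rxe: cqe(32768) > max_cqe(32767)
[ 9396.273114] rdma_rxe: check_rkey: no MW matches rkey 0x1000256
[ 9396.273120] rdma_rxe: qp#27 moved to error state
[ 9396.364850] rdma_rxe: check_rkey: no MR matches rkey 0x1868
[ 9396.364854] rdma_rxe: qp#37 moved to error state
"""

if __name__ == "__main__":
    for kind, n in sorted(summarize_rxe_messages(SAMPLE).items()):
        print(f"{kind}: {n}")
```

Feeding it the text from `sudo dmesg` after a test run gives one count per category. A non-zero "other" count would be the first place to look for messages that the tests are not known to provoke.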