Re: Linux kernel v4.15-rc4 and rdma_rxe

> The call trace in my previous e-mail was caused by a bug in the SRP initiator
> driver. I will post the patches that fix that bug after the holidays. But even
> after having fixed that bug I noticed a remarkable behavior difference between
> the mlx4_ib and rxe drivers. ib_srpt channels get closed properly when using
> the mlx4 driver but not when using the rxe driver. The test I ran is as follows:
> * Clone, build and install the kernel from branch block-scsi-for-next of
>   repository https://github.com/bvanassche/linux. Make sure that the SRP
>   initiator and target drivers are enabled in the kernel config. I plan to post
>   all patches that are in that repository and that are not yet upstream after
>   the holidays.
> * Clone https://github.com/bvanassche/srp-test.
> * Edit /etc/multipath.conf as indicated in the README.md document in the
>   srp-test repository.
> * Start multipathd.
> * If I run the following command on a system with a ConnectX-3 adapter:
>     srp-test/run_tests -d -r 10 -t 02-mq
>   then the test finishes after about 11 seconds.
>   But if I run the following command on a system without any RDMA adapters:
>     srp-test/run_tests -c -d -r 10 -t 02-mq
>   then the following output appears:
>
> Unloaded the ib_srpt kernel module
> Unloaded the rdma_rxe kernel module
> SoftRoCE network interfaces: rxe0
> Zero-initializing /dev/ram0 ... done
> Zero-initializing /dev/ram1 ... done
> Zero-initializing /dev/sdb ... done
> Configured SRP target driver
> Running test /home/bart/software/infiniband/srp-test/tests/02-mq ...
> Test file I/O on top of multipath concurrently with logout and login (0 min; mq)
> Using /dev/disk/by-id/dm-uuid-mpath-3600140572616d6469736b31000000000 -> ../../dm-2
> Unmounting /root/mnt1 from /dev/mapper/mpathb
> SRP LUN /sys/class/scsi_device/5:0:0:0 / sdc: removing /dev/dm-2: done
> SRP LUN /sys/class/scsi_device/5:0:0:1 / sde: removing /dev/dm-1: done
> SRP LUN /sys/class/scsi_device/5:0:0:2 / sdd: removing /dev/dm-0: done
> Unloaded the ib_srp kernel module
> Test /home/bart/software/infiniband/srp-test/tests/02-mq succeeded
> 1 tests succeeded and 0 tests failed
>
> [ test script hangs ]
>
> While the test script hangs the following appears in the system log (please note
> that the ib_srpt:srpt_zerolength_write_done: ib_srpt wc->status message is missing):
>
> ib_srpt:srpt_close_ch: ib_srpt 192.168.122.76-32: queued zerolength write
> [ ... ]
> ib_srpt srpt_disconnect_ch_sync(192.168.122.76-18 state 3): still waiting ...
> [ ... ]
> INFO: task rmdir:3215 blocked for more than 120 seconds.
>       Not tainted 4.15.0-rc4-dbg+ #2
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> rmdir           D13912  3215   3208 0x00000000
> Call Trace:
>  __schedule+0x2ad/0xb90
>  schedule+0x31/0x90
>  schedule_timeout+0x1fb/0x590
>  wait_for_completion_timeout+0x11a/0x180
>  srpt_close_session+0xba/0x180 [ib_srpt]
>  target_shutdown_sessions+0xc8/0xd0 [target_core_mod]
>  core_tpg_del_initiator_node_acl+0x7c/0x130 [target_core_mod]
>  target_fabric_nacl_base_release+0x20/0x30 [target_core_mod]
>  config_item_release+0x5a/0xc0 [configfs]
>  config_item_put+0x21/0x24 [configfs]
>  configfs_rmdir+0x1ef/0x2f0 [configfs]
>  vfs_rmdir+0x6e/0x150
>  do_rmdir+0x168/0x1c0
>  SyS_rmdir+0x11/0x20
>

Hi Bart,

Thanks for the detailed answer.
1. I will do my best to add more tests to RXE regression. However, it
may take a while.
2. Differences in behavior don't necessarily mean that at least one
implementation is wrong. From what you describe it is hard to understand
what you think is wrong with RXE. If I understand it right, the script
tried to delete a directory that ib_srpt owns (configfs or such?) and
this operation waits for a completion. If this is right, do you know
who is expected to call complete()? It sounds unlikely that rxe is the
one.
3. Despite that, let's try this: when the script hangs, can you run
    echo t > /proc/sysrq-trigger
and see if you find something in dmesg that can explain the hang? Maybe
a trace that rdma_rxe is a part of?
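
Since the task dump from sysrq-t can be very long, a small filter may help to
find the interesting traces. The following is only a rough sketch, not a
tested tool: the function names task_blocks and blocks_mentioning are made up
for illustration, and it assumes the dump has the same layout as the trace
quoted above (a task header line, then "Call Trace:", then indented frames,
with module names in square brackets), optionally prefixed by dmesg timestamps.

```python
import re

# Optional dmesg timestamp prefix such as "[ 1234.567890] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_ts(line):
    """Drop a trailing newline and a leading dmesg timestamp, if present."""
    return TIMESTAMP.sub("", line.rstrip("\n"))


def task_blocks(lines):
    """Split a task dump into (header, frames) pairs.

    The header is assumed to be the line right before "Call Trace:";
    the frames are the indented lines that follow it.
    """
    blocks = []
    prev = None
    current = None
    for raw in lines:
        line = strip_ts(raw)
        if line.strip() == "Call Trace:":
            current = (prev or "<unknown task>", [])
            blocks.append(current)
        elif current is not None and line.startswith(" "):
            current[1].append(line.strip())
        else:
            # Any non-indented line ends the current trace.
            current = None
        prev = line
    return blocks


def blocks_mentioning(blocks, module):
    """Keep only the tasks whose stack has a frame tagged with [module]."""
    tag = "[%s]" % module
    return [(h, f) for h, f in blocks if any(tag in frame for frame in f)]
```

For example, after saving the dmesg output to a file (dump.txt is just a
placeholder name), blocks_mentioning(task_blocks(open("dump.txt")), "rdma_rxe")
would list the tasks that have an rdma_rxe function on their stack, if any.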

thanks
Moni