Hi,

I'm currently running tests with a Connect-IB board under today's
OFED-3.12 snapshot:

- compat: 407b205 compat: Add kthread support for kernels <= 2.6.35
- compat-rdma: b2bda9f Fixed nfsrdma backport patch name
- linux-3.12: f9e9918 Prepare Linux tree for OFED 3.12

The board is:

# mstflint -d mlx5_0 q
-W- Running quick query - Skipping full image integrity checks.
Image type:      FS3
FW Version:      10.10.2000
Device ID:       4113
Chip Revision:   0
Description:     UID                GuidsNumber  Step
Base GUID1:      f4521403000bf580        8        1
Base GUID2:      f4521403000bf588        8        1
Base MAC1:       0000f452140bf580        8        1
Base MAC2:       0000f452140bf588        8        1
Image VSD:
Device VSD:
PSID:            MT_1220110019

When I try to restart the openibd service:

# service openibd restart

here is what I get:

INFO: task rmmod:22654 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rmmod         D 0000000000000001     0 22654  22653 0x00000000
 ffff88106f1b7b58 0000000000000082 0000000000000000 ffffffff81055f76
 ffff88106f1b7ae8 ffff88107b0bb500 ffff88106f1b7ae8 ffffffff810522fd
 ffff88107a8e9af8 ffff88106f1b7fd8 000000000000fb88 ffff88107a8e9af8
Call Trace:
 [<ffffffff81055f76>] ? enqueue_task+0x66/0x80
 [<ffffffff810522fd>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff8150e555>] schedule_timeout+0x215/0x2e0
 [<ffffffff81096c96>] ? autoremove_wake_function+0x16/0x40
 [<ffffffff81051419>] ? __wake_up_common+0x59/0x90
 [<ffffffff8150e1d3>] wait_for_common+0x123/0x180
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffff810912b1>] ? __queue_work+0x41/0x50
 [<ffffffff8150e2ed>] wait_for_completion+0x1d/0x20
 [<ffffffffa05a3d18>] mlx5_cmd_exec+0x2d8/0x790 [mlx5_core]
 [<ffffffffa05a583e>] mlx5_cmd_teardown_hca+0x5e/0x90 [mlx5_core]
 [<ffffffffa05a10f9>] mlx5_dev_cleanup+0x69/0xe0 [mlx5_core]
 [<ffffffffa05da3c9>] remove_one+0x59/0x70 [mlx5_ib]
 [<ffffffff8129a047>] pci_device_remove+0x37/0x70
 [<ffffffff8135e8bf>] __device_release_driver+0x6f/0xe0
 [<ffffffff8135e9f8>] driver_detach+0xc8/0xd0
 [<ffffffff8135d7fe>] bus_remove_driver+0x8e/0x110
 [<ffffffff8135f1e2>] driver_unregister+0x62/0xa0
 [<ffffffff8129a354>] pci_unregister_driver+0x44/0xb0
 [<ffffffffa05e7349>] __exit_compat+0x15/0xbe [mlx5_ib]
 [<ffffffff810b4814>] sys_delete_module+0x194/0x260
 [<ffffffff8151311e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
0000:01:00.0:wait_func:618:(pid 22654): TEARDOWN_HCA(0x103) timeout. Will cause a leak of a command resource
0000:01:00.0:mlx5_reclaim_startup_pages:419:(pid 22654): FW did not return all pages. giving up...
0000:01:00.0:wait_func:618:(pid 22654): MLX5_CMD_OP_DISABLE_HCA(0x105) timeout. Will cause a leak of a command resource
Compat-rdma backport release: 435a602-c
Backport based on linux-3.12 385a572
compat.git: linux-3.12
mlx5_ib: Mellanox Connect-IB Infiniband driver v1.0 (June 2013)
mlx5_ib 0000:01:00.0: firmware version: 10.10.2000
0000:01:00.0:wait_func:618:(pid 25331): MLX5_CMD_OP_ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource
mlx5_ib 0000:01:00.0: enable hca failed
mlx5_ib: probe of 0000:01:00.0 failed with error -110

It looks like the driver fails to tear down the HCA on module unload
(TEARDOWN_HCA and then DISABLE_HCA time out). The subsequent reload
then fails too: ENABLE_HCA times out and the probe returns -110. The
device is left unusable until the machine is rebooted.

This behaviour is fully reproducible, although the restart _may_
succeed once or twice right after boot.

Is this a firmware problem or a driver problem?

Thanks,

Sébastien.