[BUG] ethernet:mellanox:mlx5: Oops in health_recover get_nic_state(dev)

Stack frame:
[ 1744.418958] [<ffff00000328936c>] get_nic_state+0x24/0x40 [mlx5_core]
[ 1744.425273] [<ffff0000032899c0>] health_recover+0x28/0x80 [mlx5_core]
[ 1744.431496] [<ffff0000080e3280>] process_one_work+0x150/0x460
[ 1744.437218] [<ffff0000080e35e0>] worker_thread+0x50/0x4b8
[ 1744.442609] [<ffff0000080e9b98>] kthread+0xd8/0xf0
[ 1744.447377] [<ffff000008083330>] ret_from_fork+0x10/0x20

Summary:
This issue was seen on a QDF2400 system 30 minutes into a SPEC CPU2006 run. During the test, a recoverable PCIe error was reported, which produced the following log:
[ 1673.170969] pcieport 0002:00:00.0: aer_status: 0x00004000, aer_mask: 0x00400000
[ 1673.177961] pcieport 0002:00:00.0: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 1673.185832] pcieport 0002:00:00.0: aer_uncor_severity: 0x00462030
[ 1675.536391] mlx5_core 0002:01:00.0: assert_var[0] 0xffffffff
[ 1675.541093] mlx5_core 0002:01:00.0: assert_var[1] 0xffffffff
[ 1675.546750] mlx5_core 0002:01:00.0: assert_var[2] 0xffffffff
[ 1675.552377] mlx5_core 0002:01:00.0: assert_var[3] 0xffffffff
[ 1675.558040] mlx5_core 0002:01:00.0: assert_var[4] 0xffffffff
[ 1675.563661] mlx5_core 0002:01:00.0: assert_exit_ptr 0xffffffff
[ 1675.569488] mlx5_core 0002:01:00.0: assert_callra 0xffffffff
[ 1675.575120] mlx5_core 0002:01:00.0: fw_ver 15.4095.65535
[ 1675.580426] mlx5_core 0002:01:00.0: hw_id 0xffffffff
[ 1675.585363] mlx5_core 0002:01:00.0: irisc_index 255
[ 1675.590242] mlx5_core 0002:01:00.0: synd 0xff: unrecognized error
[ 1675.596301] mlx5_core 0002:01:00.0: ext_synd 0xffff
[ 1675.601209] mlx5_core 0002:01:00.0: mlx5_enter_error_state:120:(pid 7205): start
[ 1675.608613] mlx5_core 0002:01:00.0: mlx5_enter_error_state:127:(pid 7205): end
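
The values in this log can be decoded by hand; the sketch below does the arithmetic. It assumes that aer_status is the AER Uncorrectable Error Status register (the aer_uncor_severity line suggests this) and uses the bit positions from the PCIe AER register layout. The 4/12/16-bit split of fw_ver is only inferred from the printed numbers, not taken from the driver source, and the helper names are made up for illustration.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_bits(value):
    """Return the names of all known bits set in value, lowest bit first."""
    return [name for bit, name in sorted(AER_UNCOR_BITS.items())
            if value & (1 << bit)]


def is_fatal(status, severity):
    """An uncorrectable error is fatal only if its bit is set in severity too."""
    return bool(status & severity)


def split_fw_ver(raw):
    """ASSUMPTION: 4/12/16-bit split, inferred from the log, not from the driver."""
    return (raw >> 28) & 0xF, (raw >> 16) & 0xFFF, raw & 0xFFFF


if __name__ == "__main__":
    print("aer_status  :", decode_bits(0x00004000))
    print("aer_mask    :", decode_bits(0x00400000))
    print("severity    :", decode_bits(0x00462030))
    print("fatal       :", is_fatal(0x00004000, 0x00462030))
    print("fw_ver      :", "%d.%d.%d" % split_fw_ver(0xFFFFFFFF))
```

Under these assumptions, aer_status 0x00004000 is a Completion Timeout that is not marked fatal in the severity register, which matches "recoverable". A fw_ver of 15.4095.65535 is what an all-ones 32-bit read would print, like every other health field in the log, so the device was probably not answering reads at that point.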

After this log, the stack frame shown above appears, together with a page fault caused by an invalid dev pointer.

So the recovery work is queued and the timer is stopped. Somehow the workqueue is not cleared, and when the queued work finally runs, the dev pointer is invalid.

This issue is difficult to reproduce; it was seen only once in multiple runs on a specific device.

Thanks,
Sameer 
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.