On Tue, Mar 28, 2017 at 2:45 AM, Goel, Sameer <sgoel@xxxxxxxxxxxxxx> wrote:
> Stack frame:
> [ 1744.418958] [<ffff00000328936c>] get_nic_state+0x24/0x40 [mlx5_core]
> [ 1744.425273] [<ffff0000032899c0>] health_recover+0x28/0x80 [mlx5_core]
> [ 1744.431496] [<ffff0000080e3280>] process_one_work+0x150/0x460
> [ 1744.437218] [<ffff0000080e35e0>] worker_thread+0x50/0x4b8
> [ 1744.442609] [<ffff0000080e9b98>] kthread+0xd8/0xf0
> [ 1744.447377] [<ffff000008083330>] ret_from_fork+0x10/0x20
>
> Summary:
> This issue was seen on a QDF2400 system 30 mins into running speccpu 2006. During the test a recoverable PCIe error was seen that gave the following log:
> [ 1673.170969] pcieport 0002:00:00.0: aer_status: 0x00004000, aer_mask: 0x00400000
> [ 1673.177961] pcieport 0002:00:00.0: aer_layer=Transaction Layer, aer_agent=Requester ID
> [ 1673.185832] pcieport 0002:00:00.0: aer_uncor_severity: 0x00462030
> [ 1675.536391] mlx5_core 0002:01:00.0: assert_var[0] 0xffffffff
> [ 1675.541093] mlx5_core 0002:01:00.0: assert_var[1] 0xffffffff
> [ 1675.546750] mlx5_core 0002:01:00.0: assert_var[2] 0xffffffff
> [ 1675.552377] mlx5_core 0002:01:00.0: assert_var[3] 0xffffffff
> [ 1675.558040] mlx5_core 0002:01:00.0: assert_var[4] 0xffffffff
> [ 1675.563661] mlx5_core 0002:01:00.0: assert_exit_ptr 0xffffffff
> [ 1675.569488] mlx5_core 0002:01:00.0: assert_callra 0xffffffff
> [ 1675.575120] mlx5_core 0002:01:00.0: fw_ver 15.4095.65535
> [ 1675.580426] mlx5_core 0002:01:00.0: hw_id 0xffffffff
> [ 1675.585363] mlx5_core 0002:01:00.0: irisc_index 255
> [ 1675.590242] mlx5_core 0002:01:00.0: synd 0xff: unrecognized error
> [ 1675.596301] mlx5_core 0002:01:00.0: ext_synd 0xffff
> [ 1675.601209] mlx5_core 0002:01:00.0: mlx5_enter_error_state:120:(pid 7205): start
> [ 1675.608613] mlx5_core 0002:01:00.0: mlx5_enter_error_state:127:(pid 7205): end
>
> After the above log we see the stack frame shown above and a page fault due to an invalid dev pointer.
>
> So the recovery work is queued and the timer is stopped. Somehow the workqueue is not cleared, and when it runs the dev pointer is invalid.
>
> This issue was difficult to repro and was seen only once in multiple runs on a specific device.

Hi Sameer,

Thanks for the report. Adding more relevant people: Mohamad/Daniel.

Does the above ring a bell? Can you check?

Thanks,
Saeed.
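
A side note on the quoted health dump: every field reads as all ones (0xffffffff, irisc_index 255, synd 0xff, ext_synd 0xffff), which is what reads from a PCIe device typically return once the device is no longer reachable. The fw_ver string 15.4095.65535 fits the same picture. A minimal sketch, assuming fw_ver is printed from one 32-bit word split into 4-, 12- and 16-bit fields (this layout is inferred from the logged values, not quoted from the driver source, and `decode_fw_ver` is a hypothetical helper):

```python
# Assumption: the firmware version is packed into a single 32-bit word as
# major (4 bits) . minor (12 bits) . sub-minor (16 bits). This layout is
# inferred from the logged values, not taken from the driver source.
ALL_ONES = 0xFFFFFFFF


def decode_fw_ver(word: int) -> str:
    """Split a 32-bit word into the assumed 4/12/16-bit version fields."""
    major = (word >> 28) & 0xF
    minor = (word >> 16) & 0xFFF
    sub_minor = word & 0xFFFF
    return f"{major}.{minor}.{sub_minor}"


if __name__ == "__main__":
    # An all-ones read decodes to the version string seen in the log.
    print(decode_fw_ver(ALL_ONES))  # prints 15.4095.65535
```

If that assumption holds, the dump carries no real firmware state, so it says little about what the firmware was doing when the AER error was reported.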