On Sun, 2019-03-31 at 20:44 -0400, Laurence Oberman wrote:
> Those who have been following my trials and tribulations with SRP and
> block-mq panics (see "Re: Panic when rebooting target server testing
> srp on 5.0.0-rc2") know I was going to run the same test with qla2xxx
> and F/C.
>
> Anyway, rebooting the target server (LIO) that was causing the
> still-undiagnosed block-mq race when SRP is the client causes issues
> with 5.1-rc2 as well.
>
> The issue here is different. I was seeing a total lockup with no
> console messages; to get the lockup message below I had to enable
> lock debugging.
>
> Anyway, Hannes, how have you folks not seen these issues at SUSE with
> 5.1+ testing? Here I caught two different problems that are now latent
> in 5.1-x (maybe earlier too). This is a generic array reboot test, and
> sadly it is a common scenario for our customers when they have fabric
> or array issues.
>
> Kernel 5.1.0-rc2+ on an x86_64
>
> localhost login: [ 301.752492] BUG: spinlock cpu recursion on CPU#38, kworker/38:0/204
> [ 301.782364] lock: 0xffff90ddb2e43430, .magic: dead4ead, .owner: kworker/38:1/271, .owner_cpu: 38
> [ 301.825496] CPU: 38 PID: 204 Comm: kworker/38:0 Kdump: loaded Not tainted 5.1.0-rc2+ #1
> [ 301.863052] Hardware name: HP ProLiant ML150 Gen9/ProLiant ML150 Gen9, BIOS P95 05/21/2018
> [ 301.903614] Workqueue: qla2xxx_wq qla24xx_delete_sess_fn [qla2xxx]
> [ 301.933561] Call Trace:
> [ 301.945950]  dump_stack+0x5a/0x73
> [ 301.962080]  do_raw_spin_lock+0x83/0xa0
> [ 301.980287]  _raw_spin_lock_irqsave+0x66/0x80
> [ 302.001726]  ? qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
> [ 302.028111]  qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
> [ 302.052864]  process_one_work+0x215/0x4c0
> [ 302.071940]  ? process_one_work+0x18c/0x4c0
> [ 302.092228]  worker_thread+0x46/0x3e0
> [ 302.110313]  kthread+0xfb/0x130
> [ 302.125274]  ? process_one_work+0x4c0/0x4c0
> [ 302.146054]  ? kthread_bind+0x10/0x10
> [ 302.163789]  ret_from_fork+0x35/0x40
>
> Just an FYI: with only 100 LUNs and 4 paths I cannot boot the host
> without adding watchdog_thresh=60 to the kernel command line. I get a
> hard lockup during LUN discovery, so that issue is also out there.
>
> So far 5.x has been problematic in regression testing.
>
> Regards
> Laurence

I chatted with Himanshu about this and he will be sending me a test
patch; he thinks he knows what is going on here. I will report back
once I have tested it.

Note, to reiterate: this is not the block-mq issue I uncovered with SRP
testing. The investigation of that one is still ongoing.

Thanks
Laurence
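P.S. For anyone who has not chased this particular BUG string before:
it comes from the CONFIG_DEBUG_SPINLOCK sanity checks that
do_raw_spin_lock() runs before spinning. Roughly, as a sketch
paraphrased from kernel/locking/spinlock_debug.c (see the real source
for the exact code):

    static inline void debug_spin_lock_before(raw_spinlock_t *lock)
    {
            /* "bad magic": the lock was never initialized, or its
             * memory has been freed or overwritten */
            SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");

            /* "recursion": the current task already holds this lock */
            SPIN_BUG_ON(lock->owner == current, lock, "recursion");

            /* "cpu recursion": a task on *this* CPU already holds the
             * lock, so spinning here can never succeed */
            SPIN_BUG_ON(lock->owner_cpu == raw_smp_processor_id(),
                        lock, "cpu recursion");
    }

The "cpu recursion" case is the one in the trace above: the recorded
owner (kworker/38:1/271, .owner_cpu 38) is on the same CPU as the task
trying to take the lock (kworker/38:0, CPU 38). That usually means
either a genuine same-CPU deadlock in the qla24xx_delete_sess_fn()
path or stale owner data left behind by a lock that was freed or
reinitialized while held; the valid .magic (dead4ead) at least shows
the lock structure itself is still initialized.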