On Sun, 2019-03-31 at 20:44 -0400, Laurence Oberman wrote:
> Those who have been following my trials and tribulations with SRP and
> block-mq panics (see "Re: Panic when rebooting target server testing
> srp on 5.0.0-rc2") know I was going to run the same test with qla2xxx
> and F/C.
>
> Anyway, rebooting the target server (LIO) that was causing the
> still-undiagnosed block-mq race when SRP is the client causes issues
> with 5.1-rc2 as well.
>
> The issue here is different. I was seeing a total lockup with no
> console messages; to get the lockup message below I had to enable
> lock debugging.
>
> Anyway, Hannes, how have you folks not seen these issues at SUSE with
> 5.1+ testing? Here I caught two different problems that are now latent
> in 5.1-x (maybe earlier too). This is a generic array reboot test, and
> sadly it is a common scenario for our customers when they have fabric
> or array issues.
>
> Kernel 5.1.0-rc2+ on an x86_64
>
> localhost login: [ 301.752492] BUG: spinlock cpu recursion on CPU#38, kworker/38:0/204
> [ 301.782364] lock: 0xffff90ddb2e43430, .magic: dead4ead, .owner: kworker/38:1/271, .owner_cpu: 38
> [ 301.825496] CPU: 38 PID: 204 Comm: kworker/38:0 Kdump: loaded Not tainted 5.1.0-rc2+ #1
> [ 301.863052] Hardware name: HP ProLiant ML150 Gen9/ProLiant ML150 Gen9, BIOS P95 05/21/2018
> [ 301.903614] Workqueue: qla2xxx_wq qla24xx_delete_sess_fn [qla2xxx]
> [ 301.933561] Call Trace:
> [ 301.945950]  dump_stack+0x5a/0x73
> [ 301.962080]  do_raw_spin_lock+0x83/0xa0
> [ 301.980287]  _raw_spin_lock_irqsave+0x66/0x80
> [ 302.001726]  ? qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
> [ 302.028111]  qla24xx_delete_sess_fn+0x34/0x90 [qla2xxx]
> [ 302.052864]  process_one_work+0x215/0x4c0
> [ 302.071940]  ? process_one_work+0x18c/0x4c0
> [ 302.092228]  worker_thread+0x46/0x3e0
> [ 302.110313]  kthread+0xfb/0x130
> [ 302.125274]  ? process_one_work+0x4c0/0x4c0
> [ 302.146054]  ? kthread_bind+0x10/0x10
> [ 302.163789]  ret_from_fork+0x35/0x40
>
> Just an FYI: with only 100 LUNs and 4 paths I cannot boot the host
> without adding watchdog_thresh=60 to the kernel command line. I get a
> hard lockup during LUN discovery, so that issue is also out there.
>
> So far 5.x has been problematic in regression testing.
>
> Regards
> Laurence

I chatted with Himanshu about this and he will be sending me a test
patch; he thinks he knows what is going on here. I will report back
once I have tested it.

Note, to reiterate: this is not the block-mq issue I uncovered with SRP
testing. The investigation of that one is still ongoing.

Thanks
Laurence
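P.S. For anyone who has not chased this particular BUG string before:
it comes from the CONFIG_DEBUG_SPINLOCK sanity checks that
do_raw_spin_lock() runs before spinning. Roughly, as a sketch
paraphrased from kernel/locking/spinlock_debug.c (see the real source
for the exact code):

    static inline void debug_spin_lock_before(raw_spinlock_t *lock)
    {
            /* "bad magic": the lock was never initialized, or its
             * memory has been freed or overwritten */
            SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");

            /* "recursion": the current task already holds this lock */
            SPIN_BUG_ON(lock->owner == current, lock, "recursion");

            /* "cpu recursion": a task on *this* CPU already holds the
             * lock, so spinning here can never succeed */
            SPIN_BUG_ON(lock->owner_cpu == raw_smp_processor_id(),
                        lock, "cpu recursion");
    }

The "cpu recursion" case is the one in the trace above: the recorded
owner (kworker/38:1/271, .owner_cpu 38) is on the same CPU as the task
trying to take the lock (kworker/38:0, CPU 38). That usually means
either a genuine same-CPU deadlock in the qla24xx_delete_sess_fn()
path or stale owner data left behind by a lock that was freed or
reinitialized while held; the valid .magic (dead4ead) at least shows
the lock structure itself is still initialized.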