Possible race in completion with SRP after abort with latest upstream kernel 4.17.0+

Laurence Oberman <loberman@xxxxxxxxxx> · Thu, 07 Jun 2018 09:23:40 -0400

Hello Bart

I have only seen this twice, but while testing the latest upstream
kernel with SRP I have hit two of these completion races.

4.17.0+

[49945.984133] sd 2:0:0:29: alua: transition timeout set to 60 seconds
[49945.984136] sd 2:0:0:29: alua: port group 00 state A non-preferred supports TOlUSNA
[49946.023273] sd 2:0:0:6: alua: port group 00 state A non-preferred supports TOlUSNA
[49946.052514] sd 2:0:0:5: alua: port group 00 state A non-preferred supports TOlUSNA
[49946.092895] sd 2:0:0:4: [sdl] Attached SCSI disk
[49946.093422] sd 2:0:0:6: alua: port group 00 state A non-preferred supports TOlUSNA
[49953.156158] scsi host2: SRP abort called         ***** Abort
[49953.187444] sd 2:0:0:5: [sdm] Attached SCSI disk
[49953.211545] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[49965.632850] PGD 0 P4D 0 
[49965.644974] Oops: 0002 [#1] SMP PTI
[49965.661765] CPU: 11 PID: 2949 Comm: kworker/u64:0 Kdump: loaded Tainted: G          I       4.17.0+ #1
[49965.711026] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[49965.742461] Workqueue: scsi_tmf_2 scmd_eh_abort_handler
[49965.770633] RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
[49965.795410] Code: 40 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 58 66 66 90 66 90 48 89 c3 fa 66 66 90 66 66 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 d7 26 92 ff eb f2
[49965.892623] RSP: 0018:ffffb75e4789fdc0 EFLAGS: 00010046
[49965.920553] RAX: 0000000000000000 RBX: 0000000000000286 RCX: 0000000000000018
[49965.954952] RDX: 0000000000000001 RSI: 000000000000000a RDI: 0000000000000008
[49965.995180] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000a
[49966.033257] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a
[49966.073219] R13: ffff8d51f9041380 R14: 0000000000000000 R15: ffff8d454df84d30
[49966.107885] FS:  0000000000000000(0000) GS:ffff8d52b3340000(0000) knlGS:0000000000000000
[49966.150490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[49966.177976] CR2: 0000000000000008 CR3: 000000107300a005 CR4: 00000000000206e0
[49966.216606] Call Trace:
[49966.228353]  complete+0x18/0x50
[49966.243410]  scsi_end_request+0x95/0x1e0
[49966.263891]  scsi_io_completion+0x1c1/0x680
[49966.286617]  process_one_work+0x171/0x370
[49966.305850]  worker_thread+0x49/0x3f0
[49966.323408]  kthread+0xf8/0x130
[49966.341046]  ? max_active_store+0x80/0x80
[49966.362901]  ? kthread_bind+0x10/0x10
[49966.382485]  ret_from_fork+0x35/0x40

This looks like a race in the completion path.
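
As a rough cross-check of the fault address, the sketch below models
struct completion and wait_queue_head_t in userspace and decodes the
Oops error code. The class and helper names are made up for
illustration, and the layout is an assumption (x86_64, no lock
debugging, so spinlock_t is a single 32-bit word); it is not read from
this dump.

```python
import ctypes

# Simplified userspace models of the kernel structures.
# ASSUMPTION: x86_64 layout with no lock debugging enabled.
class ListHead(ctypes.Structure):
    _fields_ = [("next", ctypes.c_uint64), ("prev", ctypes.c_uint64)]

class WaitQueueHead(ctypes.Structure):
    _fields_ = [("lock", ctypes.c_uint32), ("head", ListHead)]

class Completion(ctypes.Structure):
    _fields_ = [("done", ctypes.c_uint32), ("wait", WaitQueueHead)]

def lock_address(completion_ptr):
    """Address of wait.lock for a completion located at completion_ptr."""
    return completion_ptr + Completion.wait.offset + WaitQueueHead.lock.offset

def decode_page_fault_error(code):
    """Decode the low three bits of an x86 page fault error code."""
    return {
        "page_present": bool(code & 0x1),
        "write": bool(code & 0x2),
        "user_mode": bool(code & 0x4),
    }

# Lock address complete() would use if it were handed a NULL pointer.
print(hex(lock_address(0)))
# "Oops: 0002" from the log above.
print(decode_page_fault_error(0x0002))
```

If the model holds, a NULL completion pointer puts the lock at address
0x8, which is what CR2 and RDI show, and error code 0002 is a
kernel-mode write to a non-present page, consistent with the locked
cmpxchg at the faulting instruction. That would suggest complete() was
called with a NULL pointer, but it is only a hypothesis.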

I pulled the struct request off the stack:

struct request {
  q = 0xffff8d51f979b0c0, 
  mq_ctx = 0xffffd752442e0600, 
  cpu = -1, 
  cmd_flags = 0, 
  rq_flags = 139456, 
  internal_tag = -1, 
  __data_len = 0, 
  tag = 56, 
  __sector = 8191008, 
  bio = 0x0, 
  biotail = 0xffff8d46acd9e700, 
  queuelist = {
    next = 0xffff8d454df84c40, 
    prev = 0xffff8d454df84c40
  }, 

struct gendisk {
  major = 67, 
  first_minor = 64, 
  minors = 16, 
  disk_name = "sdba\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000ba", 
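
The disk name can be cross-checked against the major and first_minor
values in the dump. This is only a sketch of the legacy sd numbering
(16 minors per disk, 16 disks per major); the helper names are
hypothetical, and the mapping is assumed to hold only for the first
256 disks.

```python
# Block majors used by sd, in order: 8, 65-71, 128-135.
SD_MAJORS = [8] + list(range(65, 72)) + list(range(128, 136))
MINORS_PER_DISK = 16
DISKS_PER_MAJOR = 256 // MINORS_PER_DISK

def sd_index(major, first_minor):
    """Disk index for a (major, first_minor) pair under legacy numbering."""
    return SD_MAJORS.index(major) * DISKS_PER_MAJOR + first_minor // MINORS_PER_DISK

def sd_name(index):
    """Name for a disk index: 0 -> sda, 25 -> sdz, 26 -> sdaa, ..."""
    letters = ""
    n = index
    while True:
        letters = chr(ord("a") + n % 26) + letters
        n = n // 26 - 1
        if n < 0:
            break
    return "sd" + letters

# Values from the gendisk dump above: major = 67, first_minor = 64.
print(sd_name(sd_index(67, 64)))
```

Under these assumptions the pair (67, 64) maps to disk index 52, which
corresponds to sdba, so the gendisk fields are at least self-consistent.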

crash> scsi_device.sdev_state 0xffff8d52b1da4800
  sdev_state = SDEV_RUNNING

crash> Scsi_Host.shost_state 0xffff8d466b969000
  shost_state = SHOST_RUNNING

               if (scsi_target(sdev)->single_lun ||
                    !list_empty(&sdev->host->starved_list))
                        kblockd_schedule_work(&sdev->requeue_work);

Have you seen this before? Let me know what else you want from the dump
while I look further.
I have not tested for a while, so I am not sure when this crept in or
if it's even an issue for others.

Thanks
Laurence