Re: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!

On 21-07-17 15:07, Pascal de Bruijn wrote:
On 11-07-17 19:52, Nicholas A. Bellinger wrote:
Hi Pascal,

On Mon, 2017-07-10 at 20:52 -0700, Nicholas A. Bellinger wrote:
On Mon, 2017-07-10 at 20:41 -0700, Nicholas A. Bellinger wrote:
On Mon, 2017-07-10 at 17:04 +0200, Pascal de Bruijn wrote:
On 07-07-17 06:26, Nicholas A. Bellinger wrote:
On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
So ABORT_TASK can still be observed, but it no longer results in a
non-functional, not-quite-panicked machine.
Thanks a lot for the bug report and your continuous testing to get this
resolved. The patch is queued up in target-pending/for-next with your
Tested-by, and will be CC'ed to stable so the older v4.x.y and v3.x.y
kernels get this fix as well.

Thanks again.
I'm afraid we may not be quite there yet after all...

So we had the other two machines run an md check this weekend
as well, again with a ridiculously high sync_speed_max:

Jul 9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0
Jul 9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
Jul 9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
Jul 9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k.
Jul 9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
Jul 9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
Jul 9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
Jul 9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
Jul 9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620
Jul 9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140
This means the outstanding I/O was located, but never completed back.

Or, something else is blocked waiting for completion, such as an
outstanding qla2xxx WRITE transfer, before the abort may proceed.
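
To make that dependency concrete, here is a minimal sketch of the completion
handshake the abort path relies on. This is simplified stand-in code, not the
actual qla2xxx/target source; the struct and function names are made up for
illustration, and only write_pending_abort_comp is taken from the real driver.
If the data-out phase of the aborted WRITE is never resolved, complete() never
runs and the waiter stays in uninterruptible sleep:

#include <linux/completion.h>

/* Hypothetical stand-in for struct qla_tgt_cmd; only the member that
 * matters for this discussion is shown.  init_completion() is assumed
 * to have been done at command setup time. */
struct sketch_cmd {
	struct completion write_pending_abort_comp;
};

/* Abort side: ABORT_TASK ends up waiting here until the data-out phase
 * of the aborted WRITE has been resolved by the hardware. */
static void sketch_wait_for_aborted_write(struct sketch_cmd *cmd)
{
	/* Uninterruptible sleep: this is the D state seen on the kworkers
	 * if the matching complete() below never runs. */
	wait_for_completion(&cmd->write_pending_abort_comp);
}

/* Data-out completion side: must be reached for every aborted WRITE,
 * otherwise the waiter above never wakes and the abort cannot proceed. */
static void sketch_data_work_done(struct sketch_cmd *cmd, bool aborted)
{
	if (aborted)
		complete(&cmd->write_pending_abort_comp);
}

The debug patch further down in this thread simply instruments both sides of
this handshake with printk()s, to show which side goes missing.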

Jul 9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
Jul 9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done.

Btw, were there any hung task warnings while this was happening?

The machine in question was still responsive (was accepting
SSH logins), however it seemed VMware hosts weren't seeing
the volume anymore (presumably due to the heavy IO on the backend).

Also, several (12?) kworkers seemed stuck in a D state.

Next time that happens, do a 'cat /proc/$PID/stack' to see where those
kworkers are blocked in un-interruptible sleep.

When my colleague tried to reboot the machine, it got
(presumably) stuck on

/usr/bin/targetctl clear

After which it was forcefully rebooted :)

Sorry we don't have any more detailed info at this point.
We haven't been able to reproduce this on a
different machine yet :(
Ok, please try with the following debug patch to verify if it's blocked
on qla_tgt_cmd->write_pending_abort_comp.

diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
index 8c1bf9b..1199969 100644
--- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
+++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
@@ -425,6 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
     if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
         se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
         spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
+        printk("Completing write_pending_abort_comp for tag: %llu\n", se_cmd->tag);
         wait_for_completion(&cmd->write_pending_abort_comp);
         return 0;
     }
@@ -515,7 +516,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
          * waiting upon completion in tcm_qla2xxx_write_pending_status()
          */
         if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+            u64 tag = cmd->se_cmd.tag;
+
+            printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
             complete(&cmd->write_pending_abort_comp);
+            printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
             return;
         }
Whoops, mixed up the printk() output above..

Please use this one instead:

diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
index 8c1bf9b..01ffb8a 100644
--- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
+++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
@@ -417,6 +417,8 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
 {
     struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd);
     unsigned long flags;
+    u64 tag = se_cmd->tag;
+
     /*
      * Check for WRITE_PENDING status to determine if we need to wait for
      * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data().
@@ -425,7 +427,9 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
     if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
         se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
         spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
+        printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
         wait_for_completion(&cmd->write_pending_abort_comp);
+        printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
         return 0;
     }
     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
@@ -508,6 +512,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
      */
     cmd->cmd_in_wq = 0;

+    if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+        printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n",
+            cmd->se_cmd.tag, cmd->write_data_transferred);
+    }
+
     cmd->vha->tgt_counters.qla_core_ret_ctio++;
     if (!cmd->write_data_transferred) {
         /*
@@ -515,6 +524,8 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
          * waiting upon completion in tcm_qla2xxx_write_pending_status()
          */
         if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+            printk("Completing write_pending_abort_comp for tag: %llu\n",
+                cmd->se_cmd.tag);
             complete(&cmd->write_pending_abort_comp);
             return;
         }

We still haven't been able to reproduce this reliably, but we did come across the following, which is very likely related.

Essentially we're running 'echo check > /sys/block/md0/md/sync_action' throughout the night, with no practical sync_speed_max limit.

It was only in the morning, presumably when I was doing something in VMware (I don't recall exactly what), that VMware started reporting the volume as inaccessible.

Here's the full dmesg, to provide maximum context (sorry for the long post):

... SNIP ...
[Fri Jul 21 10:56:23 2017] INFO: task kworker/6:2:788 blocked for more than 180 seconds.
[Fri Jul 21 10:56:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:56:23 2017] kworker/6:2     D    0   788      2 0x00000000
[Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:56:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9ebcc9ce8000 ffff9ebcffc95080
[Fri Jul 21 10:56:23 2017]  ffff9ebcdc5c9780 ffffb11c24e67b28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:56:23 2017]  ffff9ebcc9ce8000 7fffffffffffffff 0000000000000000 ffff9ebcc9ce8000
[Fri Jul 21 10:56:23 2017] Call Trace:
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:56:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:56:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
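
Reading the trace bottom-up, it looks like qlt_free_session_done() calls
tcm_qla2xxx_free_session(), which calls target_wait_for_sess_cmds(), which is
stuck in wait_for_completion() for a command that never finishes. Very roughly
(a simplified sketch with made-up type names, not the actual target core code),
the session teardown waits like this:

#include <linux/list.h>
#include <linux/completion.h>

/* Hypothetical stand-ins for the session and per-command state. */
struct sketch_sess_cmd {
	struct list_head list;
	struct completion done;
};

struct sketch_session {
	struct list_head outstanding_cmds;
};

/* Simplified stand-in for target_wait_for_sess_cmds(): session teardown
 * waits for every outstanding command on the session, so a single command
 * whose completion never fires wedges the whole qlt_free_session_done()
 * work item (the blocked kworker above). */
static void sketch_wait_for_sess_cmds(struct sketch_session *sess)
{
	struct sketch_sess_cmd *cmd;

	list_for_each_entry(cmd, &sess->outstanding_cmds, list)
		wait_for_completion(&cmd->done);
}

So if one of the aborted WRITEs from earlier never completes, the session can
apparently never be torn down either.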

I hope this helps, let me know if you need anything from our end.

I forgot to clarify, so this was with 4.9.38, with the following patches applied:

1eb42f965cedafb700e9c902ddafb1c51e3117f7.patch
4db6a8145940d0bbd10265020d681961ce2d3238.patch
linux-4.9.38-target-filter-0x85-scsi-opcde-warning.patch (see below)
nab-debug.patch (the patch you asked me to apply)

Regards,
Pascal de Bruijn


diff -Nurpd a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
--- a/drivers/target/target_core_transport.c	2017-07-15 12:17:55.000000000 +0200
+++ b/drivers/target/target_core_transport.c	2017-07-20 09:51:21.494390926 +0200
@@ -1337,7 +1337,7 @@ target_setup_cmd_from_cdb(struct se_cmd
     trace_target_sequencer_start(cmd);

     ret = dev->transport->parse_cdb(cmd);
-    if (ret == TCM_UNSUPPORTED_SCSI_OPCODE)
+    if (ret == TCM_UNSUPPORTED_SCSI_OPCODE && cmd->t_task_cdb[0] != 0x85) /* filter VMware ATA PASS THROUGH for SMART */
         pr_warn_ratelimited("%s/%s: Unsupported SCSI Opcode 0x%02x, sending CHECK_CONDITION.\n",
                 cmd->se_tfo->get_fabric_name(),
                 cmd->se_sess->se_node_acl->initiatorname,




