Hello TomK & Co, Comments below. On Tue, 2016-10-25 at 22:05 -0400, TomK wrote: > On 10/25/2016 1:28 AM, TomK wrote: > > On 10/24/2016 2:36 AM, Nicholas A. Bellinger wrote: > >> Hi TomK, > >> > >> Thanks for reporting this bug. Comments inline below. > >> > >> On Mon, 2016-10-24 at 00:45 -0400, TomK wrote: > >>> On 10/24/2016 12:32 AM, TomK wrote: > >>>> On 10/23/2016 10:03 PM, TomK wrote: <SNIP> > >>>> Including the full log: > >>>> > >>>> http://microdevsys.com/linux-lio/messages-mailing-list > >>>> > >>> > >> > >> Thanks for posting with qla2xxx verbose debug enabled on your setup. > >> > >>> > >>> When tryint to shut down target using /etc/init.d/target stop, the > >>> following is printed repeatedly: > >>> > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e837:20: > >>> ABTS_RECV_24XX: instance 0 > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f811:20: > >>> qla_target(0): task abort (s_id=1:5:0, tag=1177068, param=0) > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f812:20: > >>> qla_target(0): task abort for non-existant session > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f80e:20: > >>> Scheduling work (type 1, prm ffff880093365680) to find session for param > >>> ffff88010f8c7680 (size 64, tgt ffff880111f06600) > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-f800:20: Sess > >>> work (tgt ffff880111f06600) > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending > >>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff880093365694, > >>> status=4 > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: > >>> ABTS_RESP_24XX: compl_status 31 > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e807:20: Sending > >>> retry TERM EXCH CTIO7 (ha=ffff88010fae0000) > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e806:20: Sending > >>> task mgmt ABTS response (ha=ffff88010fae0000, atio=ffff88010f8c76c0, > >>> status=0 > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e838:20: > >>> ABTS_RESP_24XX: compl_status 0 > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e872:20: > >>> qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 029c > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-3861:20: New > >>> command while device ffff880111f06600 is shutting down > >>> Oct 24 00:39:48 mbpc-pc kernel: qla2xxx [0000:04:00.0]-e859:20: > >>> qla_target: Unable to send command to target for req, ignoring. > >>> > >>> > >> > >> At your earliest convenience, please verify the patch using v4.8.y with > >> the above ABORT_TASK + shutdown scenario. > >> > >> Also, it would be helpful to understand why this ESX FC host is > >> generating ABORT_TASKs. > >> > >> Eg: Is ABORT_TASK generated due to FC target response packet loss..? > >> Or due to target backend I/O latency, that ultimately triggers FC host > >> side timeouts...? > >> Ok, so the specific hung task warnings reported earlier above are ABORT_TASK due to the target-core backend md array holding onto outstanding I/O long enough, for ESX host side SCSI timeouts to begin to trigger. > >>> > >>> + when I disable the ports on the brocade switch that we're using then > >>> try to stop target, the following is printed: > >>> > >>> > >>> > >>> Oct 24 00:41:31 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop > >>> down - seconds remaining 231. > >>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop > >>> down - seconds remaining 153. 
> >>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------ > >>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at > >>> lib/list_debug.c:33 __list_add+0xbe/0xd0 > >>> Oct 24 00:41:32 mbpc-pc kernel: list_add corruption. prev->next should > >>> be next (ffff88009e83b330), but was ffff88011fc972a0. > >>> (prev=ffff880118ada4c0). > >>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc > >>> tcm_loop target_core_file target_core_iblock target_core_pscsi > >>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables > >>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM > >>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87 > >>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev > >>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt > >>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse > >>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456 > >>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq > >>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii > >>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel > >>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm > >>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4 > >>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f > >>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi > >>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror > >>> dm_region_hash dm_log dm_mod > >>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 Not > >>> tainted 4.8.4 #2 > >>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co., > >>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010 > >>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48 > >>> ffffffff812e88e9 ffffffff8130753e > >>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8 > >>> 0000000000000000 ffff880092b83b98 > >>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952 > >>> 0000002100000046 ffffffff8101eae8 > >>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace: > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>] > >>> dump_stack+0x51/0x78 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] ? > >>> __list_add+0xbe/0xd0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ? > >>> __switch_to+0x398/0x7e0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>] > >>> warn_slowpath_fmt+0x49/0x50 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130753e>] > >>> __list_add+0xbe/0xd0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>] > >>> move_linked_works+0x62/0x90 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>] > >>> process_one_work+0x25c/0x4e0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? > >>> schedule+0x40/0xb0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>] > >>> worker_thread+0x16d/0x520 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ? > >>> __schedule+0x2fd/0x6a0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ? > >>> default_wake_function+0x12/0x20 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ? > >>> __wake_up_common+0x56/0x90 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? 
> >>> maybe_create_worker+0x110/0x110 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? > >>> schedule+0x40/0xb0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? > >>> maybe_create_worker+0x110/0x110 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ? > >>> schedule_tail+0x1e/0xc0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>] > >>> ret_from_fork+0x1f/0x40 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ? > >>> kthread_freezable_should_stop+0x70/0x70 > >>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f464 ]--- > >>> Oct 24 00:41:32 mbpc-pc kernel: ------------[ cut here ]------------ > >>> Oct 24 00:41:32 mbpc-pc kernel: WARNING: CPU: 2 PID: 8615 at > >>> lib/list_debug.c:36 __list_add+0x9c/0xd0 > >>> Oct 24 00:41:32 mbpc-pc kernel: list_add double add: > >>> new=ffff880118ada4c0, prev=ffff880118ada4c0, next=ffff88009e83b330. > >>> Oct 24 00:41:32 mbpc-pc kernel: Modules linked in: tcm_qla2xxx tcm_fc > >>> tcm_loop target_core_file target_core_iblock target_core_pscsi > >>> target_core_mod configfs ip6table_filter ip6_tables ebtable_nat ebtables > >>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM > >>> iptable_mangle bridge nfsd lockd grace nfs_acl auth_rpcgss autofs4 it87 > >>> hwmon_vid bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ppdev > >>> parport_pc parport sunrpc cpufreq_ondemand bonding ipv6 crc_ccitt > >>> ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables fuse > >>> vfat fat xfs vhost_net macvtap macvlan vhost tun uinput raid456 > >>> async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid6_pq > >>> libcrc32c joydev sg serio_raw e1000 kvm_amd kvm irqbypass r8169 mii > >>> pcspkr k10temp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel > >>> snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm > >>> snd_timer snd soundcore i2c_piix4 i2c_core wmi acpi_cpufreq shpchp ext4 > >>> mbcache jbd2 qla2xxx scsi_transport_fc floppy firewire_ohci f > >>> Oct 24 00:41:32 mbpc-pc kernel: irewire_core crc_itu_t sd_mod pata_acpi > >>> ata_generic pata_jmicron ahci libahci usb_storage dm_mirror > >>> dm_region_hash dm_log dm_mod > >>> Oct 24 00:41:32 mbpc-pc kernel: CPU: 2 PID: 8615 Comm: kworker/2:3 > >>> Tainted: G W 4.8.4 #2 > >>> Oct 24 00:41:32 mbpc-pc kernel: Hardware name: Gigabyte Technology Co., > >>> Ltd. GA-890XA-UD3/GA-890XA-UD3, BIOS FC 08/02/2010 > >>> Oct 24 00:41:32 mbpc-pc kernel: 0000000000000000 ffff880092b83b48 > >>> ffffffff812e88e9 ffffffff8130751c > >>> Oct 24 00:41:32 mbpc-pc kernel: ffff880092b83ba8 ffff880092b83ba8 > >>> 0000000000000000 ffff880092b83b98 > >>> Oct 24 00:41:32 mbpc-pc kernel: ffffffff81066a7d ffff88000058f952 > >>> 0000002400000046 ffffffff8101eae8 > >>> Oct 24 00:41:32 mbpc-pc kernel: Call Trace: > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff812e88e9>] > >>> dump_stack+0x51/0x78 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] ? > >>> __list_add+0x9c/0xd0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066a7d>] __warn+0xfd/0x120 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8101eae8>] ? 
> >>> __switch_to+0x398/0x7e0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81066b59>] > >>> warn_slowpath_fmt+0x49/0x50 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8130751c>] > >>> __list_add+0x9c/0xd0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8107d0b2>] > >>> move_linked_works+0x62/0x90 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108070c>] > >>> process_one_work+0x25c/0x4e0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? > >>> schedule+0x40/0xb0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8108150d>] > >>> worker_thread+0x16d/0x520 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162bb3d>] ? > >>> __schedule+0x2fd/0x6a0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810923f2>] ? > >>> default_wake_function+0x12/0x20 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810a6f06>] ? > >>> __wake_up_common+0x56/0x90 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? > >>> maybe_create_worker+0x110/0x110 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162c040>] ? > >>> schedule+0x40/0xb0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff810813a0>] ? > >>> maybe_create_worker+0x110/0x110 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8109130e>] ? > >>> schedule_tail+0x1e/0xc0 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff8162f60f>] > >>> ret_from_fork+0x1f/0x40 > >>> Oct 24 00:41:32 mbpc-pc kernel: [<ffffffff81085f20>] ? > >>> kthread_freezable_should_stop+0x70/0x70 > >>> Oct 24 00:41:32 mbpc-pc kernel: ---[ end trace 713a9071c9f5f465 ]--- > >>> Oct 24 00:41:32 mbpc-pc kernel: qla2xxx [0000:04:00.1]-680a:21: Loop > >>> down - seconds remaining 230. > >>> Oct 24 00:41:33 mbpc-pc kernel: qla2xxx [0000:04:00.0]-680a:20: Loop > >>> down - seconds remaining 152. > >>> > >>> > >> > >> Mmmm. Could be a side effect of the target-core regression, but not > >> completely sure.. > >> > >> Adding QLOGIC folks CC'. > >> > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > >> the body of a message to majordomo@xxxxxxxxxxxxxxx > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> Adding Anil CC' > > > > Hey Nicholas, > > > > > >> At your earliest convenience, please verify the patch using v4.8.y with > >> the above ABORT_TASK + shutdown scenario. > >> > >> Also, it would be helpful to understand why this ESX FC host is > >> generating ABORT_TASKs. > >> > >> Eg: Is ABORT_TASK generated due to FC target response packet loss..? > >> Or due to target backend I/O latency, that ultimately triggers FC host > >> side timeouts...? > > > > > > Here is where it gets interesting and to your thought above. 
Take for > > example this log snippet > > (http://microdevsys.com/linux-lio/messages-recent): > > > > Oct 23 22:12:51 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Jan 7 00:36:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon > > successfully started > > Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts > > Oct 23 22:14:14 mbpc-pc kernel: hpet1: lost 1 rtc interrupts > > Oct 23 22:15:02 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Oct 23 22:15:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Jan 7 00:38:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon > > successfully started > > Oct 23 22:16:29 mbpc-pc kernel: hpet1: lost 9599 rtc interrupts > > Oct 23 22:17:30 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Jan 7 00:40:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon > > successfully started > > Oct 23 22:18:29 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx > > task_tag: 1195032 > > Oct 23 22:18:33 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Sending > > TMR_FUNCTION_COMPLETE for ref_tag: 1195032 > > Oct 23 22:18:42 mbpc-pc kernel: ABORT_TASK: Found referenced qla2xxx > > task_tag: 1122276 > > Oct 23 22:19:35 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Jan 7 00:42:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon > > successfully started > > Oct 23 22:20:41 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Oct 23 22:21:07 mbpc-pc kernel: INFO: task kworker/u16:8:308 blocked for > > more than 120 seconds. > > Oct 23 22:21:07 mbpc-pc kernel: Not tainted 4.8.4 #2 > > Oct 23 22:21:07 mbpc-pc kernel: "echo 0 > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > Oct 23 22:21:07 mbpc-pc kernel: kworker/u16:8 D ffff880111b8fa18 0 > > 308 2 0x00000000 > > Oct 23 22:21:07 mbpc-pc kernel: Workqueue: tmr-fileio target_tmr_work > > [target_core_mod] > > Oct 23 22:21:07 mbpc-pc kernel: ffff880111b8fa18 0000000000000400 > > ffff880112180480 ffff880111b8f998 > > Oct 23 22:21:07 mbpc-pc kernel: ffff88011107a380 ffffffff81f99ca0 > > ffffffff81f998ef ffff880100000000 > > Oct 23 22:21:07 mbpc-pc kernel: ffffffff812f27d9 0000000000000000 > > ffffe8ffffcda000 ffff880000000000 > > Oct 23 22:21:07 mbpc-pc kernel: Call Trace: > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff812f27d9>] ? number+0x2e9/0x310 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] schedule+0x40/0xb0 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080169>] ? > > start_flush_work+0x49/0x180 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162e7ec>] > > schedule_timeout+0x9c/0xe0 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810802ba>] ? flush_work+0x1a/0x40 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bd15c>] ? > > console_unlock+0x35c/0x380 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162cfa0>] > > wait_for_completion+0xc0/0xf0 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923e0>] ? > > try_to_wake_up+0x260/0x260 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6f84>] > > __transport_wait_for_tasks+0xb4/0x1b0 [target_core_mod] > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810bdd1f>] ? > > vprintk_default+0x1f/0x30 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8115cc5c>] ? 
printk+0x46/0x48 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f70c4>] > > transport_wait_for_tasks+0x44/0x60 [target_core_mod] > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f46e2>] > > core_tmr_abort_task+0xf2/0x160 [target_core_mod] > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffffa08f6aa4>] > > target_tmr_work+0x154/0x160 [target_core_mod] > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81080639>] > > process_one_work+0x189/0x4e0 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810d060c>] ? > > del_timer_sync+0x4c/0x60 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108131e>] ? > > maybe_create_worker+0x8e/0x110 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8108150d>] > > worker_thread+0x16d/0x520 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810923f2>] ? > > default_wake_function+0x12/0x20 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810a6f06>] ? > > __wake_up_common+0x56/0x90 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ? > > maybe_create_worker+0x110/0x110 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162c040>] ? schedule+0x40/0xb0 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff810813a0>] ? > > maybe_create_worker+0x110/0x110 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085fec>] kthread+0xcc/0xf0 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8109130e>] ? > > schedule_tail+0x1e/0xc0 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff8162f60f>] > > ret_from_fork+0x1f/0x40 > > Oct 23 22:21:07 mbpc-pc kernel: [<ffffffff81085f20>] ? > > kthread_freezable_should_stop+0x70/0x70 > > Oct 23 22:21:52 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > Jan 7 00:44:00 192.168.0.2 syslog: dhcpfwd : dhcp forwarder daemon > > successfully started > > Oct 23 22:23:03 mbpc-pc kernel: hpet1: lost 9600 rtc interrupts > > > > > > And compare it to the following snippet > > (http://microdevsys.com/linux-lio/iostat-tkx-interesting-bit.txt) taken > > from this bigger iostat session > > (http://microdevsys.com/linux-lio/iostat-tkx.txt): > > <SNIP> > > > > > > We can see that /dev/sdf ramps up to 100% starting at around (10/23/2016 > > 10:18:18 PM) and stays that way till about the (10/23/2016 10:18:42 PM) > > mark when something occurs and it drops down to below 100% numbers. > > > > So I checked the array which shows all clean, even across reboots: > > > > [root@mbpc-pc ~]# cat /proc/mdstat > > Personalities : [raid6] [raid5] [raid4] > > md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8] > > 3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] > > [UUUUUU] > > bitmap: 1/8 pages [4KB], 65536KB chunk > > > > unused devices: <none> > > [root@mbpc-pc ~]# > > > > > > Then I run smartctl across all disks and sure enough /dev/sdf prints this: > > > > [root@mbpc-pc ~]# smartctl -A /dev/sdf > > smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build) > > Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net > > > > Error SMART Values Read failed: scsi error badly formed scsi parameters > > Smartctl: SMART Read Values failed. > > > > === START OF READ SMART DATA SECTION === > > [root@mbpc-pc ~]# > > > > So it would appear we found the root cause, a bad disk. True the disk > > is bad and I'll be replacing it however, even with a degraded disk > > (checking now) the array functions just fine and I have no data loss. I > > only lost 1. 
I would have to lose 3 to get a catastrophic failure on > > this RAID6: > > > > [root@mbpc-pc ~]# cat /proc/mdstat > > Personalities : [raid6] [raid5] [raid4] > > md0 : active raid6 sdf[6](F) sda[5] sde[8] sdb[7] sdd[3] sdc[1] > > 3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5] > > [UUUU_U] > > bitmap: 6/8 pages [24KB], 65536KB chunk > > > > unused devices: <none> > > [root@mbpc-pc ~]# mdadm --detail /dev/md0 > > /dev/md0: > > Version : 1.2 > > Creation Time : Mon Mar 26 00:06:24 2012 > > Raid Level : raid6 > > Array Size : 3907045632 (3726.05 GiB 4000.81 GB) > > Used Dev Size : 976761408 (931.51 GiB 1000.20 GB) > > Raid Devices : 6 > > Total Devices : 6 > > Persistence : Superblock is persistent > > > > Intent Bitmap : Internal > > > > Update Time : Tue Oct 25 00:31:13 2016 > > State : clean, degraded > > Active Devices : 5 > > Working Devices : 5 > > Failed Devices : 1 > > Spare Devices : 0 > > > > Layout : left-symmetric > > Chunk Size : 64K > > > > Name : mbpc:0 > > UUID : 2f36ac48:5e3e4c54:72177c53:bea3e41e > > Events : 118368 > > > > Number Major Minor RaidDevice State > > 8 8 64 0 active sync /dev/sde > > 1 8 32 1 active sync /dev/sdc > > 7 8 16 2 active sync /dev/sdb > > 3 8 48 3 active sync /dev/sdd > > 8 0 0 8 removed > > 5 8 0 5 active sync /dev/sda > > > > 6 8 80 - faulty /dev/sdf > > [root@mbpc-pc ~]# > > > > Last night I cut power to the /dev/sdf disk to spin it down, then removed > > it and reinserted it. The array resynced without issue, however the > > smartctl -A command still failed on it. Today I checked and bad blocks > > had been recorded on the disk and the array has since removed /dev/sdf (per > > above). Also I have to say that these ESXi hosts worked in this > > configuration, without any hiccup, for about 4 months. No LUN failure > > on the ESXi side. I haven't changed the LUN in that time (had no reason > > to do so). > > > > So now here's the real question that I have. Why would the array > > continue to function, as intended, with only one disk failure, yet the > > QLogic / Target drivers stop and error out? The RAID6 (software) array > > should care about the failure, and it should handle it. QLogic / Target > > drivers shouldn't really be too impacted (aside from read speed maybe) > > by a disk failing inside the array. That would be my thinking. The > > Target / QLogic software seems to have picked up on the failure ahead of > > the software RAID 6 detecting it. I've had this RAID6 for over 6 years > > now. Aside from the occasional disk replacement, it has been quite rock solid. The earlier hung task warnings after ABORT_TASK w/ TMR_FUNCTION_COMPLETE and after explicit configfs shutdown are likely due to the missing SCF_ACK_KREF bit assignment. Note the bug is specific to high backend I/O latency with v4.1+ code, so you'll want to include the fix in all future builds. AFAICT thus far, the list corruption bug reported here, and also from Anil & Co, looks like a separate bug involving tcm_qla2xxx ports.
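Btw, since md typically only kicks a member once it sees hard I/O errors, while the fileio backend feels the latency as soon as a disk starts limping, it's worth keeping an eye on per-disk latency and SMART health while the ESX hosts are pushing I/O. A rough sketch only (adjust the device list to your actual md members, and the grep to whatever your smartmontools build reports):

   # per-device service times + utilization, refreshed every 5 seconds
   iostat -tkx 5

   # quick SMART health pass over the array members
   for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
       echo "== $d =="
       smartctl -H -A $d | egrep -i 'result|Reallocated|Pending|Uncorrect'
   done

One member pegged in the %util / await columns, like /dev/sdf in your iostat-tkx capture, is usually the tell well before md marks the disk faulty.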
> > > > So anyway, I added the fix you pointed out to the 4.8.4 kernel and > > recompiled. I restarted it, with the RAID6 degraded as it is. All > > mounted fine and I checked the LUNs from the ESXi side: > > > > [root@mbpc-pc ~]# /etc/init.d/target start > > The Linux SCSI Target is already stopped [ OK ] > > [info] The Linux SCSI Target looks properly installed. > > The configfs filesystem was not mounted, consider adding it[WARNING] > > [info] Loaded core module target_core_mod. > > [info] Loaded core module target_core_pscsi. > > [info] Loaded core module target_core_iblock. > > [info] Loaded core module target_core_file. > > Failed to load fabric module ib_srpt [WARNING] > > Failed to load fabric module tcm_usb_gadget [WARNING] > > [info] Loaded fabric module tcm_loop. > > [info] Loaded fabric module tcm_fc. > > Failed to load fabric module vhost_scsi [WARNING] > > [info] Loaded fabric module tcm_qla2xxx. > > Failed to load fabric module iscsi_target_mod [WARNING] > > [info] Loading config from /etc/target/scsi_target.lio, this may take > > several minutes for FC adapters. > > [info] Loaded /etc/target/scsi_target.lio. > > Started The Linux SCSI Target [ OK ] > > [root@mbpc-pc ~]# > > > > > > Enabled the brocade ports: > > > > > > 18 18 011200 id N4 No_Light FC > > 19 19 011300 id N4 No_Sync FC Disabled (Persistent) > > 20 20 011400 id N4 No_Light FC > > 21 21 011500 id N4 No_Light FC > > 22 22 011600 id N4 No_Light FC > > 23 23 011700 id N4 No_Light FC Disabled (Persistent) > > 24 24 011800 -- N4 No_Module FC (No POD License) Disabled > > 25 25 011900 -- N4 No_Module FC (No POD License) Disabled > > 26 26 011a00 -- N4 No_Module FC (No POD License) Disabled > > 27 27 011b00 -- N4 No_Module FC (No POD License) Disabled > > 28 28 011c00 -- N4 No_Module FC (No POD License) Disabled > > 29 29 011d00 -- N4 No_Module FC (No POD License) Disabled > > 30 30 011e00 -- N4 No_Module FC (No POD License) Disabled > > 31 31 011f00 -- N4 No_Module FC (No POD License) Disabled > > sw0:admin> portcfgpersistentenable 19 > > sw0:admin> portcfgpersistentenable 23 > > sw0:admin> date > > Tue Oct 25 04:03:42 UTC 2016 > > sw0:admin> > > > > And still after 30 minutes, there is no failure. This run includes the > > fix you asked me to add > > (https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e) > > . If everything works, I will revert the patch and see if I can > > reproduce the issue. If I can reproduce it, then the disk might not > > have been the cause, but rather the patch. I'll keep you posted on that when I > > get a new disk tomorrow. > > > > > > Right now this is a POC setup so I have lots of room to experiment. > > > > Hey Nicholas, > > I've done some testing up till now. With or without the patch above, as > long as the faulty disk is removed from the RAID 6 software array, > everything works fine with the Target Driver and ESXi hosts. Thanks again for the extra debug + feedback, and for confirming that the earlier hung task warnings coincide with the md disk failure. > This is > even on a degraded array: > > [root@mbpc-pc ~]# cat /proc/mdstat > Personalities : [raid6] [raid5] [raid4] > md0 : active raid6 sdc[1] sda[5] sdd[3] sde[8] sdb[7] > 3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/5] > [UUUU_U] > bitmap: 6/8 pages [24KB], 65536KB chunk > > unused devices: <none> > [root@mbpc-pc ~]# > > So at the moment, the data points to the single failed disk > (/dev/sdf) as causing the Target drivers or QLogic cards to throw an > exception. > > Tomorrow I will insert the failed disk back in to see if a) the array > takes it back, and b) it causes a failure with the patch applied. > > Looks like the failed disk /dev/sdf was limping along for months, and it > didn't collapse on itself until I removed the power. > AFAICT, the list corruption observed is a separate bug from the hung tasks seen during ABORT_TASK w/ TMR_FUNCTION_COMPLETE and explicit target shutdown.
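Also, for tomorrow's re-insert test it may be easier to drive it through mdadm instead of pulling power again, so the array state stays well defined either way. Something along these lines, a sketch only; double-check the device name first, since sdX names can shift after a hot-plug:

   # confirm md still lists the old member as faulty
   mdadm --detail /dev/md0 | grep -i faulty

   # drop the faulty member, then try to re-add it once the disk is back in
   mdadm /dev/md0 --remove /dev/sdf
   mdadm /dev/md0 --re-add /dev/sdf

   # watch the recovery while the ESX hosts are doing I/O against the LUN
   watch -n 5 cat /proc/mdstat

If smartctl still can't read the SMART data after the re-add, I'd treat the disk as dead regardless of whether md accepts it back, and run the revert-the-patch test with a known good member in its place.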
-- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html