Re: LUNs become unavailable with current git HEAD

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Mon, 14 Oct 2013 13:04:47 -0700

On Sat, 2013-10-12 at 09:28 +0200, Thomas Glanzmann wrote:
> Hallo Nab,
> I'll reproduce this over the weekend, but I'm not sure that I can, but I'll do
> my best.
> 
> A few more information of things, that we did:
> 
>         - Network configuration was as always two bonds with two links each
>           with mac hash, one ip per bond. Portal exposed both IPs.
> 
>         - We had iSCSI Port binding with round robin on. So we used
>           multiple iSCSI sessions per ESX server (4 active per LUN).
> 
>         - I dropped the buffer cache of the target to demonstrate how to
>           free up memory fast. The following lines are from my scroll back
>           buffer of screen:
> 
> (node-62) [~/work/linux-2.6] free
>              total       used       free     shared    buffers     cached
> Mem:      66083628   65708924     374704          0     230056   63309488
> -/+ buffers/cache:    2169380   63914248
> Swap:            0          0          0
> (node-62) [~/work/linux-2.6] /
> sithglan has logged on pts/0 from infra-vlan10.gmvl.de
> (node-62) [~/work/linux-2.6] sync; echo 3 > /proc/sys/vm/drop_caches
> free
> (node-62) [~/work/linux-2.6] free
>              total       used       free     shared    buffers     cached
> Mem:      66083628     547728   65535900          0        980       7616
> -/+ buffers/cache:     539132   65544496
> Swap:            0          0          0
> (node-62) [~/work/linux-2.6] dmesg | tail
> [   11.899264] IPv6: ADDRCONF(NETDEV_CHANGE): bond1.101: link becomes ready
> [   11.899542] IPv6: ADDRCONF(NETDEV_CHANGE): bond1.102: link becomes ready
> [   12.315195] igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
> [   12.398846] bonding: bond1: link status definitely up for interface eth2, 1000 Mbps full duplex.
> [   14.587938] Bridge firewalling registered
> [   14.665002] Rounding down aligned max_sectors from 4294967295 to 4294967288
> [57628.200959] Detected MISCOMPARE for addr: ffff8805702cc000 buf: ffff880c4128a000
> [57628.200965] Target/fileio: Send MISCOMPARE check condition and sense
> [57628.336304] Detected MISCOMPARE for addr: ffff88066d090000 buf: ffff880c4128a000
> [57628.336310] Target/fileio: Send MISCOMPARE check condition and sense
> (node-62) [~/work/linux-2.6] exit
> (node-62) [~/work/linux-2.6] free
>              total       used       free     shared    buffers     cached
> Mem:      66083628     600624   65483004          0       2848      83100
> -/+ buffers/cache:     514676   65568952
> Swap:            0          0          0
> 
> However. Afterwards it was stable. And I did this one day before the I/O stall
> happened.
> 
>         - Shortly before the All Paths Down thing happened we upgraded 8 ESX
>           servers from 5.1 GA to the newest 5.1 patch available.
> 
>         - We had approx. 36 - 72 GB of static state (virtual machine hard
>           disks) on the particular two LUNs in question.
> 
>         - While the issue happed or shortly after or shortly before, I deployed
>           an 8 GB fully patched w2k3 VM. This was the only incident we had. The
>           other four days it was rock stable, no issues whatsoever and we tried
>           to stress it by doing the rescans/svMotions.
> 
> I don't know if this information helps, but I as I said I'll do my best in
> order to reproduce it. And we saw the issue on all 8 ESX servers, everything
> locked up until I did rebooted the target. Afterwards everything was fine, of
> course we had a few timed out tasks in vCenter but this were only symptoms.
> 

Hey Thomas,

I'm still confused wrt points from above, vs. what kern-2.log shows:

Oct 11 12:14:40 node-62 kernel: [220709.511271] ABORT_TASK: Found referenced iSCSI task_tag: 7795
Oct 11 12:14:40 node-62 kernel: [220709.511275] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7795
Oct 11 12:14:40 node-62 kernel: [220709.511297] ABORT_TASK: Found referenced iSCSI task_tag: 7797
Oct 11 12:14:40 node-62 kernel: [220709.511298] ABORT_TASK: ref_tag: 7797 already complete, skipping
Oct 11 12:14:40 node-62 kernel: [220709.511299] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7797
Oct 11 12:14:40 node-62 kernel: [220709.511308] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7795
Oct 11 12:14:40 node-62 kernel: [220709.511322] ABORT_TASK: Found referenced iSCSI task_tag: 7797
Oct 11 12:14:40 node-62 kernel: [220709.511323] ABORT_TASK: ref_tag: 7797 already complete, skipping
Oct 11 12:14:40 node-62 kernel: [220709.511324] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7797
Oct 11 12:14:40 node-62 kernel: [220709.511420] ABORT_TASK: Found referenced iSCSI task_tag: 37422
Oct 11 12:14:40 node-62 kernel: [220709.511422] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 37422
Oct 11 12:14:40 node-62 kernel: [220709.511425] ABORT_TASK: Found referenced iSCSI task_tag: 37423
Oct 11 12:14:40 node-62 kernel: [220709.511426] ABORT_TASK: ref_tag: 37423 already complete, skipping
Oct 11 12:14:40 node-62 kernel: [220709.511427] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 37423
Oct 11 12:14:40 node-62 kernel: [220709.511503] ABORT_TASK: Found referenced iSCSI task_tag: 6574
Oct 11 12:14:40 node-62 kernel: [220709.511505] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6574
Oct 11 12:14:40 node-62 kernel: [220709.511619] ABORT_TASK: Found referenced iSCSI task_tag: 1593
Oct 11 12:14:40 node-62 kernel: [220709.511623] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1593
Oct 11 12:14:40 node-62 kernel: [220709.516794] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000009f
Oct 11 12:14:40 node-62 kernel: [220709.524610] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x000000a0
Oct 11 12:14:40 node-62 kernel: [220709.524690] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000007f
Oct 11 12:14:40 node-62 kernel: [220709.524805] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000013
Oct 11 12:14:40 node-62 kernel: [220709.546974] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000075
Oct 11 12:14:41 node-62 kernel: [220710.320215] kmem_cache_destroy lio_qr_cache: Slab cache still has objects
Oct 11 12:14:41 node-62 kernel: [220710.327218] CPU: 11 PID: 10178 Comm: rmmod Tainted: G           O 3.12.0-rc3+ #1
Oct 11 12:14:41 node-62 kernel: [220710.327222] Hardware name: Supermicro X9SRD-F/X9SRD-F, BIOS 1.0a 10/15/2012
Oct 11 12:14:41 node-62 kernel: [220710.327226]  0000000000000000 0000000000000000 ffffffff8137d7c1 ffff8810214093c0
Oct 11 12:14:41 node-62 kernel: [220710.327233]  ffffffff810d9a27 ffffffffa03e7e50 ffffffffa03d6d98 ffffffffa03d6d70
Oct 11 12:14:41 node-62 kernel: [220710.327238]  ffffffff8108442e ffffffff8105946c 00000000813806ec ffffffff81645210
Oct 11 12:14:41 node-62 kernel: [220710.327244] Call Trace:
Oct 11 12:14:41 node-62 kernel: [220710.327256]  [<ffffffff8137d7c1>] ? dump_stack+0x41/0x51
Oct 11 12:14:41 node-62 kernel: [220710.327282]  [<ffffffff810d9a27>] ? kmem_cache_destroy+0xcb/0xdf
Oct 11 12:14:41 node-62 kernel: [220710.327297]  [<ffffffffa03d6d98>] ? iscsi_target_cleanup_module+0x28/0x290 [iscsi_target_mod]
Oct 11 12:14:41 node-62 kernel: [220710.327308]  [<ffffffffa03d6d70>] ? iscsit_put_transport+0xf/0xf [iscsi_target_mod]
Oct 11 12:14:41 node-62 kernel: [220710.327315]  [<ffffffff8108442e>] ? SyS_delete_module+0x215/0x299
Oct 11 12:14:41 node-62 kernel: [220710.327321]  [<ffffffff8105946c>] ? should_resched+0x5/0x23
Oct 11 12:14:41 node-62 kernel: [220710.327327]  [<ffffffff81381b32>] ? page_fault+0x22/0x30
Oct 11 12:14:41 node-62 kernel: [220710.327333]  [<ffffffff81386722>] ? system_call_fastpath+0x16/0x1b

So the kmem_cache_destroy item indicates that the target service was
stopped very soon after the ABORT_TASKs occured.

Can you confirm that the target was stopped, and not restarted..?

--nab

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html