On Sat, 2013-10-12 at 09:28 +0200, Thomas Glanzmann wrote: > Hallo Nab, > I'll reproduce this over the weekend, but I'm not sure that I can, but I'll do > my best. > > A few more information of things, that we did: > > - Network configuration was as always two bonds with two links each > with mac hash, one ip per bond. Portal exposed both IPs. > > - We had iSCSI Port binding with round robin on. So we used > multiple iSCSI sessions per ESX server (4 active per LUN). > > - I dropped the buffer cache of the target to demonstrate how to > free up memory fast. The following lines are from my scroll back > buffer of screen: > > (node-62) [~/work/linux-2.6] free > total used free shared buffers cached > Mem: 66083628 65708924 374704 0 230056 63309488 > -/+ buffers/cache: 2169380 63914248 > Swap: 0 0 0 > (node-62) [~/work/linux-2.6] / > sithglan has logged on pts/0 from infra-vlan10.gmvl.de > (node-62) [~/work/linux-2.6] sync; echo 3 > /proc/sys/vm/drop_caches > free > (node-62) [~/work/linux-2.6] free > total used free shared buffers cached > Mem: 66083628 547728 65535900 0 980 7616 > -/+ buffers/cache: 539132 65544496 > Swap: 0 0 0 > (node-62) [~/work/linux-2.6] dmesg | tail > [ 11.899264] IPv6: ADDRCONF(NETDEV_CHANGE): bond1.101: link becomes ready > [ 11.899542] IPv6: ADDRCONF(NETDEV_CHANGE): bond1.102: link becomes ready > [ 12.315195] igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX > [ 12.398846] bonding: bond1: link status definitely up for interface eth2, 1000 Mbps full duplex. > [ 14.587938] Bridge firewalling registered > [ 14.665002] Rounding down aligned max_sectors from 4294967295 to 4294967288 > [57628.200959] Detected MISCOMPARE for addr: ffff8805702cc000 buf: ffff880c4128a000 > [57628.200965] Target/fileio: Send MISCOMPARE check condition and sense > [57628.336304] Detected MISCOMPARE for addr: ffff88066d090000 buf: ffff880c4128a000 > [57628.336310] Target/fileio: Send MISCOMPARE check condition and sense > (node-62) [~/work/linux-2.6] exit > (node-62) [~/work/linux-2.6] free > total used free shared buffers cached > Mem: 66083628 600624 65483004 0 2848 83100 > -/+ buffers/cache: 514676 65568952 > Swap: 0 0 0 > > However. Afterwards it was stable. And I did this one day before the I/O stall > happened. > > - Shortly before the All Paths Down thing happened we upgraded 8 ESX > servers from 5.1 GA to the newest 5.1 patch available. > > - We had approx. 36 - 72 GB of static state (virtual machine hard > disks) on the particular two LUNs in question. > > - While the issue happed or shortly after or shortly before, I deployed > an 8 GB fully patched w2k3 VM. This was the only incident we had. The > other four days it was rock stable, no issues whatsoever and we tried > to stress it by doing the rescans/svMotions. > > I don't know if this information helps, but I as I said I'll do my best in > order to reproduce it. And we saw the issue on all 8 ESX servers, everything > locked up until I did rebooted the target. Afterwards everything was fine, of > course we had a few timed out tasks in vCenter but this were only symptoms. > Hey Thomas, I'm still confused wrt points from above, vs. what kern-2.log shows: Oct 11 12:14:40 node-62 kernel: [220709.511271] ABORT_TASK: Found referenced iSCSI task_tag: 7795 Oct 11 12:14:40 node-62 kernel: [220709.511275] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7795 Oct 11 12:14:40 node-62 kernel: [220709.511297] ABORT_TASK: Found referenced iSCSI task_tag: 7797 Oct 11 12:14:40 node-62 kernel: [220709.511298] ABORT_TASK: ref_tag: 7797 already complete, skipping Oct 11 12:14:40 node-62 kernel: [220709.511299] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7797 Oct 11 12:14:40 node-62 kernel: [220709.511308] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7795 Oct 11 12:14:40 node-62 kernel: [220709.511322] ABORT_TASK: Found referenced iSCSI task_tag: 7797 Oct 11 12:14:40 node-62 kernel: [220709.511323] ABORT_TASK: ref_tag: 7797 already complete, skipping Oct 11 12:14:40 node-62 kernel: [220709.511324] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7797 Oct 11 12:14:40 node-62 kernel: [220709.511420] ABORT_TASK: Found referenced iSCSI task_tag: 37422 Oct 11 12:14:40 node-62 kernel: [220709.511422] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 37422 Oct 11 12:14:40 node-62 kernel: [220709.511425] ABORT_TASK: Found referenced iSCSI task_tag: 37423 Oct 11 12:14:40 node-62 kernel: [220709.511426] ABORT_TASK: ref_tag: 37423 already complete, skipping Oct 11 12:14:40 node-62 kernel: [220709.511427] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 37423 Oct 11 12:14:40 node-62 kernel: [220709.511503] ABORT_TASK: Found referenced iSCSI task_tag: 6574 Oct 11 12:14:40 node-62 kernel: [220709.511505] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6574 Oct 11 12:14:40 node-62 kernel: [220709.511619] ABORT_TASK: Found referenced iSCSI task_tag: 1593 Oct 11 12:14:40 node-62 kernel: [220709.511623] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1593 Oct 11 12:14:40 node-62 kernel: [220709.516794] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000009f Oct 11 12:14:40 node-62 kernel: [220709.524610] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x000000a0 Oct 11 12:14:40 node-62 kernel: [220709.524690] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000007f Oct 11 12:14:40 node-62 kernel: [220709.524805] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000013 Oct 11 12:14:40 node-62 kernel: [220709.546974] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000075 Oct 11 12:14:41 node-62 kernel: [220710.320215] kmem_cache_destroy lio_qr_cache: Slab cache still has objects Oct 11 12:14:41 node-62 kernel: [220710.327218] CPU: 11 PID: 10178 Comm: rmmod Tainted: G O 3.12.0-rc3+ #1 Oct 11 12:14:41 node-62 kernel: [220710.327222] Hardware name: Supermicro X9SRD-F/X9SRD-F, BIOS 1.0a 10/15/2012 Oct 11 12:14:41 node-62 kernel: [220710.327226] 0000000000000000 0000000000000000 ffffffff8137d7c1 ffff8810214093c0 Oct 11 12:14:41 node-62 kernel: [220710.327233] ffffffff810d9a27 ffffffffa03e7e50 ffffffffa03d6d98 ffffffffa03d6d70 Oct 11 12:14:41 node-62 kernel: [220710.327238] ffffffff8108442e ffffffff8105946c 00000000813806ec ffffffff81645210 Oct 11 12:14:41 node-62 kernel: [220710.327244] Call Trace: Oct 11 12:14:41 node-62 kernel: [220710.327256] [<ffffffff8137d7c1>] ? dump_stack+0x41/0x51 Oct 11 12:14:41 node-62 kernel: [220710.327282] [<ffffffff810d9a27>] ? kmem_cache_destroy+0xcb/0xdf Oct 11 12:14:41 node-62 kernel: [220710.327297] [<ffffffffa03d6d98>] ? iscsi_target_cleanup_module+0x28/0x290 [iscsi_target_mod] Oct 11 12:14:41 node-62 kernel: [220710.327308] [<ffffffffa03d6d70>] ? iscsit_put_transport+0xf/0xf [iscsi_target_mod] Oct 11 12:14:41 node-62 kernel: [220710.327315] [<ffffffff8108442e>] ? SyS_delete_module+0x215/0x299 Oct 11 12:14:41 node-62 kernel: [220710.327321] [<ffffffff8105946c>] ? should_resched+0x5/0x23 Oct 11 12:14:41 node-62 kernel: [220710.327327] [<ffffffff81381b32>] ? page_fault+0x22/0x30 Oct 11 12:14:41 node-62 kernel: [220710.327333] [<ffffffff81386722>] ? system_call_fastpath+0x16/0x1b So the kmem_cache_destroy item indicates that the target service was stopped very soon after the ABORT_TASKs occured. Can you confirm that the target was stopped, and not restarted..? --nab -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html