Re: LUNs become unavailable with current git HEAD

Hello Nab,
I'll try to reproduce this over the weekend; I'm not sure that I can, but I'll
do my best.

A bit more information about what we did:

        - Network configuration was as always: two bonds with two links each,
          using MAC hashing, one IP per bond. The portal exposed both IPs.

        - We had iSCSI port binding with round robin enabled, so we used
          multiple iSCSI sessions per ESX server (4 active per LUN).

        - I dropped the target's buffer cache to demonstrate how to free up
          memory quickly. The following lines are from the scrollback buffer
          of my screen session:

(node-62) [~/work/linux-2.6] free
             total       used       free     shared    buffers     cached
Mem:      66083628   65708924     374704          0     230056   63309488
-/+ buffers/cache:    2169380   63914248
Swap:            0          0          0
(node-62) [~/work/linux-2.6] sync; echo 3 > /proc/sys/vm/drop_caches
free
(node-62) [~/work/linux-2.6] free
             total       used       free     shared    buffers     cached
Mem:      66083628     547728   65535900          0        980       7616
-/+ buffers/cache:     539132   65544496
Swap:            0          0          0
(node-62) [~/work/linux-2.6] dmesg | tail
[   11.899264] IPv6: ADDRCONF(NETDEV_CHANGE): bond1.101: link becomes ready
[   11.899542] IPv6: ADDRCONF(NETDEV_CHANGE): bond1.102: link becomes ready
[   12.315195] igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   12.398846] bonding: bond1: link status definitely up for interface eth2, 1000 Mbps full duplex.
[   14.587938] Bridge firewalling registered
[   14.665002] Rounding down aligned max_sectors from 4294967295 to 4294967288
[57628.200959] Detected MISCOMPARE for addr: ffff8805702cc000 buf: ffff880c4128a000
[57628.200965] Target/fileio: Send MISCOMPARE check condition and sense
[57628.336304] Detected MISCOMPARE for addr: ffff88066d090000 buf: ffff880c4128a000
[57628.336310] Target/fileio: Send MISCOMPARE check condition and sense
(node-62) [~/work/linux-2.6] exit
(node-62) [~/work/linux-2.6] free
             total       used       free     shared    buffers     cached
Mem:      66083628     600624   65483004          0       2848      83100
-/+ buffers/cache:     514676   65568952
Swap:            0          0          0

However, afterwards it was stable. And I did this one day before the I/O stall
happened.
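
For reference, the cache drop shown in the session above can be sketched like
this (assuming a Linux box with /proc; the helper name cached_kb is mine):

```shell
# Report the page-cache size from /proc/meminfo; drop the caches only
# when running as root, as in the session above.
cached_kb() { awk '/^Cached:/ {print $2}' /proc/meminfo; }

echo "cached before: $(cached_kb) kB"
if [ "$(id -u)" -eq 0 ]; then
    sync                               # flush dirty pages first
    echo 3 > /proc/sys/vm/drop_caches  # 1=pagecache, 2=slab, 3=both
    echo "cached after: $(cached_kb) kB"
fi
```

Note that drop_caches only frees clean cache (it is non-destructive), which is
why the sync comes first.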

        - Shortly before the All Paths Down event happened, we upgraded 8 ESX
          servers from 5.1 GA to the newest 5.1 patch available.

        - We had approx. 36 - 72 GB of static state (virtual machine hard
          disks) on the particular two LUNs in question.

        - While the issue happened, or shortly before or after, I deployed an
          8 GB fully patched w2k3 VM. This was the only incident we had. The
          other four days it was rock solid, with no issues whatsoever, even
          though we tried to stress it with the rescans/svMotions.
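
For completeness, the round-robin setup from the port-binding point above is
normally applied per device with esxcli on ESXi 5.x; the device ID below is a
placeholder, not one of our real LUNs:

```shell
# Assumption: ESXi 5.x esxcli syntax; naa.xxxx stands in for the real
# device identifier of the LUN.
# Set the path selection policy of one device to round robin:
esxcli storage nmp device set --device naa.xxxx --psp VMW_PSP_RR

# Check which policy a device currently uses:
esxcli storage nmp device list --device naa.xxxx
```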

I don't know if this information helps, but as I said, I'll do my best to
reproduce it. We saw the issue on all 8 ESX servers; everything locked up
until I rebooted the target. Afterwards everything was fine. Of course we had
a few timed-out tasks in vCenter, but those were only symptoms.
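
One note on the MISCOMPARE lines in the dmesg output above: as far as I
understand, they come from the target's COMPARE AND WRITE handling (the ATS
primitive ESX uses for on-disk locking). A miscompare means the verify buffer
did not match the blocks on disk, so the target returns a check condition
instead of writing. As a toy sketch of those semantics against a plain file
(all names here are made up, not the target's actual code):

```shell
# Toy model of COMPARE AND WRITE: read the "block", compare it with
# the verify buffer, and write the new data only on a match.
lun=$(mktemp)
printf 'AAAA' > "$lun"          # current on-disk contents
verify='AAAA'                   # what the initiator expects to find
newdata='BBBB'                  # what it wants to write atomically

if [ "$(head -c 4 "$lun")" = "$verify" ]; then
    printf '%s' "$newdata" > "$lun"
    echo "write applied"
else
    echo "MISCOMPARE: check condition" >&2
fi
rm -f "$lun"
```

In the real target the compare and the write happen atomically with respect
to other commands; the sketch only shows the decision, not the locking.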

Cheers,
        Thomas
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



