Re: iSCSI target data corruption in vSphere 5

Martin Svec <martin.svec@xxxxxxxx> · Wed, 14 Mar 2012 14:31:43 +0100

Hello Nicholas,

The latest lio-core head passes all my tests related to this bug. Thanks a lot for your fast fix; I was struggling with this bug for several days just to find a reproducible case of it :-)

I'm going to test the patches with the 3.2 stable series as well.

Martin

On 14.3.2012 5:07, Nicholas A. Bellinger wrote:
On Tue, 2012-03-13 at 19:20 +0100, Martin Svec wrote:
Hello,

I have a problem with occasional data corruption when using 3.2.x LIO
iSCSI target as a SAN storage in VMware vSphere 5. My tests show that under
special circumstances, some writes to the target seem to be partially lost.
The problem is probably related to VMFS thin provisioning and causes random
BSODs and filesystem corruptions of guests.

Hello Martin,

Thank you very much for reporting this.  I've been able to track the bug
down this afternoon to some iscsi-target specific code for invoking
reservation conflict status that was broken during PYX_TRANSPORT_*
exception refactoring in the v3.2 timeframe.

I've been able to reproduce the issue directly w/o ESX, and am also able
to verify the fix.

Please go ahead and update to the following lio-core head:

commit dd9604b1a2558f4c7b8c9f29d0d7cd92ed74ff3e
Author: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx>
Date:   Tue Mar 13 18:20:11 2012 -0700

     iscsi-target: Fix reservation conflict -EBUSY response handling bug

I'm going to queue this up for 3.3-urgent very soon, so please let me
know if you still have problems with commit dd9604b1a25.

Thank you,

--nab

In short, if I have two VMs on two different ESXi hosts that use the same
LUN as a datastore for their VMDK disks, and one of the VMs has a growing
thin-provisioned disk, then concurrent guest disk activity causes some
writes in the _opposite_ VM to be lost.

The following setup seems to reliably reproduce the bug:

(1) Create a vSphere 5 cluster environment with two ESXi hosts, ESX1
      and ESX2.
(2) Create two Linux virtual machines, VM1 located on ESX1 local disk
      and VM2 located on ESX2 local disk. It's important that they are on
      two different hosts!
(3) Create a clean new shared VMFS5 SAN datastore based on a LUN
      provided by LIO iSCSI target.
(4) Create a 1GB VMDK disk on this LIO datastore and add it to VM1 as /dev/sdb.
(5) Sequentially fill VM1's /dev/sdb with a known pattern, say 4kB "AAAA" blocks.
(6) Re-read VM1 sdb to check that it really contains "AAAA" pattern.
(7) Create a 1GB _thin provisioned_ disk on the LIO datastore and add it to VM2.
(8) In VM1, start the fill of /dev/sdb again, now with the "bbbb" pattern.
(9) At the same time, start a similar fill in VM2 that writes "cccc" pattern
      to its /dev/sdb. It is important to start the fill in both VMs at the
      same time so that they write to their disks concurrently. At this point,
      VM1 is overwriting its fully allocated disk and VM2 is growing its thin
      disk and filling it.
(10) Re-read VM2 sdb disk - it contains "cccc" pattern, there's no problem.
(11) Re-read VM1 sdb disk - instead of contiguous "bbbb" pattern, there are
      rare occurrences of pieces of the original "AAAA" pattern, which means
      that some of the "bbbb" writes were only partially written or weren't
      written at all!
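
For anyone who wants to script the fill and re-read steps (5), (6), (8) and (11), here is a minimal Python sketch. It is not the tool used for the tests above; the function names `fill` and `count_bad_blocks` are made up for illustration, and the path can be a block device such as /dev/sdb or an ordinary pre-sized file. On a real device, a re-read may be served from the guest page cache unless the cache is dropped first.

```python
import os

BLOCK_SIZE = 4096  # 4kB pattern blocks, as in step (5)


def _block(pattern, block_size):
    """Build one block by repeating the pattern and truncating to block_size."""
    return (pattern * (block_size // len(pattern) + 1))[:block_size]


def fill(path, pattern, total_bytes, block_size=BLOCK_SIZE):
    """Sequentially overwrite the first total_bytes of path with the pattern."""
    block = _block(pattern, block_size)
    with open(path, "r+b") as dev:  # the device or file must already exist
        written = 0
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            dev.write(chunk)
            written += len(chunk)
        dev.flush()
        os.fsync(dev.fileno())  # push the data out of the local page cache


def count_bad_blocks(path, pattern, total_bytes, block_size=BLOCK_SIZE):
    """Re-read path and count blocks that differ from the expected pattern."""
    block = _block(pattern, block_size)
    bad = 0
    with open(path, "rb") as dev:
        offset = 0
        while offset < total_bytes:
            want = block[: min(block_size, total_bytes - offset)]
            if dev.read(len(want)) != want:
                bad += 1
            offset += len(want)
    return bad
```

With this, step (8) would be roughly `fill("/dev/sdb", b"bbbb", 1 << 30)` in VM1, and a non-zero result from `count_bad_blocks("/dev/sdb", b"bbbb", 1 << 30)` in step (11) would indicate lost writes.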

I'm not sure that this is the only possible scenario but at least it's
100% reproducible for me.

Notes:

(*) Only kernels >= 3.2 seem to be affected. I regularly reproduce the bug
with vanilla 3.2.0, stable 3.2.9, and latest vanilla 3.3-rcX from git.
On the other hand, vanilla 3.1.0 is always OK. So the bug was probably
introduced in 3.2.

(*) The bug occurs only during the on-demand growth of thin VMDK disks. When
I repeat the test with VMDKs that are already fully allocated, everything is
OK. As VMDK thin growing involves SCSI-2 reservations for cluster-wide locking,
maybe these reservations in some way interfere with the writes from the other
session?

(*) It is necessary to run the test from _two_ ESXi hosts. If VM1 and VM2 are on
the same host, there is no problem.

(*) The problem is not related to missing WRITE_SAME support in LIO. I perform
all tests with ESXi DataMover.HardwareAcceleratedInit option turned off.

(*) There is no evidence of VMFS5 filesystem metadata corruption, there
are no errors in ESXi and LIO logs.

(*) A sequence of unwritten data always ends on a 4kB-aligned offset, regardless of
the pattern size, which also means that only parts of writes are lost, probably
in 4kB units. On the other hand, the gap's start offset depends on the pattern
size. Even if I write blocks of non-power-of-two size, the unwritten sequence always
starts after a fully written block. (However, I'm not sure how much this is
affected by the VM guest block layer.)

(*) The number of unwritten gaps is random, but I always get about 15-20 gaps
with the above test.

(*) All tests were performed with an LIO iblock device backed by an LVM volume.
I use two 1GE NICs with round-robin path selection policy in each ESXi host.
On the target side, there are two network portals too. Jumbo frames are enabled
everywhere.
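
The gap observations in the notes above (start offset, end offset, 4kB alignment of the end) could be collected with something like the following Python sketch. It is only an illustration under my own assumptions: `find_gaps` and `describe` are hypothetical names, the comparison is byte-wise against the repeated new pattern, and the pure-Python loop is slow on a 1GB disk.

```python
def find_gaps(path, pattern, total_bytes, chunk_size=1 << 20):
    """Return (start, end) byte ranges, end exclusive, that do not match
    the pattern repeated from offset 0."""
    gaps = []
    gap_start = None
    plen = len(pattern)
    offset = 0
    with open(path, "rb") as dev:
        while offset < total_bytes:
            data = dev.read(min(chunk_size, total_bytes - offset))
            if not data:
                break  # short device or file: stop at what was readable
            for i, byte in enumerate(data):
                pos = offset + i
                if byte != pattern[pos % plen]:
                    if gap_start is None:
                        gap_start = pos  # a new run of unwritten data begins
                elif gap_start is not None:
                    gaps.append((gap_start, pos))  # the run ended before pos
                    gap_start = None
            offset += len(data)
    if gap_start is not None:
        gaps.append((gap_start, offset))  # the run reached the end of the scan
    return gaps


def describe(gaps, align=4096):
    """Attach a flag telling whether each gap ends on an align-byte boundary."""
    return [(start, end, end % align == 0) for start, end in gaps]
```

For example, `describe(find_gaps("/dev/sdb", b"bbbb", 1 << 30))` would list each gap with its start, its end, and whether the end is 4kB-aligned.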

Does anybody have an idea what's wrong?

Regards

Martin

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Martin Švec
Technical Director of Internet Services
======================================================
ZONER software, a.s., Nové sady 18,  602 00 Brno, CZ
Tel.(fax): 543 257 244(245), e-mail: martin.svec@xxxxxxxx

www.zoner.eu | www.czechia.com | www.regzone.cz
