Hello,

I have a problem with occasional data corruption when using a 3.2.x LIO iSCSI target as SAN storage in VMware vSphere 5. My tests show that under special circumstances, some writes to the target seem to be partially lost. The problem is probably related to VMFS thin provisioning and causes random BSODs and filesystem corruption in guests.

In short: if I have two VMs on two different ESXi hosts that use the same LUN as a datastore for their VMDK disks, and one of the VMs has a growing thin-provisioned disk, then concurrent guest disk activity causes some writes in the _opposite_ VM to be lost.

The following setup seems to reliably reproduce the bug:

(1) Create a vSphere 5 cluster environment with two ESXi hosts, ESX1 and ESX2.
(2) Create two Linux virtual machines, VM1 located on ESX1's local disk and VM2 located on ESX2's local disk. It's important that they are on two different hosts!
(3) Create a clean new shared VMFS5 SAN datastore based on a LUN provided by the LIO iSCSI target.
(4) Create a 1GB VMDK disk on this LIO datastore and add it to VM1 as /dev/sdb.
(5) Sequentially fill VM1's /dev/sdb with a known pattern, say 4kB "AAAA" blocks.
(6) Re-read VM1's sdb to check that it really contains the "AAAA" pattern.
(7) Create a 1GB _thin provisioned_ disk on the LIO datastore and add it to VM2.
(8) In VM1, start the fill of /dev/sdb again, now with a "bbbb" pattern.
(9) At the same time, start a similar fill in VM2 that writes a "cccc" pattern to its /dev/sdb. It is important to start the fill in both VMs at the same time so that they write to their disks concurrently. At this point, VM1 is overwriting its fully allocated disk while VM2 is growing its thin disk and filling it.
(10) Re-read VM2's sdb disk - it contains the "cccc" pattern, there's no problem.
(11) Re-read VM1's sdb disk - instead of a contiguous "bbbb" pattern, there are rare occurrences of pieces of the original "AAAA" pattern, which means that some of the "bbbb" writes were only partially written or weren't written at all!

I'm not sure that this is the only possible scenario, but at least it's 100% reproducible for me.

Notes:

(*) Only kernels >= 3.2 seem to be affected. I regularly reproduce the bug with vanilla 3.2.0, stable 3.2.9, and the latest vanilla 3.3-rcX from git. On the other hand, vanilla 3.1.0 is always OK. So the bug was probably introduced in 3.2.

(*) The bug occurs only during the on-demand growth of thin VMDK disks. When I repeat the test with VMDKs that are already fully allocated, everything is OK. As VMDK thin growing involves SCSI-2 reservations for cluster-wide locking, maybe these reservations somehow interfere with the writes from the other session?

(*) It is necessary to run the test from _two_ ESXi hosts. If VM1 and VM2 are on the same host, there is no problem.

(*) The problem is not related to missing WRITE_SAME support in LIO. I perform all tests with the ESXi DataMover.HardwareAcceleratedInit option turned off.

(*) There is no evidence of VMFS5 filesystem metadata corruption, and there are no errors in the ESXi or LIO logs.

(*) A sequence of unwritten data always ends at a 4kB-aligned offset, regardless of the pattern size. This also means that only parts of writes are lost, probably in 4kB units. On the other hand, the gap's start offset depends on the pattern size: even if I write blocks of a non-power-of-two size, an unwritten sequence always starts after a fully written block. (However, I'm not sure how much this is affected by the guest block layer.)
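For reference, the fill/verify steps (5), (6), (8) and (11) can be done with something like the following - only a minimal Python sketch; the script name, device path, pattern and 4kB block size are just the values used above, and dd plus a compare would of course work as well. Run it as root inside the guest with "fill" to write the pattern, and without arguments to verify:

#!/usr/bin/env python3
# Minimal sketch of the in-guest fill/verify test.
# Device path, pattern and block size are just the values from the test above;
# run as root, e.g. "./filltest.py fill" to write, "./filltest.py" to verify.
import os
import sys

DEV = "/dev/sdb"                  # 1GB test disk inside the VM
BLOCK = 4096                      # fill/verify unit
PATTERN = b"bbbb" * (BLOCK // 4)  # "AAAA", "bbbb", "cccc", ... per run

def device_size(dev):
    # lseek(SEEK_END) on the block device returns its size in bytes
    with open(dev, "rb") as f:
        return f.seek(0, os.SEEK_END)

def fill(dev, pattern):
    # sequentially overwrite the whole device with the pattern
    blocks = device_size(dev) // BLOCK
    with open(dev, "r+b") as f:
        for _ in range(blocks):
            f.write(pattern)
        f.flush()
        os.fsync(f.fileno())

def verify(dev, pattern):
    # re-read the device and report every 4kB block that lost the pattern
    blocks = device_size(dev) // BLOCK
    with open(dev, "rb") as f:
        for i in range(blocks):
            data = f.read(BLOCK)
            if data != pattern:
                print("mismatch at offset %d: %r..." % (i * BLOCK, data[:8]))

if __name__ == "__main__":
    if sys.argv[1:] == ["fill"]:
        fill(DEV, PATTERN)
    else:
        verify(DEV, PATTERN)

Comparing in 4kB units is enough here, since the unwritten gaps always end on 4kB boundaries (see the note above).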
(*) The number of unwritten gaps is random, but I always get about 15-20 gaps with the above test.

(*) All tests were performed with an LIO iblock device backed by an LVM volume. Each ESXi host uses two 1GbE NICs with the round-robin path selection policy. On the target side, there are two network portals as well. Jumbo frames are enabled everywhere.

Does anybody have an idea what's wrong?

Regards

Martin