On Tue, 2012-03-13 at 19:20 +0100, Martin Svec wrote:
> Hello,
>
> I have a problem with occasional data corruption when using a 3.2.x LIO
> iSCSI target as SAN storage in VMware vSphere 5. My tests show that under
> special circumstances, some writes to the target seem to be partially lost.
> The problem is probably related to VMFS thin provisioning and causes random
> BSODs and filesystem corruption of guests.
>

Hello Martin,

Thank you very much for reporting this. I've been able to track the bug
down this afternoon to some iscsi-target specific code for invoking
reservation conflict status that was broken during the PYX_TRANSPORT_*
exception refactoring in the v3.2 timeframe.

I've been able to reproduce the issue directly w/o ESX, and am also able
to verify the fix. Please go ahead and update to the following lio-core
head:

commit dd9604b1a2558f4c7b8c9f29d0d7cd92ed74ff3e
Author: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx>
Date:   Tue Mar 13 18:20:11 2012 -0700

    iscsi-target: Fix reservation conflict -EBUSY response handling bug

I'm going to queue this up for 3.3-urgent very soon, so please let me
know if you still have problems with commit dd9604b1a25.

Thank you,

--nab

> In short, if I have two VMs on two different ESXi hosts that use the same
> LUN as a datastore for their VMDK disks, and one of the VMs has a growing
> thin-provisioned disk, then concurrent guest disk activity causes some
> writes in the _opposite_ VM to be lost.
>
> The following setup seems to reliably reproduce the bug:
>
> (1) Create a vSphere 5 cluster environment with two ESXi hosts, ESX1
>     and ESX2.
> (2) Create two Linux virtual machines, VM1 located on ESX1's local disk
>     and VM2 located on ESX2's local disk. It's important that they are on
>     two different hosts!
> (3) Create a clean new shared VMFS5 SAN datastore based on a LUN
>     provided by the LIO iSCSI target.
> (4) Create a 1GB VMDK disk on this LIO datastore and add it to VM1 as /dev/sdb.
> (5) Sequentially fill VM1's /dev/sdb with a known pattern, say 4kB "AAAA" blocks.
> (6) Re-read VM1's sdb to check that it really contains the "AAAA" pattern.
> (7) Create a 1GB _thin provisioned_ disk on the LIO datastore and add it to VM2.
> (8) In VM1, start the fill of /dev/sdb again, now with a "bbbb" pattern.
> (9) At the same time, start a similar fill in VM2 that writes a "cccc" pattern
>     to its /dev/sdb. It is important to start the fills in both VMs at the
>     same time so that they write to their disks concurrently. At this point,
>     VM1 is overwriting its fully allocated disk and VM2 is growing its thin
>     disk and filling it.
> (10) Re-read VM2's sdb disk - it contains the "cccc" pattern, there's no problem.
> (11) Re-read VM1's sdb disk - instead of a contiguous "bbbb" pattern, there are
>      rare occurrences of pieces of the original "AAAA" pattern, which means
>      that some of the "bbbb" writes were only partially written or weren't
>      written at all!
>
> I'm not sure that this is the only possible scenario, but at least it's
> 100% reproducible for me.
>
> Notes:
>
> (*) Only kernels >= 3.2 seem to be affected. I regularly reproduce the bug
>     with vanilla 3.2.0, stable 3.2.9, and the latest vanilla 3.3-rcX from git.
>     On the other hand, vanilla 3.1.0 is always OK. So the bug was probably
>     introduced in 3.2.
>
> (*) The bug occurs only during the on-demand growth of thin VMDK disks. When
>     I repeat the test with VMDKs that are already fully allocated, everything
>     is OK.
>     As VMDK thin growing involves SCSI-2 reservations for cluster-wide
>     locking, maybe these reservations somehow interfere with the writes
>     from the other session?
>
> (*) It is necessary to run the test from _two_ ESXi hosts. If VM1 and VM2
>     are on the same host, there is no problem.
>
> (*) The problem is not related to missing WRITE_SAME support in LIO. I
>     perform all tests with the ESXi DataMover.HardwareAcceleratedInit option
>     turned off.
>
> (*) There is no evidence of VMFS5 filesystem metadata corruption, and there
>     are no errors in the ESXi or LIO logs.
>
> (*) A sequence of unwritten data always ends on a 4kB-aligned offset,
>     regardless of the pattern size. This also means that only parts of writes
>     are lost, probably in 4kB units. On the other hand, a gap's start offset
>     depends on the pattern size. Even if I write blocks of a non-power-of-two
>     size, an unwritten sequence always starts after a fully written block.
>     (However, I'm not sure how much this is affected by the VM guest block
>     layer.)
>
> (*) The number of unwritten gaps is random, but I always get about 15-20 gaps
>     with the above test.
>
> (*) All tests were performed with an LIO iblock device backed by an LVM
>     volume. I use two 1GE NICs with a round-robin path selection policy in
>     each ESXi host. On the target side, there are two network portals too.
>     Jumbo frames are enabled everywhere.
>
> Does anybody have an idea what's wrong?
>
> Regards
>
> Martin
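
For anyone trying to reproduce this, the fill-and-verify procedure from steps
(5)-(11) of the quoted report can be scripted roughly as below, run inside the
guest. This is only a minimal sketch: the device path, the 4kB block size and
the pattern bytes are illustrative assumptions, not Martin's actual test
tooling.

#!/usr/bin/env python3
# Minimal sketch of the fill-and-verify test from steps (5)-(11) above.
# The device path, 4 kB block size and pattern bytes are illustrative
# assumptions; they are not taken from the original report's tooling.
import os
import sys

DEV = "/dev/sdb"                   # test disk inside the guest (assumption)
BLOCK = 4096                       # 4 kB blocks, as in the report
PATTERN = b"bbbb" * (BLOCK // 4)   # one block filled with the "bbbb" pattern

def device_size(dev):
    # Return the size of the block device in bytes.
    with open(dev, "rb") as f:
        return f.seek(0, os.SEEK_END)

def fill(dev, pattern):
    # Step (8): sequentially overwrite the whole device with pattern blocks.
    blocks = device_size(dev) // len(pattern)
    with open(dev, "r+b", buffering=0) as f:
        for _ in range(blocks):
            f.write(pattern)
        os.fsync(f.fileno())

def verify(dev, pattern):
    # Step (11): re-read the device and report every mismatching block.
    blocks = device_size(dev) // len(pattern)
    bad = 0
    with open(dev, "rb", buffering=0) as f:
        for i in range(blocks):
            if f.read(len(pattern)) != pattern:
                print("mismatch in block at offset %d" % (i * len(pattern)))
                bad += 1
    return bad

if __name__ == "__main__":
    fill(DEV, PATTERN)
    sys.exit(1 if verify(DEV, PATTERN) else 0)

Running the fill concurrently with the thin-disk fill in the other VM, and
then running the verify pass, corresponds to steps (8)-(11) above.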
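
The 4kB-alignment note above can be checked in a similar way: after the "bbbb"
overwrite, scan the re-read disk for leftover runs of the old "AAAA" bytes and
print each run's start and end offsets together with the end offset modulo
4096. Again only a minimal sketch under the same assumptions (device path,
stale byte value); it is not part of the original report.

#!/usr/bin/env python3
# Sketch of a gap scanner for the 4 kB-alignment note above: locate every
# run of stale "A" bytes left over from the first fill and print its start
# and end offsets, so the gap count and end alignment can be checked.
# The device path and stale byte value are assumptions.

DEV = "/dev/sdb"        # test disk inside the guest (assumption)
STALE = ord("A")        # byte value written by the first ("AAAA") fill
READ_CHUNK = 1 << 20    # read 1 MiB at a time

def scan_gaps(dev):
    # Return a list of (start, end) byte ranges still holding stale bytes.
    gaps = []
    start = None
    offset = 0
    with open(dev, "rb", buffering=0) as f:
        while True:
            chunk = f.read(READ_CHUNK)
            if not chunk:
                break
            if start is None and STALE not in chunk:
                offset += len(chunk)        # fast path: nothing stale here
                continue
            for i, byte in enumerate(chunk):
                if byte == STALE:
                    if start is None:
                        start = offset + i  # a new gap begins here
                elif start is not None:
                    gaps.append((start, offset + i))
                    start = None
            offset += len(chunk)
    if start is not None:
        gaps.append((start, offset))        # gap runs to the end of the device
    return gaps

if __name__ == "__main__":
    gaps = scan_gaps(DEV)
    for start, end in gaps:
        print("stale run 0x%08x-0x%08x (%d bytes), end %% 4096 = %d"
              % (start, end, end - start, end % 4096))
    print("%d stale runs found" % len(gaps))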