On Tue, 2012-03-13 at 19:20 +0100, Martin Svec wrote:
> Hello,
>
> I have a problem with occasional data corruption when using a 3.2.x LIO
> iSCSI target as SAN storage in VMware vSphere 5. My tests show that under
> special circumstances, some writes to the target seem to be partially lost.
> The problem is probably related to VMFS thin provisioning and causes random
> BSODs and filesystem corruption of guests.
>

Hello Martin,

Thank you very much for reporting this. I've been able to track the bug
down this afternoon to some iscsi-target specific code for invoking
reservation conflict status that was broken during the PYX_TRANSPORT_*
exception refactoring in the v3.2 timeframe.

I've been able to reproduce the issue directly w/o ESX, and am also able
to verify the fix. Please go ahead and update to the following lio-core
head:

commit dd9604b1a2558f4c7b8c9f29d0d7cd92ed74ff3e
Author: Nicholas Bellinger <nab@xxxxxxxxxxxxxxx>
Date:   Tue Mar 13 18:20:11 2012 -0700

    iscsi-target: Fix reservation conflict -EBUSY response handling bug

I'm going to queue this up for 3.3-urgent very soon, so please let me
know if you still have problems with commit dd9604b1a25.

Thank you,

--nab

> In short, if I have two VMs on two different ESXi hosts that use the same
> LUN as a datastore for their VMDK disks, and one of the VMs has a growing
> thin-provisioned disk, then concurrent guest disk activity causes some
> writes in the _opposite_ VM to be lost.
>
> The following setup seems to reliably reproduce the bug:
>
> (1) Create a vSphere 5 cluster environment with two ESXi hosts, ESX1
>     and ESX2.
> (2) Create two Linux virtual machines, VM1 located on ESX1's local disk
>     and VM2 located on ESX2's local disk. It's important that they are on
>     two different hosts!
> (3) Create a clean new shared VMFS5 SAN datastore based on a LUN
>     provided by the LIO iSCSI target.
> (4) Create a 1GB VMDK disk on this LIO datastore and add it to VM1 as /dev/sdb.
> (5) Sequentially fill VM1's /dev/sdb with a known pattern, say 4kB "AAAA" blocks.
> (6) Re-read VM1's sdb to check that it really contains the "AAAA" pattern.
> (7) Create a 1GB _thin provisioned_ disk on the LIO datastore and add it to VM2.
> (8) In VM1, start the fill of /dev/sdb again, now with a "bbbb" pattern.
> (9) At the same time, start a similar fill in VM2 that writes a "cccc" pattern
>     to its /dev/sdb. It is important to start the fills in both VMs at the
>     same time so that they write to their disks concurrently. At this point,
>     VM1 is overwriting its fully allocated disk and VM2 is growing its thin
>     disk and filling it.
> (10) Re-read VM2's sdb disk - it contains the "cccc" pattern, there's no problem.
> (11) Re-read VM1's sdb disk - instead of a contiguous "bbbb" pattern, there are
>      rare occurrences of pieces of the original "AAAA" pattern, which means
>      that some of the "bbbb" writes were only partially written or weren't
>      written at all!
>
> I'm not sure that this is the only possible scenario, but at least it's
> 100% reproducible for me.
>
> Notes:
>
> (*) Only kernels >= 3.2 seem to be affected. I regularly reproduce the bug
>     with vanilla 3.2.0, stable 3.2.9, and the latest vanilla 3.3-rcX from git.
>     On the other hand, vanilla 3.1.0 is always OK. So the bug was probably
>     introduced in 3.2.
>
> (*) The bug occurs only during the on-demand growth of thin VMDK disks. When
>     I repeat the test with VMDKs that are already fully allocated, everything
>     is OK.
>     As VMDK thin growing involves SCSI-2 reservations for cluster-wide
>     locking, maybe these reservations somehow interfere with the writes
>     from the other session?
>
> (*) It is necessary to run the test from _two_ ESXi hosts. If VM1 and VM2
>     are on the same host, there is no problem.
>
> (*) The problem is not related to missing WRITE_SAME support in LIO. I
>     perform all tests with the ESXi DataMover.HardwareAcceleratedInit option
>     turned off.
>
> (*) There is no evidence of VMFS5 filesystem metadata corruption, and there
>     are no errors in the ESXi or LIO logs.
>
> (*) A sequence of unwritten data always ends on a 4kB-aligned offset,
>     regardless of the pattern size. This also means that only parts of writes
>     are lost, probably in 4kB units. On the other hand, a gap's start offset
>     depends on the pattern size. Even if I write blocks of a non-power-of-two
>     size, an unwritten sequence always starts after a fully written block.
>     (However, I'm not sure how much this is affected by the VM guest block
>     layer.)
>
> (*) The number of unwritten gaps is random, but I always get about 15-20 gaps
>     with the above test.
>
> (*) All tests were performed with an LIO iblock device backed by an LVM
>     volume. I use two 1GE NICs with a round-robin path selection policy in
>     each ESXi host. On the target side, there are two network portals too.
>     Jumbo frames are enabled everywhere.
>
> Does anybody have an idea what's wrong?
>
> Regards
>
> Martin
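
For anyone trying to reproduce this, the fill-and-verify procedure from steps
(5)-(11) of the quoted report can be scripted roughly as below, run inside the
guest. This is only a minimal sketch: the device path, the 4kB block size and
the pattern bytes are illustrative assumptions, not Martin's actual test
tooling.

#!/usr/bin/env python3
# Minimal sketch of the fill-and-verify test from steps (5)-(11) above.
# The device path, 4 kB block size and pattern bytes are illustrative
# assumptions; they are not taken from the original report's tooling.
import os
import sys

DEV = "/dev/sdb"                   # test disk inside the guest (assumption)
BLOCK = 4096                       # 4 kB blocks, as in the report
PATTERN = b"bbbb" * (BLOCK // 4)   # one block filled with the "bbbb" pattern

def device_size(dev):
    # Return the size of the block device in bytes.
    with open(dev, "rb") as f:
        return f.seek(0, os.SEEK_END)

def fill(dev, pattern):
    # Step (8): sequentially overwrite the whole device with pattern blocks.
    blocks = device_size(dev) // len(pattern)
    with open(dev, "r+b", buffering=0) as f:
        for _ in range(blocks):
            f.write(pattern)
        os.fsync(f.fileno())

def verify(dev, pattern):
    # Step (11): re-read the device and report every mismatching block.
    blocks = device_size(dev) // len(pattern)
    bad = 0
    with open(dev, "rb", buffering=0) as f:
        for i in range(blocks):
            if f.read(len(pattern)) != pattern:
                print("mismatch in block at offset %d" % (i * len(pattern)))
                bad += 1
    return bad

if __name__ == "__main__":
    fill(DEV, PATTERN)
    sys.exit(1 if verify(DEV, PATTERN) else 0)

Running the fill concurrently with the thin-disk fill in the other VM, and
then running the verify pass, corresponds to steps (8)-(11) above.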
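
The 4kB-alignment note above can be checked in a similar way: after the "bbbb"
overwrite, scan the re-read disk for leftover runs of the old "AAAA" bytes and
print each run's start and end offsets together with the end offset modulo
4096. Again only a minimal sketch under the same assumptions (device path,
stale byte value); it is not part of the original report.

#!/usr/bin/env python3
# Sketch of a gap scanner for the 4 kB-alignment note above: locate every
# run of stale "A" bytes left over from the first fill and print its start
# and end offsets, so the gap count and end alignment can be checked.
# The device path and stale byte value are assumptions.

DEV = "/dev/sdb"        # test disk inside the guest (assumption)
STALE = ord("A")        # byte value written by the first ("AAAA") fill
READ_CHUNK = 1 << 20    # read 1 MiB at a time

def scan_gaps(dev):
    # Return a list of (start, end) byte ranges still holding stale bytes.
    gaps = []
    start = None
    offset = 0
    with open(dev, "rb", buffering=0) as f:
        while True:
            chunk = f.read(READ_CHUNK)
            if not chunk:
                break
            if start is None and STALE not in chunk:
                offset += len(chunk)        # fast path: nothing stale here
                continue
            for i, byte in enumerate(chunk):
                if byte == STALE:
                    if start is None:
                        start = offset + i  # a new gap begins here
                elif start is not None:
                    gaps.append((start, offset + i))
                    start = None
            offset += len(chunk)
    if start is not None:
        gaps.append((start, offset))        # gap runs to the end of the device
    return gaps

if __name__ == "__main__":
    gaps = scan_gaps(DEV)
    for start, end in gaps:
        print("stale run 0x%08x-0x%08x (%d bytes), end %% 4096 = %d"
              % (start, end, end - start, end % 4096))
    print("%d stale runs found" % len(gaps))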