problem with DRBD backed iSCSI storage pool and KVM guests

	So I've been struggling to configure a proper HA
stack using DRBD on two dedicated back-end storage nodes and
KVM on two dedicated front-end nodes (so four machines total).

	I'm stuck at simply keeping an exported iSCSI LUN
consistent for a single VM while failing over between the
back-end DRBD storage nodes.

	In my testing, I think I've narrowed it down to KVM's
cache setting, but that doesn't make sense to me, and based on
what I've read on this list it looks like it will also get in
the way of live migration later.

	So the stack looks something like this.  The bottom DRBD
layer uses its normal write-back cache settings, since both
back-end storage machines have battery-backed units on their
RAID controllers.  My assumption is that a write only returns
successfully from the DRBD layer once both the local RAID
controller's cache and the DRBD peer (using protocol C, of
course) have acknowledged it.  I'm using a primary/secondary
setup for the DRBD component.
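
	For context, the DRBD resource definition is roughly the
following sketch (host names, devices, and addresses here are
invented; protocol C is the relevant part):
---
# /etc/drbd.d/r0.res -- sketch only, names are placeholders
resource r0 {
    protocol C;    # writes complete only once the peer has them too
    on store1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.1.1:7789;
        meta-disk internal;
    }
    on store2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.1.2:7789;
        meta-disk internal;
    }
}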

	The next layer slices up the DRBD device itself.  I'm
using nested LVM (not cLVM) for this, per the DRBD
documentation.  My understanding is that cLVM shouldn't be
necessary since the volume group is only ever active on the
primary DRBD node, so no cluster-wide locking should be needed.
Hopefully that's correct.
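
	As a sketch, the nested LVM setup from the DRBD
documentation amounts to something like this on whichever node
is currently DRBD primary (volume group and LV names here are
made up):
---
# Run only on the current DRBD primary; the VG is never active on both.
pvcreate /dev/drbd0
vgcreate vg_drbd /dev/drbd0
lvcreate -L 20G -n lv_vm1 vg_drbd    # one LV per exported LUN

# On failover, the cluster deactivates and reactivates the VG:
#   old primary:  vgchange -an vg_drbd
#   new primary:  vgchange -ay vg_drbd

along with an lvm.conf filter so the storage nodes only scan
/dev/drbd* (and not the backing disk) for PV signatures.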

	On to the iSCSI layer: I'm using tgtd on the target side
on each back-end node and iscsid on the initiator side on the
front-end nodes.  I have the write cache disabled on both the
target and the initiator, as far as I seemingly can.  I'm
passing this crazy option for it via tgtadm:
---
mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0

since corosync is doing everything within the cluster stack to
set up and tear down the iSCSI target and LUNs, instead of
defining the write-cache off option in /etc/tgt/targets.conf.
I've confirmed that "sdparm --get=WCE" returns:
---
WCE         0  [cha: n, def:  0]

as expected from the initiator.  But I'm still honestly not
sure that the target daemon isn't using the page cache on the
current primary back-end node.  This might be the source of my
problem, but documentation on this is sparse, and the mode_page
above, plus the check via sdparm, is the best I could find.
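
	For what it's worth, the mode_page string can be
sanity-checked by hand.  As I read tgtd's format, it is
page:subpage:length followed by the page body bytes, so after
"8:0:18:" (caching page 0x08, 18 body bytes) the first data
byte, 0x10, is the caching flags byte; WCE is bit 2 and RCD is
bit 0.  A quick decode, assuming that layout:

```shell
#!/bin/bash
# Flags byte from the caching mode page (0x08) in the tgtadm option
# above; bits, MSB first: IC ABPF CAP DISC SIZE WCE MF RCD.
flags=$((0x10))

if (( flags & 0x04 )); then
    echo "WCE set: write cache enabled"
else
    echo "WCE clear: write cache disabled"
fi

if (( flags & 0x01 )); then
    echo "RCD set: read cache disabled"
else
    echo "RCD clear: read cache enabled"
fi
```

which agrees with the WCE 0 that sdparm reports from the
initiator side.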

	And finally, there's KVM itself.  To test all of this,
I created a random 1GB file from /dev/urandom on the guest
(using RHEL 6 for both the host and the guest).  I then copy
the random file to a new file and, while the copy is running,
force the current back-end primary node into standby.  That
successfully restarts the entire stack of components in roughly
10-15 seconds.  I have the initiators set to:
---
node.session.timeo.replacement_timeout = -1

which, if I understand the configuration file comments
correctly, should make the initiator queue I/O forever and
never report SCSI errors up the stack.
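
	For completeness, that's this excerpt from
/etc/iscsi/iscsid.conf (one caveat I'm aware of: open-iscsi
records settings per node at discovery time, so already
discovered node records need an explicit update):
---
# /etc/iscsi/iscsid.conf (excerpt)
# -1 queues I/O indefinitely on session loss instead of failing
# it up the stack with SCSI errors after the timeout expires.
node.session.timeo.replacement_timeout = -1

# For existing node records, something like:
#   iscsiadm -m node -o update \
#       -n node.session.timeo.replacement_timeout -v -1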

	Anyway, the failover finishes, I diff the two files and
I also run md5sum on each.  Now, this is the part where I'm
stuck.  If I define the virtual disk within KVM to use
writethrough cache mode, then while I see a bunch of:
---
Buffer I/O error on device dm-0, logical block ...
end_request: I/O error, dev vda, sector ...

those kinds of error messages, the cp finishes and the new file
appears to be a bit-for-bit copy of the original.  Everything
seems to have worked.

	If I set the cache mode to none, which apparently I'll
need to do anyway for live migration to work (the ultimate goal
in all of this), then I see the same errors as above, except
they appear immediately when I initiate the standby operation
on the cluster, whereas in writethrough mode they don't show up
for a bit.  And not only do the two files typically differ,
it's usually not long before the ext4 file system sitting on
top of vda becomes very unhappy and gets remounted read-only.
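
	For the record, the two cases differ only in the cache
attribute on the guest's disk element in the libvirt domain XML
(the source path below is a placeholder):
---
<!-- libvirt domain XML excerpt; cache='none' uses O_DIRECT on the
     host, while cache='writethrough' goes through the host page
     cache but flushes on every write. -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/disk/by-path/ip-192.168.1.10:3260-iscsi-...-lun-1'/>
  <target dev='vda' bus='virtio'/>
</disk>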

	So am I missing something here?  With the -1 in the
iscsid configuration above, I assumed KVM would never see any
sort of errors at all, but would instead simply hang
indefinitely until things came back.  Is anyone else running a
setup like this?

	Thanks for reading.  I can post configuration files as
needed or take this to the open-iscsi lists next if KVM doesn't
appear to be the issue at this point.

-- 
Mark Nipper
nipsy@xxxxxxxxxxxx (XMPP)
+1 979 575 3193
-
There are 10 kinds of people in the world; those who know binary
and those who don't.
--

