So I've been struggling with configuring a proper HA stack using DRBD on two dedicated back-end storage nodes and KVM on two dedicated front-end nodes (so four machines total). I'm stuck at just keeping an exported iSCSI LUN consistent for one VM while switching over on the back-end DRBD storage nodes. In my testing, I think I've narrowed it down to KVM's cache setting, but it doesn't make sense, and it looks like it will inhibit things later for live migration based on what I've read on this list.

So the stack looks something like this. I have the bottom DRBD layer set to use its normal write-back cache settings, since both back-end storage machines have battery-backed units for their RAID controllers, and I assume that writes only return successfully from the DRBD layer when both the storage controller and the network synchronization protocol (protocol C, of course) acknowledge the write (to the RAID controller's cache and to the DRBD partner). I'm using a primary/secondary setup for the DRBD component.

The next layer is the slicing up of the exported DRBD device itself. I'm using nested LVM (not cLVM) for this, per the DRBD documentation. It's my understanding that cLVM shouldn't be necessary since the volume group is only active on the primary DRBD node, so no cluster locking should be needed. Hopefully that is correct.

On to the iSCSI layer: I'm using tgtd on the target side on each back-end node and iscsid on the initiator side on the front-end nodes. I have the write cache on both the target and the initiator disabled as much as I seemingly can. I'm passing this crazy option via tgtadm:
---
mode_page=8:0:18:0x10:0:0xff:0xff:0:0:0xff:0xff:0xff:0xff:0x80:0x14:0:0:0:0:0:0
---
since corosync is doing everything within the cluster stack to set up and tear down the iSCSI target and LUNs, rather than defining the write-cache off option in /etc/tgt/targets.conf. I can confirm that "sdparm --get=WCE" returns:
---
WCE 0 [cha: n, def: 0]
---
as expected from the initiator. But I'm still honestly not sure that the target daemon isn't using the page cache on the current primary back-end node. This might be the source of my problem, but documentation on this is sparse, and the mode_page above is the best I could find, along with the check via sdparm.

And finally, there's KVM itself. To test all of this, I created a random 1GB file from /dev/urandom on the guest (RHEL 6 for both the host and the guest). I then copy the random file to a new file and force the current back-end primary node into standby. This successfully restarts the entire stack of components within, say, 10-15 seconds. I have the initiators set to:
---
node.session.timeo.replacement_timeout = -1
---
which, if I understand the configuration file comments correctly, should hang forever and never report SCSI errors higher up the stack. Anyway, the failover finishes, I diff the two files, and I also run md5sum on each.

Now, this is the part where I'm stuck. If I define the virtual disk within KVM to use writethrough cache mode, then while I see a bunch of errors like:
---
Buffer I/O error on device dm-0, logical block ...
end_request: I/O error, dev vda, sector ...
---
the cp finishes and the new file seems to be a bit-for-bit copy of the original. Everything appears to have worked.
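For concreteness, the cache mode I'm talking about is the cache attribute on the driver line in the libvirt disk XML for the guest. A rough sketch of what such a definition looks like is below; the source path and IQN are placeholders for illustration rather than my actual values, and the cache attribute is the only thing I change between the two tests:
---
<disk type='block' device='disk'>
  <!-- cache='writethrough' is the mode that survives the failover for me;
       cache='none' is apparently what live migration wants -->
  <driver name='qemu' type='raw' cache='writethrough'/>
  <!-- placeholder by-path name for the iSCSI LUN as seen from the initiator -->
  <source dev='/dev/disk/by-path/ip-192.168.0.10:3260-iscsi-iqn.2012-01.com.example:vm1-lun-1'/>
  <target dev='vda' bus='virtio'/>
</disk>
---
My understanding from reading this list is that cache='none' (O_DIRECT on the host) is preferred for live migration precisely because it keeps the host page cache out of the picture.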
If I set the cache to none, which apparently I'll need to do anyway for live migration to work (which is the ultimate goal in all of this), then I see the same errors as above (although with cache set to none they show up immediately, as soon as I initiate the standby operation on the cluster, whereas with writethrough the messages don't appear for a bit), and not only do the files typically differ, it's usually not long before the ext4 file system sitting on top of vda becomes very unhappy and gets remounted read-only.

So am I missing something here? With the -1 in the iscsid configuration above, I assumed KVM would never see any errors at all, but would instead simply hang indefinitely until things came back. Is anyone else running a setup like this? Thanks for reading. I can post configuration files as needed, or take this to the open-iscsi lists next if KVM doesn't appear to be the issue at this point.

--
Mark Nipper
nipsy@xxxxxxxxxxxx (XMPP)
+1 979 575 3193
-
There are 10 kinds of people in the world;
those who know binary and those who don't.