Hello,

First of all, I would like to thank you for your patience...

> There is a lot of confusion by newcomers to iSCSI storage.
>
> A lot of the time they think of iSCSI as yet another
> file sharing method, which it isn't, it is a disk sharing
> method, and if you allow 2 hosts to access the same disk
> without putting special controls in place to make sure
> that either 1) only 1 host at a time can access a given
> disk, or 2) install a clustering file system that allows
> multiple hosts to access the same disk at the same time,
> then they will experience data corruption as there is
> nothing preventing any two hosts from writing data on
> top of each other.

I understand. iSCSI has nothing to do with files or filesystems. iSCSI (and SCSI, for that matter) only works with blocks. If several machines access the same filesystem and that filesystem is not cluster-aware, you will get lots of corruption.

> The performance penalty you speak of with blockio being accessed
> through a local iSCSI connection should really not be noticed
> except for extreme high-end processing, which if that is the
> case you are picking the wrong technology.

We have a BladeCenter with FC storage for that :)
What we are trying to do is remove "unnecessary" load from the MSA-connected machines, as they will also be used for virtual machines.

> When you mount an iscsi target locally the open-iscsi initiator
> does agressive caching of io, then the file system of the OS
> does agressive caching itself, so it's not as if all io becomes
> synchronous in this scenario.

You are correct, but that also happens with 2 open-iscsi initiators accessing the same exported volume from different machines. The only difference is that, instead of the msa500 volume being exported directly by iscsi-target, there is a middle layer (device-mapper) between the msa500 volume and the iscsi-target. Device-mapper does no caching.

When we do an fsync in a guest machine, it goes:

virtual machine fsync -> clvmd/lvm -> iscsi-initiator -> iscsi-target -> device-mapper -> msa500

When the virtual machine is running on the msa500-connected hardware, we get:

virtual machine fsync -> clvmd/lvm -> device-mapper (linear) -> msa500

> Now you can use clvm between the iSCSI targets to manage
> how the MSA500 storage is allocated for the creation of
> iSCSI targets, but once exported by iSCSI, these servers
> should not care about what the initiators put into it
> or how they manage it.

That would require us to keep changing the iscsi-target and initiator configuration, as well as rerunning iSCSI discovery and updating multipath on all the iscsi-initiator machines. When we create a volume for a virtual machine we would have to:

1 - create the volume in the clvmd that manages the storage
2 - change ietd.conf to allow it to be exported
3 - discover the new device on the initiators
4 - change multipath on the initiators to include the new volume

Drawbacks:

1 - lots of changes in conf files and restarting of services :)
2 - Multipath has a path checker that checks whether a path is alive (usually readsector0, which reads block 0). That would give me lots and lots of block 0 reads:

total checks in msa500 = num client machines * num multipath devices * num iscsi-target machines

With 8 machines and 40 volumes we would have:

8 * 40 * 2 = 640 IO checks

>                +--------+ <-> |- initiator1
>                | iSCSI1 |     |
> +--------+ <-> +--------+ <-> |- initiator2
> | MSA500 | (2)    (3)     (4) |     (5)
> +--------+ <-> +--------+ <-> |- initiator3
>    (1)         | iSCSI2 |     |
>                +--------+ <-> |- initiator4
>
> 1) MSA500 provides volume1, volume2 to
>    fiber hosts iSCSI1/iSCSI2
> 2) iSCSI1/iSCSI2 fiber connect to MSA500
> 3) iSCSI1/iSCSI2 use clvm to divvy up
>    volume1 and volume2 into target1, target2
>    target3, target4, target5 to iSCSI network
> 4) iSCSI1/iSCSI2 provide targets to iSCSI
>    network through bonded pairs
> 5) initiators use clvm to divvy up target1,
>    target2, target3... storage for use by Xen
>    domains.
>
> I hope that helps.

We are running stress tests (bonnie++, ctcs) with our "hack" and so far it has not had any problems. We even shut down one of the iscsi-target nodes: there is a small hiccup (as one path fails), but I/O continues shortly after.

We've changed

node.session.timeo.replacement_timeout
node.conn[0].timeo.noop_out_interval
node.conn[0].timeo.noop_out_timeout

to speed up the failover.

Thanks again,
Nuno Fernandes

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
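
P.S. For anyone who wants to redo the path-checker arithmetic from the drawbacks list with their own numbers, here is a minimal Python sketch. The function name and the 5-second polling interval are my own assumptions for illustration; they are not taken from our configuration. It simply multiplies clients, multipath devices and iscsi-target machines as in the formula above.

```python
def total_path_checks(num_clients: int, num_devices: int, num_targets: int) -> int:
    """Path checks hitting the storage in one polling round.

    Each client checks every multipath device once per path, and there is
    one path per iscsi-target machine.
    """
    return num_clients * num_devices * num_targets


# Figures from this mail: 8 client machines, 40 volumes, 2 iscsi-target machines.
checks_per_round = total_path_checks(8, 40, 2)
print(checks_per_round)  # 640

# Assumed polling interval in seconds (illustrative value only).
polling_interval_s = 5
checks_per_second = checks_per_round / polling_interval_s
print(checks_per_second)  # 128.0
```

The point is only that the load grows with the product of the three numbers, so every new volume adds (clients * targets) checks per round.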