Re: ceph + vmware


 



> On 20 Jul 2016, at 18:38, Mike Christie <mchristi@xxxxxxxxxx> wrote:
> 
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>> 
>> Hi Mike,
>> 
>> Thanks for the update on the RHCS iSCSI target.
>> 
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
> 
> No HA support for sure. We are looking into non HA support though.
> 
>> 
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>> 
>> So we're currently running :
>> 
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>> 
>> Do you see anthing risky regarding this configuration ?
> 
> If you use an application that relies on SCSI persistent reservations, you
> could run into trouble, because some applications expect the reservation info
> to be present on the failover nodes as well as the active ones.
> 
> Depending on how you do failover and on the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you fail over, you have to make sure that the initial active node is
> fenced off and that IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist, and make sure IO on
> that node is flushed and failed before unblacklisting it.
> 

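The blacklist-based fencing described above could be sketched roughly as follows. This is only an illustration under assumptions: the client address is hypothetical, and the actual failover orchestration (pacemaker, scripts, manual) depends entirely on your setup.

```shell
# Fence the old active target node by blacklisting its client address
# on the Ceph cluster (address:nonce below is a placeholder).
ceph osd blacklist add 192.168.1.10:0/0

# Verify the entry is in place before promoting the standby target.
ceph osd blacklist ls

# ... fail over the iSCSI LUNs to the standby target here ...

# Only after the fenced node has flushed and failed its stuck IO
# should it be removed from the blacklist.
ceph osd blacklist rm 192.168.1.10:0/0
```

Note that blacklist entries can also be given an expiry time; whether to rely on that or on explicit removal is a design choice for the failover tooling.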
With iSCSI you can't really do hot failover unless you only use synchronous IO
(at least not with any of the open-source target implementations available).
Flushing the buffers doesn't really help, because you don't know which in-flight IO completed before the outage
and which didn't. You could end up with only part of the "transaction" written to persistent storage.

If you use synchronous IO all the way from the client to the persistent storage shared between
the iSCSI targets, then all should be fine; otherwise YMMV. Some people run it like that without realizing
the dangers and have never had a problem, so the risk may be strictly theoretical. It all depends on how often you need to
fail over and what data you are storing: corrupting a few images on a gallery site might be acceptable, but corrupting
a large database tablespace is no fun at all.

Some (non-open-source) solutions exist; Solaris supposedly does this in some way. Maybe an iSCSI guru
can chime in and tell us what magic they do, but I don't think it's possible without client support
(you essentially have to do something like transactions and replay the last transaction on failover). Maybe
something can be enabled in the protocol to make the iSCSI IO synchronous, or at least make it wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets), without making it synchronous all the way.

The one time I had to use it, I resorted to simply mirroring via mdraid on the client side over two targets sharing the same
DAS. That worked fine during testing but never went to production in the end.
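That client-side setup could be sketched roughly as below. Everything here is hypothetical (portal addresses, device names); the two SCSI devices are two iSCSI sessions to the same LUN, one through each target, and md mirrors writes across both paths.

```shell
# Discover and log in to the same LUN via both target portals
# (addresses are placeholders for the two target nodes).
iscsiadm -m discovery -t sendtargets -p 192.168.1.10
iscsiadm -m discovery -t sendtargets -p 192.168.1.11
iscsiadm -m node --login

# Suppose the sessions show up as /dev/sdb and /dev/sdc on this client.
# Mirror them with mdraid so IO survives the loss of either target.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

# Put a filesystem on the mirror as usual.
mkfs.xfs /dev/md0
```

Since both targets back onto the same DAS here, this is really path redundancy rather than a second copy of the data, so it papers over a target failure but not a storage failure.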

Jan

> 
>> 
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
> 
> I can't say, because I have not used stgt with rbd bs-type support enough.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



