Re: ceph + vmware

On 07/20/2016 11:52 AM, Jan Schermer wrote:
> 
>> On 20 Jul 2016, at 18:38, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>
>> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>>
>>> Hi Mike,
>>>
>>> Thanks for the update on the RHCS iSCSI target.
>>>
>>> Will the RHCS 2.1 iSCSI target be compatible with the VMware ESXi
>>> client? (Or is it too early to say / announce?)
>>
>> No HA support for sure. We are looking into non-HA support, though.
>>
>>>
>>> Knowing that an HA iSCSI target was on the roadmap, we chose iSCSI over
>>> NFS, so we'll just have to remap RBDs to RHCS targets when it's available.
>>>
>>> So we're currently running:
>>>
>>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>>> has all VAAI primitives enabled and runs the same configuration.
>>> - RBD images are mapped on each target using the kernel client (so no
>>> RBD cache).
>>> - 6 ESXi hosts. Each ESXi host can access the same LUNs through both
>>> targets, but in a failover manner, so that each ESXi host always
>>> accesses a given LUN through one target at a time.
>>> - LUNs are VMFS datastores, and VAAI primitives are enabled client side
>>> (except UNMAP, as per the default).
>>>
>>> Do you see anything risky regarding this configuration?
>>
>> If you use an application that uses SCSI persistent reservations, then
>> you could run into trouble, because some apps expect the reservation
>> info to be on the failover nodes as well as the active ones.
>>
>> Depending on how you do failover and the issue that caused the
>> failover, IO could be stuck on the old active node and cause data
>> corruption. If the initial active node loses its network connectivity
>> and you fail over, you have to make sure that the initial active node
>> is fenced off and that IO stuck on that node will never be executed. So
>> do something like add it to the Ceph monitor blacklist, and make sure
>> IO on that node is flushed and failed before unblacklisting it.
>>
> 
> With iSCSI you can't really do hot failover unless you only use synchronous IO.
> (With any of the open-source target software available.)

That is what we are working on adding.

Why did you only say iSCSI though?

> Flushing the buffers doesn't really help because you don't know what in-flight IO happened before the outage

To be clear, when I wrote flush I did not mean cache buffers. I only
meant the target's list of commands.

And, on the unblacklist comment: it is best to unmap images that are
under a blacklist and then remap them. Just running the osd blacklist
remove command would leave some krbd structs in a bad state.
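
As a minimal sketch of that order (the device path, image name, and
blacklist address below are placeholders; blacklist entries use Ceph's
addr:port/nonce form):

    # On the formerly blacklisted node: unmap the image first so krbd
    # tears down its state cleanly
    rbd unmap /dev/rbd0

    # Only then remove the blacklist entry (from a node with monitor
    # access)
    ceph osd blacklist rm 192.168.1.10:0/3000000001

    # Finally, remap the image before re-exporting it through the target
    rbd map rbd/myimage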

> and which didn't. You could end up with only part of the "transaction" written to persistent storage.
> 

Maybe I am not sure what you mean by hot failover.

If you are failing over for the case where one node just becomes
unreachable, then if you blacklist it before making another node active,
you know IO that had not been sent will be failed and never execute, and
partially sent IO will be failed and not execute. IO that was sent to
the OSD and is executing will be completed by the OSD before new IO to
the same sectors, so you would not end up with what looks like partial
transactions if you later did a read.
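
A rough sketch of that ordering, assuming you know the unreachable
gateway's client address (the address is a placeholder, and promoting
the standby is left to whatever cluster manager you use):

    # 1. Fence the unreachable active node first; until this is done,
    #    IO still queued on it could reach the OSDs later
    ceph osd blacklist add 192.168.1.10:0/3000000001

    # 2. Only once the blacklist entry is in place, make the standby
    #    target active and let initiators retry their IO against it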

If the OSDs die mid-write, you could end up with part of a command
written, but that could happen with any SCSI-based protocol.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



