On 24/08/2015 15:11, Julien Escario wrote:
> Hello,
> First, let me say I'm really a noob with Ceph since I have only read
> some documentation.
>
> I'm now trying to deploy a Ceph cluster for testing purposes. The
> cluster is based on 3 (more if necessary) hypervisors running Proxmox
> 3.4.
>
> Before going further, I have an essential question: is Ceph usable
> for multi-site storage?

It depends on what you really need it to do (access patterns and
behaviour when a link goes down).

> Long story:
> My goal is to run hypervisors on 2 datacenters separated by 4ms of
> latency.

Note: unless you are studying Ceph's behaviour in this situation, this
"goal" is in fact a method to reach a goal. If you describe the actual
goal you might get different suggestions.

> Bandwidth is 1Gbps currently but will be upgraded in the near future.
>
> So is it possible to run an active/active Ceph cluster to get shared
> storage between the two sites?

It is, but it probably won't behave correctly in your case: the latency
and the bandwidth will hurt a lot. Any application requiring that data
is confirmed stored on disk will be hit by the 4ms latency, and the
1Gbps will have to be shared between inter-site replication traffic and
regular VM disk accesses. Your storage will most probably behave like a
very slow single hard drive shared between all your VMs. Some workloads
might still work correctly (for example if you don't have any
significant writes and most of your data fits in caches).

When the link between your 2 datacenters is severed, in the worst case
(no quorum reachable, or a crushmap that won't allow each pg to reach
min_size with only one datacenter) everything will freeze. In the best
case (giving priority to a single datacenter by running more monitors
on it and using a crushmap storing at least min_size replicas on it),
everything will keep running on that datacenter when the link goes
down.

You can work around part of the performance problems by using 3-way
replication, with 2 replicas on your primary datacenter and 1 on the
secondary where all OSDs are configured with primary affinity 0 (see
the crushmap and command sketches just below). All reads will then be
served from the primary datacenter and only writes will go to the
secondary. You'll have to run all your VMs on the primary datacenter
and set up your monitors so that the elected leader is on the primary
datacenter (if I remember correctly the leader is the monitor with the
lowest rank, which is derived from the monitors' IP:port addresses
rather than from their names). You'll have a copy of your data on the
secondary datacenter in case of a disaster on the primary, but
recovering will be hard: you'll have to reach a quorum of monitors in
the secondary datacenter and I'm not sure how to proceed if you only
have one monitor out of 3 there, for example.
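For concreteness, here's a minimal sketch of the crushmap rule such a
2+1 placement could use. The bucket names (dc1, dc2) and the ruleset
number are made up: it assumes you have declared two buckets of type
datacenter in your crush hierarchy and that the pool has size 3.

    # decompile the current crushmap, edit it, recompile and inject it
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # ... add the rule below to crush.txt ...
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new

    rule rbd_2plus1 {
            ruleset 1
            type replicated
            min_size 2
            max_size 3
            # 2 replicas on hosts of the primary datacenter
            step take dc1
            step chooseleaf firstn 2 type host
            step emit
            # 1 replica on a host of the secondary datacenter
            step take dc2
            step chooseleaf firstn 1 type host
            step emit
    }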
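The runtime side would then look roughly like this. The pool name
(vmpool) is made up, osd.6 to osd.8 stand for the OSDs of the secondary
datacenter, and you may need 'mon osd allow primary affinity = true' on
your monitors before the primary-affinity commands are accepted:

    ceph osd pool set vmpool size 3
    ceph osd pool set vmpool min_size 2
    ceph osd pool set vmpool crush_ruleset 1
    # primary affinity 0: these OSDs are never chosen as primaries, so
    # clients read from (and funnel writes through) the primary site
    ceph osd primary-affinity osd.6 0
    ceph osd primary-affinity osd.7 0
    ceph osd primary-affinity osd.8 0
    # check which monitor currently leads the quorum
    ceph quorum_status | grep quorum_leader_name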
> Of course, I'll have to be sure that no machine is running at the
> same time on both sites.

With your bandwidth and latency, and without knowing more about your
workloads, it's probable that running VMs on both sites will get you
very slow IOs. Multi-datacenter replication for simple object storage
using RGW seems to work, but RBD volume accesses are usually more
demanding.

> Hypervisor will be in charge of this.
>
> Is there a way to ask Ceph to keep at least one copy (or two) on
> each site and to make all block reads come from the nearest location?
> I'm aware that writes would have to be replicated and there's only a
> synchronous mode for this.
>
> I've read a lot of documentation and use cases about Ceph and it
> seems some say it can be used for this kind of replication and
> others say it can't. Whether erasure coding is needed isn't clear
> either.

Don't use erasure coding for RBD volumes: you'd need a cache tier on
top of the erasure-coded pool, it seems tricky to get right and it
might not be fully tested yet (I've seen a snapshot bug discussed here
last week).

Best regards,

Lionel
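PS: for the record, a cache tier in front of an erasure-coded pool is
set up roughly like this. All the names and parameters below are made
up, and as said above I wouldn't trust this for RBD volumes yet:

    # hypothetical profile/pool names and pg counts
    ceph osd erasure-code-profile set ecprofile k=2 m=1
    ceph osd pool create ecpool 128 128 erasure ecprofile
    ceph osd pool create cachepool 128
    ceph osd tier add ecpool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay ecpool cachepool
    # clients then target ecpool and their IOs are transparently
    # served through cachepool (you'd also have to tune hit_set_* and
    # target_max_* for flushing/eviction to work sensibly)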