On Fri, Jan 11, 2019 at 10:07 PM Brian Topping <brian.topping@xxxxxxxxx> wrote:
Hi all,
I have a simple two-node Ceph cluster that I’m comfortable with the care and feeding of. Both nodes are in a single rack and are captured in the attached dump: two nodes, only one mon, and all pools size 2. Due to physical limitations, the primary location can’t grow past two nodes at the present time. As far as hardware goes, each node is an 18-core Xeon with 128GB RAM, and the two are connected with 10GbE.
My next goal is to add an offsite replica, and I would like to validate the plan I have in mind. For its part, the offsite replica can be considered read-only, except for the occasional snapshot taken in order to run backups to tape. The offsite location is connected with a reliable and secured ~350Kbps WAN link.
Unfortunately this is just not going to work. All writes to a Ceph OSD are replicated synchronously to every replica, all reads are served from the primary OSD for any given piece of data, and unless you do some hackery on your CRUSH map each of your 3 OSD nodes is going to be a primary for about 1/3 of the total data.
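If you did want to at least keep the primary role on the local nodes, one lever for that is primary affinity rather than CRUSH map edits. A minimal sketch, assuming the offsite OSDs came up as osd.8 through osd.11 (purely hypothetical IDs):

    # Hypothetical offsite OSD IDs. Affinity 0 means "avoid picking this OSD as
    # primary when another replica is available". The data is still written to it
    # synchronously, so this does nothing for the write path over the WAN.
    for id in 8 9 10 11; do
        ceph osd primary-affinity osd.$id 0
    done

That keeps reads and replication coordination local, but every write still has to be acknowledged by the offsite copy over the ~350Kbps link before the client sees it complete.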
If you want to move your data off-site asynchronously, there are various options for doing that in RBD (either periodic snapshots and export-diff, or by maintaining a journal and streaming it out) and RGW (with the multi-site stuff). But you're not going to be successful trying to stretch a Ceph cluster over that link.
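For the snapshot route, a minimal sketch of a daily cycle (pool, image, snapshot, and host names are all invented, and the very first transfer has to be a full rbd export / rbd import to seed the remote image):

    # Take a fresh snapshot and ship only the delta since yesterday's.
    rbd snap create rbd/myimage@backup-2019-01-12
    rbd export-diff --from-snap backup-2019-01-11 rbd/myimage@backup-2019-01-12 - |
        ssh backup-host rbd import-diff - rbd/myimage
    # Once the remote side has the new snapshot, the old one can be trimmed.
    rbd snap rm rbd/myimage@backup-2019-01-11

Keep in mind that 350Kbps works out to something like 3-4 GB per day even at full utilization, so the daily delta has to stay comfortably below that.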
-Greg
The following presuppositions bear challenge:
* There is only a single mon at the present time, which could be expanded to three with the offsite location. Two mons at the primary location obviously mean a lower MTBF than one, but with a third on the other side of the WAN, I could create resiliency against *either* a WAN failure *or* a single-node maintenance event. (A rough sketch of how I’d expect the mon addition to go follows this list.)
* Because there are two mons at the primary location and one at the offsite location, the degradation mode for a WAN loss (the most likely scenario, due to facility support) leaves the primary nodes maintaining quorum, which is desirable.
* It’s clear that a WAN failure and a mon failure at the primary location will halt cluster access.
* The CRUSH maps will be managed to reflect the topology change.
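To make the mon part concrete, this is roughly how I’d expect adding the offsite mon to go, with the hostname (offsite01) and address (10.0.1.1) invented for the example:

    # Run where the admin keyring is available, then copy the two files over:
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap
    # On the offsite host: build the mon's data dir and start it on its address.
    ceph-mon --mkfs -i offsite01 --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph-mon -i offsite01 --public-addr 10.0.1.1:6789
    # It should join and show up as the third mon:
    ceph quorum_status

With three mons, any two form a quorum, which is what the reasoning above relies on.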
If that’s a good capture so far, I’m comfortable with it. What I don’t understand is what to expect in actual use:
* Is the link speed asymmetry between the two primary nodes and the offsite node going to create significant risk or unexpected behaviors?
* Will the performance of the two primary nodes be limited to the speed at which the offsite mon can participate? Or will the primary mons correctly calculate that they have quorum and keep moving forward under normal operation?
* In the case of an extended WAN outage (and presuming full uptime on primary site mons), would return to full cluster health be simply a matter of time? Are there any limits on how long the WAN could be down if the other two maintain quorum?
I hope I’m asking the right questions here. Any feedback appreciated, including blogs and RTFM pointers.
Thanks for a great product!! I’m really excited for this next frontier!
Brian
> [root@gw01 ~]# ceph -s
> cluster:
> id: nnnn
> health: HEALTH_OK
>
> services:
> mon: 1 daemons, quorum gw01
> mgr: gw01(active)
> mds: cephfs-1/1/1 up {0=gw01=up:active}
> osd: 8 osds: 8 up, 8 in
>
> data:
> pools: 3 pools, 380 pgs
> objects: 172.9 k objects, 11 GiB
> usage: 30 GiB used, 5.8 TiB / 5.8 TiB avail
> pgs: 380 active+clean
>
> io:
> client: 612 KiB/s wr, 0 op/s rd, 50 op/s wr
>
> [root@gw01 ~]# ceph df
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 5.8 TiB 5.8 TiB 30 GiB 0.51
> POOLS:
> NAME ID USED %USED MAX AVAIL OBJECTS
> cephfs_metadata 2 264 MiB 0 2.7 TiB 1085
> cephfs_data 3 8.3 GiB 0.29 2.7 TiB 171283
> rbd 4 2.0 GiB 0.07 2.7 TiB 542
> [root@gw01 ~]# ceph osd tree
> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> -1 5.82153 root default
> -3 2.91077 host gw01
> 0 ssd 0.72769 osd.0 up 1.00000 1.00000
> 2 ssd 0.72769 osd.2 up 1.00000 1.00000
> 4 ssd 0.72769 osd.4 up 1.00000 1.00000
> 6 ssd 0.72769 osd.6 up 1.00000 1.00000
> -5 2.91077 host gw02
> 1 ssd 0.72769 osd.1 up 1.00000 1.00000
> 3 ssd 0.72769 osd.3 up 1.00000 1.00000
> 5 ssd 0.72769 osd.5 up 1.00000 1.00000
> 7 ssd 0.72769 osd.7 up 1.00000 1.00000
> [root@gw01 ~]# ceph osd df
> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
> 0 ssd 0.72769 1.00000 745 GiB 4.9 GiB 740 GiB 0.66 1.29 115
> 2 ssd 0.72769 1.00000 745 GiB 3.1 GiB 742 GiB 0.42 0.82 83
> 4 ssd 0.72769 1.00000 745 GiB 3.6 GiB 742 GiB 0.49 0.96 90
> 6 ssd 0.72769 1.00000 745 GiB 3.5 GiB 742 GiB 0.47 0.93 92
> 1 ssd 0.72769 1.00000 745 GiB 3.4 GiB 742 GiB 0.46 0.90 76
> 3 ssd 0.72769 1.00000 745 GiB 3.9 GiB 741 GiB 0.52 1.02 102
> 5 ssd 0.72769 1.00000 745 GiB 3.9 GiB 741 GiB 0.52 1.02 98
> 7 ssd 0.72769 1.00000 745 GiB 4.0 GiB 741 GiB 0.54 1.06 104
> TOTAL 5.8 TiB 30 GiB 5.8 TiB 0.51
> MIN/MAX VAR: 0.82/1.29 STDDEV: 0.07
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com