Re: Offsite replication scenario

Anthony Verevkin <anthony@xxxxxxxxxxx> · Wed, 16 Jan 2019 14:08:52 -0500 (EST)

I would definitely see huge value in going to 3 MONs here (and, btw, 2 on-site MGRs and 2 on-site MDSs).
However, 350 Kbps is quite low and MONs may be latency-sensitive, so I suggest heavy QoS if you want to use that link for ANYTHING else.
If you do so, make sure your clients list only the on-site MONs so they don't try to read from the off-site MON.
Still, you risk the stability of the cluster if the off-site MON starts lagging. If it is still considered up while lagging, all changes to the cluster (OSDs going up/down, etc.) would be blocked waiting for it to commit.

Even if you decide against an off-site MON, maybe consider 2 on-site MONs instead. Yes, you'd double the risk of the cluster grinding to a halt, since it would stop if either node dies rather than only if one specific node dies. But if that happens, you have a manual way of downgrading to a single MON (and you still have your MON's data), versus risking getting stuck with an OSD-only node that never had a MON installed and no copy of the MON DB.
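
To put rough numbers on that trade-off, here is a minimal sketch. It assumes every MON fails independently with the same probability (a simplification; 0.01 is purely illustrative and an off-site MON behind a WAN would likely behave differently) and that quorum needs a strict majority of MONs. The function name is made up for this example.

```python
from math import comb


def halt_probability(num_mons: int, p_fail: float) -> float:
    """Probability that fewer than a strict majority of MONs are alive.

    Assumes independent failures with the same probability p_fail per MON,
    which is a simplification made for illustration only.
    """
    majority = num_mons // 2 + 1
    # Sum the binomial probabilities of every "too few MONs alive" outcome.
    return sum(
        comb(num_mons, alive)
        * (1 - p_fail) ** alive
        * p_fail ** (num_mons - alive)
        for alive in range(majority)
    )


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} MON(s): halt probability {halt_probability(n, 0.01):.6f}")
```

Under these assumptions, two MONs roughly double the halt risk compared with one (about 0.0199 vs 0.01), while three MONs bring it down to about 0.0003.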

I also see how you want to get the data out for backups.
Having a third replica off-site definitely won't fly with such bandwidth, as it would once again block IO until each write is committed by the off-site OSD.
I am not quite sure RBD mirroring would play nicely with this kind of link either. Maybe stick with application-level off-site backups.
And again, whatever replication/backup strategy you choose, you need QoS, or else you'd cripple your connection, which I assume is used for other communications as well.
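
As a back-of-the-envelope check on the bandwidth concern, the sketch below uses the figures from the original post: the ~350 Kbps link, the 612 KiB/s client write rate and the 11 GiB of objects shown by `ceph -s`. It assumes "Kbps" means 1000 bits per second, treats the write rate as representative (it is only a single snapshot), and ignores protocol overhead and compression. The function names are made up for this example.

```python
LINK_BITS_PER_SEC = 350 * 1000           # ~350 Kbps WAN link (assumed kilobits/s)
CLIENT_WRITE_BYTES_PER_SEC = 612 * 1024  # 612 KiB/s client writes from `ceph -s`
STORED_BYTES = 11 * 1024**3              # 11 GiB of objects from `ceph -s`
SECONDS_PER_DAY = 86400


def oversubscription(write_bytes_per_sec: float, link_bits_per_sec: float) -> float:
    """How many times the write rate exceeds the link capacity."""
    return write_bytes_per_sec * 8 / link_bits_per_sec


def transfer_days(num_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to push num_bytes over a fully saturated link."""
    return num_bytes * 8 / link_bits_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    ratio = oversubscription(CLIENT_WRITE_BYTES_PER_SEC, LINK_BITS_PER_SEC)
    days = transfer_days(STORED_BYTES, LINK_BITS_PER_SEC)
    print(f"client writes are {ratio:.1f}x the link capacity")
    print(f"initial copy of current data: {days:.1f} days at full link speed")
```

With these inputs the observed write rate is roughly 14 times what the link can carry, and just the initial copy of the existing data would take about three days with the link fully saturated.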

... or, totally unrelated to Ceph, maybe just back up to a local USB drive and have somebody swap it and ship it to the head office once in a while?

Regards,
Anthony

----- Original Message -----
From: "Brian Topping" <brian.topping@xxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Sent: Saturday, January 12, 2019 1:07:05 AM
Subject:  Offsite replication scenario

Hi all,

I have a simple two-node Ceph cluster that I’m comfortable with the care and feeding of. Both nodes are in a single rack and are captured in the attached dump: two nodes, only one mon, all pools size 2. Due to physical limitations, the primary location can’t move past two nodes at the present time. As for hardware, those two nodes are 18-core Xeons with 128GB RAM, connected with 10GbE.

My next goal is to add an offsite replica and would like to validate the plan I have in mind. For its part, the offsite replica can be considered read-only except for the occasional snapshot in order to run backups to tape. The offsite location is connected with a reliable and secured ~350Kbps WAN link.

The following presuppositions bear challenge:

* There is only a single mon at the present time, which could be expanded to three with the offsite location. Two mons at the primary location obviously have a lower MTBF than one, but with a third one on the other side of the WAN, I could create resiliency against *either* a WAN failure or a single node maintenance event.
* Because there are two mons at the primary location and one at the offsite, the degradation mode for a WAN loss (most likely scenario due to facility support) leaves the primary nodes maintaining the quorum, which is desirable. 
* It’s clear that a WAN failure and a mon failure at the primary location will halt cluster access.
* The CRUSH maps will be managed to reflect the topology change.
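
The failure cases in the list above can be sketched as a small check. This is only an illustration of strict-majority quorum for a layout with two mons at the primary site and one offsite; the mon names and the function are hypothetical and do not come from the cluster dump.

```python
ON_SITE = {"mon.gw01", "mon.gw02"}  # hypothetical names for the two primary-site mons
OFF_SITE = {"mon.offsite"}          # hypothetical name for the remote mon
ALL_MONS = ON_SITE | OFF_SITE


def primary_site_has_quorum(failed_mons: set, wan_up: bool) -> bool:
    """True if the mons reachable from the primary site form a strict majority."""
    reachable = set(ON_SITE)
    if wan_up:
        reachable |= OFF_SITE
    alive = reachable - failed_mons
    return len(alive) > len(ALL_MONS) // 2


if __name__ == "__main__":
    scenarios = {
        "all healthy": (set(), True),
        "WAN down": (set(), False),
        "one primary node down": ({"mon.gw01"}, True),
        "WAN down and one primary node down": ({"mon.gw01"}, False),
    }
    for name, (failed, wan_up) in scenarios.items():
        state = "quorum" if primary_site_has_quorum(failed, wan_up) else "NO quorum"
        print(f"{name}: {state}")
```

Only the last scenario, a WAN failure combined with the loss of a primary-site mon, leaves the primary site without quorum, which matches the third bullet above.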

If that’s a good capture so far, I’m comfortable with it. What I don’t understand is what to expect in actual use:

* Is the link speed asymmetry between the two primary nodes and the offsite node going to create significant risk or unexpected behaviors?
* Will the performance of the two primary nodes be limited to the speed at which the offsite mon can participate? Or will the primary mons correctly calculate they have quorum and keep moving forward under normal operation?
* In the case of an extended WAN outage (and presuming full uptime on primary site mons), would return to full cluster health be simply a matter of time? Are there any limits on how long the WAN could be down if the other two maintain quorum?

I hope I’m asking the right questions here. Any feedback appreciated, including blogs and RTFM pointers.

Thanks for a great product!! I’m really excited for this next frontier!

Brian

> [root@gw01 ~]# ceph -s
>  cluster:
>    id:     nnnn
>    health: HEALTH_OK
> 
>  services:
>    mon: 1 daemons, quorum gw01
>    mgr: gw01(active)
>    mds: cephfs-1/1/1 up  {0=gw01=up:active}
>    osd: 8 osds: 8 up, 8 in
> 
>  data:
>    pools:   3 pools, 380 pgs
>    objects: 172.9 k objects, 11 GiB
>    usage:   30 GiB used, 5.8 TiB / 5.8 TiB avail
>    pgs:     380 active+clean
> 
>  io:
>    client:   612 KiB/s wr, 0 op/s rd, 50 op/s wr
> 
> [root@gw01 ~]# ceph df
> GLOBAL:
>    SIZE        AVAIL       RAW USED     %RAW USED 
>    5.8 TiB     5.8 TiB       30 GiB          0.51 
> POOLS:
>    NAME                ID     USED        %USED     MAX AVAIL     OBJECTS 
>    cephfs_metadata     2      264 MiB         0       2.7 TiB        1085 
>    cephfs_data         3      8.3 GiB      0.29       2.7 TiB      171283 
>    rbd                 4      2.0 GiB      0.07       2.7 TiB         542 
> [root@gw01 ~]# ceph osd tree
> ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF 
> -1       5.82153 root default                          
> -3       2.91077     host gw01                         
> 0   ssd 0.72769         osd.0     up  1.00000 1.00000 
> 2   ssd 0.72769         osd.2     up  1.00000 1.00000 
> 4   ssd 0.72769         osd.4     up  1.00000 1.00000 
> 6   ssd 0.72769         osd.6     up  1.00000 1.00000 
> -5       2.91077     host gw02                         
> 1   ssd 0.72769         osd.1     up  1.00000 1.00000 
> 3   ssd 0.72769         osd.3     up  1.00000 1.00000 
> 5   ssd 0.72769         osd.5     up  1.00000 1.00000 
> 7   ssd 0.72769         osd.7     up  1.00000 1.00000 
> [root@gw01 ~]# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR  PGS 
> 0   ssd 0.72769  1.00000 745 GiB 4.9 GiB 740 GiB 0.66 1.29 115 
> 2   ssd 0.72769  1.00000 745 GiB 3.1 GiB 742 GiB 0.42 0.82  83 
> 4   ssd 0.72769  1.00000 745 GiB 3.6 GiB 742 GiB 0.49 0.96  90 
> 6   ssd 0.72769  1.00000 745 GiB 3.5 GiB 742 GiB 0.47 0.93  92 
> 1   ssd 0.72769  1.00000 745 GiB 3.4 GiB 742 GiB 0.46 0.90  76 
> 3   ssd 0.72769  1.00000 745 GiB 3.9 GiB 741 GiB 0.52 1.02 102 
> 5   ssd 0.72769  1.00000 745 GiB 3.9 GiB 741 GiB 0.52 1.02  98 
> 7   ssd 0.72769  1.00000 745 GiB 4.0 GiB 741 GiB 0.54 1.06 104 
>                    TOTAL 5.8 TiB  30 GiB 5.8 TiB 0.51          
> MIN/MAX VAR: 0.82/1.29  STDDEV: 0.07
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com