On Fri, Jan 11, 2019 at 10:07 PM Brian Topping <brian.topping@xxxxxxxxx> wrote:
Hi all,
I have a simple two-node Ceph cluster that I’m comfortable with the care and feeding of. Both nodes are in a single rack and are captured in the attached dump: two nodes, only one mon, and all pools size 2. Due to physical limitations, the primary location can’t grow past two nodes at the present time. As far as hardware goes, each node is an 18-core Xeon with 128GB RAM, and the two are connected with 10GbE.
My next goal is to add an offsite replica, and I would like to validate the plan I have in mind. For its part, the offsite replica can be considered read-only, except for the occasional snapshot taken in order to run backups to tape. The offsite location is connected with a reliable and secured ~350Kbps WAN link.
Unfortunately this is just not going to work. All writes to a Ceph OSD are replicated synchronously to every replica, all reads are served from the primary OSD for any given piece of data, and unless you do some hackery on your CRUSH map each of your 3 OSD nodes is going to be a primary for about 1/3 of the total data.
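If you did want to at least keep the primary role on the local nodes, one lever for that is primary affinity rather than CRUSH map edits. A minimal sketch, assuming the offsite OSDs came up as osd.8 through osd.11 (purely hypothetical IDs):

    # Hypothetical offsite OSD IDs. Affinity 0 means "avoid picking this OSD as
    # primary when another replica is available". The data is still written to it
    # synchronously, so this does nothing for the write path over the WAN.
    for id in 8 9 10 11; do
        ceph osd primary-affinity osd.$id 0
    done

That keeps reads and replication coordination local, but every write still has to be acknowledged by the offsite copy over the ~350Kbps link before the client sees it complete.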
If you want to move your data off-site asynchronously, there are various options for doing that in RBD (either periodic snapshots and export-diff, or by maintaining a journal and streaming it out) and RGW (with the multi-site stuff). But you're not going to be successful trying to stretch a Ceph cluster over that link.
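For the snapshot route, a minimal sketch of a daily cycle (pool, image, snapshot, and host names are all invented, and the very first transfer has to be a full rbd export / rbd import to seed the remote image):

    # Take a fresh snapshot and ship only the delta since yesterday's.
    rbd snap create rbd/myimage@backup-2019-01-12
    rbd export-diff --from-snap backup-2019-01-11 rbd/myimage@backup-2019-01-12 - |
        ssh backup-host rbd import-diff - rbd/myimage
    # Once the remote side has the new snapshot, the old one can be trimmed.
    rbd snap rm rbd/myimage@backup-2019-01-11

Keep in mind that 350Kbps works out to something like 3-4 GB per day even at full utilization, so the daily delta has to stay comfortably below that.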
-Greg
The following presuppositions bear challenge:
* There is only a single mon at the present time, which could be expanded to three with the offsite location. Two mons at the primary location obviously mean a lower MTBF than one, but with a third on the other side of the WAN, I could create resiliency against *either* a WAN failure *or* a single-node maintenance event. (A rough sketch of how I’d expect the mon addition to go follows this list.)
* Because there are two mons at the primary location and one at the offsite location, the degradation mode for a WAN loss (the most likely scenario, due to facility support) leaves the primary nodes maintaining quorum, which is desirable.
* It’s clear that a WAN failure and a mon failure at the primary location will halt cluster access.
* The CRUSH maps will be managed to reflect the topology change.
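To make the mon part concrete, this is roughly how I’d expect adding the offsite mon to go, with the hostname (offsite01) and address (10.0.1.1) invented for the example:

    # Run where the admin keyring is available, then copy the two files over:
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap
    # On the offsite host: build the mon's data dir and start it on its address.
    ceph-mon --mkfs -i offsite01 --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph-mon -i offsite01 --public-addr 10.0.1.1:6789
    # It should join and show up as the third mon:
    ceph quorum_status

With three mons, any two form a quorum, which is what the reasoning above relies on.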
If that’s a good capture so far, I’m comfortable with it. What I don’t understand is what to expect in actual use:
* Is the link speed asymmetry between the two primary nodes and the offsite node going to create significant risk or unexpected behaviors?
* Will the performance of the two primary nodes be limited to the speed at which the offsite mon can participate? Or will the primary mons correctly calculate that they have quorum and keep moving forward under normal operation?
* In the case of an extended WAN outage (and presuming full uptime on primary site mons), would return to full cluster health be simply a matter of time? Are there any limits on how long the WAN could be down if the other two maintain quorum?
I hope I’m asking the right questions here. Any feedback appreciated, including blogs and RTFM pointers.
Thanks for a great product!! I’m really excited for this next frontier!
Brian
> [root@gw01 ~]# ceph -s
> cluster:
> id: nnnn
> health: HEALTH_OK
>
> services:
> mon: 1 daemons, quorum gw01
> mgr: gw01(active)
> mds: cephfs-1/1/1 up {0=gw01=up:active}
> osd: 8 osds: 8 up, 8 in
>
> data:
> pools: 3 pools, 380 pgs
> objects: 172.9 k objects, 11 GiB
> usage: 30 GiB used, 5.8 TiB / 5.8 TiB avail
> pgs: 380 active+clean
>
> io:
> client: 612 KiB/s wr, 0 op/s rd, 50 op/s wr
>
> [root@gw01 ~]# ceph df
> GLOBAL:
> SIZE AVAIL RAW USED %RAW USED
> 5.8 TiB 5.8 TiB 30 GiB 0.51
> POOLS:
> NAME ID USED %USED MAX AVAIL OBJECTS
> cephfs_metadata 2 264 MiB 0 2.7 TiB 1085
> cephfs_data 3 8.3 GiB 0.29 2.7 TiB 171283
> rbd 4 2.0 GiB 0.07 2.7 TiB 542
> [root@gw01 ~]# ceph osd tree
> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> -1 5.82153 root default
> -3 2.91077 host gw01
> 0 ssd 0.72769 osd.0 up 1.00000 1.00000
> 2 ssd 0.72769 osd.2 up 1.00000 1.00000
> 4 ssd 0.72769 osd.4 up 1.00000 1.00000
> 6 ssd 0.72769 osd.6 up 1.00000 1.00000
> -5 2.91077 host gw02
> 1 ssd 0.72769 osd.1 up 1.00000 1.00000
> 3 ssd 0.72769 osd.3 up 1.00000 1.00000
> 5 ssd 0.72769 osd.5 up 1.00000 1.00000
> 7 ssd 0.72769 osd.7 up 1.00000 1.00000
> [root@gw01 ~]# ceph osd df
> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
> 0 ssd 0.72769 1.00000 745 GiB 4.9 GiB 740 GiB 0.66 1.29 115
> 2 ssd 0.72769 1.00000 745 GiB 3.1 GiB 742 GiB 0.42 0.82 83
> 4 ssd 0.72769 1.00000 745 GiB 3.6 GiB 742 GiB 0.49 0.96 90
> 6 ssd 0.72769 1.00000 745 GiB 3.5 GiB 742 GiB 0.47 0.93 92
> 1 ssd 0.72769 1.00000 745 GiB 3.4 GiB 742 GiB 0.46 0.90 76
> 3 ssd 0.72769 1.00000 745 GiB 3.9 GiB 741 GiB 0.52 1.02 102
> 5 ssd 0.72769 1.00000 745 GiB 3.9 GiB 741 GiB 0.52 1.02 98
> 7 ssd 0.72769 1.00000 745 GiB 4.0 GiB 741 GiB 0.54 1.06 104
> TOTAL 5.8 TiB 30 GiB 5.8 TiB 0.51
> MIN/MAX VAR: 0.82/1.29 STDDEV: 0.07
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com