Re: Multi-datacenter filesystem

On 5/13/22 10:42, Daniel Persson wrote:
Hi Team

We have grown out of our current solution, and we plan to migrate to
multiple data centers.

Will there be active Ceph users in each of the data centers? Or is it just for storage high availability and geo-redundancy, with Ceph users in only one data center?


Our setup is a mix of radosgw data and filesystem data. But we have many
legacy systems that require a filesystem at the moment, so we will probably
run it for some of our data for at least 3-5 years.

At the moment, we have about 0.5 Petabytes of data, so it is a small
cluster. Still, we want more redundancy, so we will partner with a company
with multiple data centers within the city and have redundant fiber between
the locations.

Our current center has multiple 10 Gb/s connections, so the communication
between the new locations and our existing data center will be slower.
Still, I hope the network traffic will suffice for a multi-datacenter setup.

I assume you hope that the network traffic will _not_ suffer from a multi-dc setup. What throughput and latency would you get between data centers?


Currently, I plan to assign OSDs to different sites and racks so we can
configure a good replication rule to keep a copy of the data in each data
center.

If you want to replicate across data centers (size=3, min_size=2), with the data center as the failure domain, this should be achievable with the following CRUSH rule:

ceph osd crush rule create-replicated {name} {root} datacenter [{class}]
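
As a concrete sketch (the bucket names dc1/dc2, the rack name rack1, and the pool name cephfs_data are placeholders; this assumes your hosts are already grouped under racks), the CRUSH hierarchy and rule could be set up like:

```shell
# Create datacenter buckets under the default root and move the racks into them:
ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket dc2 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move dc2 root=default
ceph osd crush move rack1 datacenter=dc1

# Replicated rule that places one copy per data center:
ceph osd crush rule create-replicated replicated_dc default datacenter

# Apply the rule to a pool and set the replication parameters:
ceph osd pool set cephfs_data crush_rule replicated_dc
ceph osd pool set cephfs_data size 3
ceph osd pool set cephfs_data min_size 2
```

With three data centers and size=3 this keeps exactly one replica per DC; min_size=2 keeps the pool writable if one DC is down.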



My question is how to handle the monitor setup for good redundancy. For
example, should I set up two new monitors in each new location and have one
in our existing data center, so I get five monitors in total, or should I
keep it as three monitors, one for each data center? Or should I go for
nine monitors, 3 in each data center?

Five monitors are really nice to have: you can lose one extra monitor compared to three. If you hit an issue with a monitor, you have to be able to fix it with the monitors that are still running, and any tuning or restarts on those remaining monitors in order to fix the problem will cause downtime. More than five monitors should normally not be needed, although Ceph users with very large clusters might run with more than five; I am not sure about that (thinking of the CERN clusters, STFC Echo).
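
With cephadm/the orchestrator, pinning five monitors to specific hosts and checking quorum could look like this (the host names are placeholders):

```shell
# Place one monitor on each of these five hosts, spread 2+2+1 over the DCs:
ceph orch apply mon --placement="mon-dc1-a mon-dc1-b mon-dc2-a mon-dc2-b mon-dc3-a"

# Verify that all five monitors have joined the quorum:
ceph quorum_status --format json-pretty
```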


Should I use a Stretch set up to define the location of each monitor?

Only if you want to do a dual data center setup (with 2 copies per DC). When you have more than 2 data centers, say 3, you don't need that. In a 3-DC setup with 5 monitors in total, you would obtain the highest availability with two data centers running 2 monitors each and one data center running 1 monitor.
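
If you did go the two-data-center route, a stretch mode sketch would be along these lines (monitor names a..e and the CRUSH rule name stretch_rule are placeholders; the fifth monitor acts as tiebreaker in a third location):

```shell
# Tell each monitor where it lives in the CRUSH hierarchy:
ceph mon set_location a datacenter=dc1
ceph mon set_location b datacenter=dc1
ceph mon set_location c datacenter=dc2
ceph mon set_location d datacenter=dc2
ceph mon set_location e datacenter=dc3   # tiebreaker monitor

# Enable stretch mode, naming the tiebreaker, the stretch CRUSH rule,
# and the dividing bucket type:
ceph mon enable_stretch_mode e stretch_rule datacenter
```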


Could you do the same for MDSes? Do I need to configure the mounting of the
filesystem differently to signal in which data center the client is located?

If I recall correctly, Dan from CERN has MDSes placed close to the clients, and that helped to improve performance. You would need multiple active MDSes and either have them balance the load between them or do (manual) directory pinning to pin certain users to a given MDS. There is a lot of communication between a CephFS client and the MDS, especially for metadata operations. Higher network latency might hurt there, so I guess it could be beneficial to optimize that. Depending on the workload, and on whether snapshots are used, there might be substantial internal MDS communication, defeating the purpose and/or consuming bandwidth that could otherwise have been used for OSD / client traffic.
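
A sketch of multiple active MDSes with manual directory pinning (the filesystem name cephfs, the mount point, and the directory names are placeholders):

```shell
# Allow two active MDS daemons for the filesystem:
ceph fs set cephfs max_mds 2

# Pin directory subtrees to specific MDS ranks; -v -1 would hand a
# subtree back to the default balancer:
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/legacy
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/archive
```

Pinning is set from a client with the filesystem mounted; clients in a given data center would then mostly talk to the MDS rank serving their subtree.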

To come back to my first question: if only storage is distributed between data centers but not Ceph users, then you might gain read performance by putting all primary OSDs in the data center where the Ceph users reside. You should be able to obtain that by adjusting primary-affinity [1]. All reads (by default) come from the primary OSDs, which would then all be located in the proximity of the Ceph users with 10 Gb/s connectivity. Writes still have to be acknowledged by the OSDs in the remote data centers, so I do not expect any gains there.
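
In practice that would mean lowering primary-affinity on the remote OSDs (the OSD IDs below are placeholders; 1.0 is the default, 0 means "avoid as primary whenever possible"):

```shell
# Inspect the CRUSH tree to see which OSD IDs sit in the remote DCs:
ceph osd tree

# Make the remote OSDs unlikely to be chosen as primary:
ceph osd primary-affinity osd.12 0
ceph osd primary-affinity osd.13 0
```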

So there are quite a few things you can take into consideration.

Gr. Stefan

[1]: https://docs.ceph.com/en/quincy/rados/operations/crush-map/#primary-affinity

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


