Hi,
just some thoughts and comments:
Hardware:
The default ceph setup uses three replicas on three different hosts,
so you need at least three hosts for a ceph cluster. Other
configurations with a smaller number of hosts are possible, but not
recommended. Depending on the workload and access pattern you can also
store your files in an EC (erasure coded) pool, which might improve
the available capacity.
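
To illustrate the capacity difference, a quick sketch in Python (plain
arithmetic; the 4+2 profile is just an example, not a recommendation):

    # Usable fraction of the raw capacity, example values only.
    def replicated_usable(replicas):
        return 1.0 / replicas

    def ec_usable(k, m):
        # k data chunks + m coding chunks per object
        return k / (k + m)

    print(replicated_usable(3))  # 3 replicas -> ~0.33 of the raw capacity
    print(ec_usable(4, 2))       # EC 4+2     -> ~0.67 of the raw capacity
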
You also need to consider failure scenarios. With three hosts and a
default replicated pool (three replicas) or the smallest EC pool (2+1),
the failure of any host will put your cluster into an undesired state,
since it is not able to recover without a third active host to
replicate data to. The minimum number of hosts is thus (size of the
pool) + (number of allowed failed hosts). To be somewhat on the safe
side, you need at least 4 hosts. More hosts will improve bandwidth,
IOPS and reliability.
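
The host count rule from above as a small sanity check (sketch; plug
in your own pool parameters):

    # Minimum hosts = pool size (replicas or k+m) + allowed failed hosts.
    def min_hosts(pool_size, allowed_failed_hosts=1):
        return pool_size + allowed_failed_hosts

    print(min_hosts(3))      # replicated, size 3 -> 4 hosts
    print(min_hosts(2 + 1))  # EC 2+1             -> 4 hosts
    print(min_hosts(4 + 2))  # EC 4+2             -> 7 hosts
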
The default ceph setup manages each disk separately using an OSD
daemon. These daemons consume RAM. With 20 disks in a host, you also
need a fair amount of RAM. The latest OSD implementation (bluestore)
does not use the kernel page cache, so each OSD will consume RAM
independently from the others. With 20 disks per host you need at
least 64 GB RAM for a sane setup (according to my gut feeling, others
may have better numbers). More RAM is always desirable.
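
A rough way to estimate the RAM demand (sketch only; the per-OSD
figure is an assumption, check the osd_memory_target setting of your
release):

    # Very rough RAM estimate per host; per_osd_gib is an assumed
    # bluestore memory target, not a measured value.
    def host_ram_gib(num_osds, per_osd_gib=3, os_overhead_gib=4):
        return num_osds * per_osd_gib + os_overhead_gib

    print(host_ram_gib(20))                 # 64 GiB for 20 OSDs
    print(host_ram_gib(20, per_osd_gib=4))  # 84 GiB with more cache headroom
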
Part of the OSD information is stored in a key-value database. It is
highly advisable to put these databases on SSDs. You can share one SSD
between several OSDs (e.g. by creating partitions), but keep in mind
that the failure of such an SSD also renders the content of every OSD
using it useless. Do not use consumer grade SSDs. There are many
discussions about suitable SSDs on the mailing list, just search the
archive.
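
The blast radius of a shared DB SSD is easy to underestimate; a quick
sketch with assumed numbers:

    # How much of a host is lost when one shared DB SSD fails?
    osds_per_host = 20
    osds_per_db_ssd = 5                     # assumed ratio (4 DB SSDs/host)

    print(osds_per_db_ssd / osds_per_host)  # 0.25 -> a quarter of the OSDs
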
In addition to the OSD daemons a ceph cluster also requires monitor
daemons ("mon"). They manage all the metadata needed to operate the
cluster (hosts running mons, hosts running osds, pool definitions,
authentication information and other stuff). If you lose your mons,
you will have a hard time recovering your cluster. You need an odd
number of mons for a sane quorum setup, so three mons are advisable.
Do not run a production cluster with one or two mons. The mons also
need a small chunk of fast, reliable storage, so use enterprise grade
SSDs for this, too. Some people also advise not to run mons colocated
with osds on the same host; we run them colocated, on a software
raid 1 over two NVMe devices (that are also used for our OSDs).
Depending on your available hardware you might want to put at least
one mon on a small dedicated host.
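
The quorum argument in numbers (standard majority quorum arithmetic):

    # How many mon failures can a given number of mons survive?
    def tolerated_mon_failures(num_mons):
        quorum = num_mons // 2 + 1      # majority
        return num_mons - quorum

    for n in (1, 2, 3, 5):
        print(n, "mons ->", tolerated_mon_failures(n), "failure(s) tolerated")
    # 1 or 2 mons -> 0, 3 mons -> 1, 5 mons -> 2
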
If you want to use CephFS, you also need at least one metadata server
("mds"). For high availability a second standby instance is recommended.
For large filesystems with a large number of files (or a large number of
meta operations per second), you can also have several active mds
servers, as long as you have at least one standby instance. The mds
servers consume a large amount of RAM depending on their configuration
and the number of files open simultaneously. Our mds servers use about 7
GB RAM with ~ 2 million cached inodes and ~ 1 million capabilities
(ownership/locking information). Again, it might not be advisable to
run mds servers colocated with osds.
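
To get a feeling for mds memory sizing, a back-of-the-envelope
extrapolation from our numbers above (very rough; the overhead per
inode differs between versions and workloads, and the expected inode
count is an assumption):

    # Extrapolate mds RAM from the observed figures above:
    # ~7 GB for ~2 million cached inodes (plus ~1 million capabilities).
    gb_per_million_inodes = 7 / 2.0

    expected_cached_inodes_millions = 10     # assumed workload
    print(expected_cached_inodes_millions * gb_per_million_inodes)  # ~35 GB
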
A low-latency network is highly recommended between the ceph cluster
hosts and the ceph clients. Use 10 GBit or better; 1 GBit works, but
the performance will be bad. Also consider redundancy (e.g. LACP
bonding of links). The ceph cluster can use a public network for
client access and an internal network for the replication between osd
hosts. Unless you have special requirements I would recommend not to
use a separate internal network due to the higher configuration
effort. All ceph hosts
must be able to contact each other, and all ceph clients must be able to
contact the ceph hosts. As an example, reading a file from a CephFS
filesystem requires contacting a mon to retrieve the current mds, osd
and pool information (once during mount), contacting the mds to retrieve
the metadata for the file, and finally getting the data from the osd hosts.
Comparison to NFS:
There are a number of important differences to a standard NFS setup.
CephFS uses POSIX semantics, so file locks are enforced. Every file
access results in a roundtrip to the mds first to acquire the
"capabilities" to access the file. If the file is currently in use by
another client, the mds might contact that client and ask it to release
its capabilities (e.g. after the file was closed, but is still present
in the page cache). Applications relying on the less stringent NFS
semantics might therefore see a severe performance impact.
CephFS also does not perform any mds-side authorization based on unix
permissions (AFAIK). Access to the mds (and thus the filesystem) is
controlled by a shared ceph user secret. You do not have the ability
to use kerberos or server-side unix group permission checks. And you
need to trust the clients, since the ceph user secret has to be stored
on them.
You can export a CephFS filesystem via NFS, either by re-exporting a
mountpoint or by using NFS-Ganesha with its native CephFS support. But
the NFS server then becomes a bottleneck and a single point of failure.
Mounting CephFS directly on the clients is the recommended setup.
Migration:
As mentioned above, you need at least three hosts in an initial setup
and probably some hardware upgrades (RAM, SSDs). Do not be misled by
the possibility to set up a single host cluster; I would not consider
such a setup even temporarily for migration purposes. It is an
invitation for Murphy to strike...
If you cannot free three hosts completely, you can also run both
setups side by side. Start with a small number of disks you can remove
from the raid setups on each host, convert them to ceph osds, migrate
data between the filesystems, and proceed with more disks until all
data is migrated. One important aspect you need to consider is the
fact that you cannot change the number of coding/data chunks in an EC
pool. If you want to use an EC pool for the filesystem, you need to
create it with the number of coding/data chunks you want to have in
the final setup.
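
A small check that the EC profile you create at migration time still
fits the final cluster (failure domain host; the numbers are examples):

    # Does an EC profile fit the final cluster (failure domain = host)?
    def ec_profile_fits(k, m, final_hosts, allowed_failed_hosts=1):
        return final_hosts >= k + m + allowed_failed_hosts

    print(ec_profile_fits(2, 1, final_hosts=4))  # True, but little margin
    print(ec_profile_fits(4, 2, final_hosts=6))  # False, no host left to recover to
    print(ec_profile_fits(4, 2, final_hosts=8))  # True
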
Speed:
A file is automatically split up into chunks of up to 4 MB size; each
chunk is mapped to a placement group in a cephfs data pool; a placement
group is mapped to a configurable number of OSDs (e.g. three OSDs on
three different hosts for a default replicated pool). One instance of
the placement group is the primary one; all IO operations are sent to
the OSD having this instance. In case of a write operation, the OSD will
pass the operation to the other OSDs involved; in case of a read
operation it either reads the data from disks and sends it back to the
client (replicated pool), or collects all data chunks from the other
OSDs, merges them and sends the data back to the client (EC pool). In
the end, you read the data from a single disk. You can expect the
performance of a single disk, which is usually worse than a full raid
array. Ceph is not tuned for fast single IO operations, but it scales
well with the number of ceph cluster hosts and ceph clients. Whether it
is fast enough depends on your workload and access pattern. Large files
may also benefit from a large readahead setting, resulting in parallel
access to multiple OSD hosts.
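
To illustrate why a large readahead helps with large files, a small
sketch of the 4 MB striping (readahead sizes are just example values):

    # How many 4 MB objects (and thus potentially different OSDs) does one
    # readahead window touch?
    OBJECT_SIZE = 4 * 1024 * 1024               # 4 MB, as described above

    def objects_in_window(readahead_bytes):
        return -(-readahead_bytes // OBJECT_SIZE)   # ceiling division

    print(objects_in_window(128 * 1024))        # 128 KB readahead -> 1 object
    print(objects_in_window(64 * 1024 * 1024))  # 64 MB readahead  -> 16 objects
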
Reliability:
This is what ceph is designed for. A sane ceph setup (replicated pool
with three replicas, HA mons, HA mds) is almost indestructible. But
you need good monitoring, and you have to follow certain rules, like
always keeping enough free capacity for failure recovery.
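
The free capacity rule in numbers (sketch; the fill target after
recovery is an assumption, compare it with your cluster's
nearfull/full ratios):

    # Maximum utilisation so the data of one failed host still fits on the
    # remaining hosts without exceeding the fill target.
    def max_safe_utilisation(num_hosts, fill_target_after_recovery=0.85):
        return fill_target_after_recovery * (num_hosts - 1) / num_hosts

    print(max_safe_utilisation(4))   # ~0.64 -> keep roughly a third free
    print(max_safe_utilisation(10))  # ~0.77 -> more hosts waste less space
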
TL;DR:
If you have enough hosts and the correct hardware configuration for
the hosts, and you are able to either convert complete hosts or
individual disks to ceph, you should be able to migrate. Whether it is
worth the effort depends on your workload and IO requirements. I would
highly recommend setting up a test cluster beforehand to get used to
ceph configuration and operations.
Regards,
Burkhard