Wow.. thanks for such a detailed reply!
On 10 August 2018 at 07:08, Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
The default ceph setup uses 3 replicas on three different hosts, so you need at least three hosts for a ceph cluster. Other configurations with a smaller number of hosts are possible, but not recommended. Depending on the workload and access pattern you can also store your files on an EC pool, which might improve the available capacity.
We're starting with five large file servers, so the running cluster, once converted, should be fine. However, it sounds like I might have problems with the migration path. More below.
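(Noting for our own runbook: creating a plain pool with the default 3 copies looks roughly like the following, if I'm reading the docs right; the pool name and PG count are just placeholder values.)

    # create a replicated pool; 128 placement groups is only an example value
    ceph osd pool create mydata 128 128 replicated
    # confirm the pool keeps 3 copies of each object
    ceph osd pool get mydata size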
The default ceph setup manages each disk separately using an OSD daemon. These daemons consume RAM. With 20 disks in a host, you also need a fair amount of RAM. The latest OSD implementation (bluestore) does not use the kernel page cache, so each OSD will consume RAM independently of the others. With 20 disks per host you need at least 64 GB RAM for a sane setup (according to my gut; others may have better numbers). More RAM is always desirable.
I think we're fine here. ZFS is a large memory consumer as well, and we're already set up for that.
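(One note-to-self on RAM budgeting: BlueStore's per-OSD memory can apparently be capped in ceph.conf. The option name depends on the release, and the values below are only a guess for illustration.)

    [osd]
    # rough per-OSD memory cap on newer releases
    osd memory target = 3221225472
    # older BlueStore releases size the cache directly instead, e.g.:
    #bluestore cache size hdd = 1073741824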
Part of the OSD information is stored in a key-value database. It is highly advisable to put these databases on SSDs. You can share one SSD between several OSDs (e.g. by creating partitions), but keep in mind that the failure of one of these SSDs also renders the content of every OSD using it useless. Do not use consumer grade SSDs. There are many discussions on the mailing list about SSDs, just search the archive.
You're referring to the journal, here? Yes, I'd read the Hardware Recommendations document that suggests that. It doesn't seem to suggest that partitioning the SSD is necessary, though it's possible if desired. I haven't (yet) found any recommendations on sizing such an SSD, and I wonder if I can take that to mean that the journal is so small that size is rarely a concern.
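(For our runbook, the per-OSD DB device seems to be attached when the OSD is created, roughly like this; the device names are obviously just placeholders.)

    # create a BlueStore OSD on a spinning disk, with its DB/WAL on an SSD partition
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1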
In addition to the OSD daemons a ceph cluster also requires monitor daemons ("mon"). They manage all the meta information needed to operate the cluster (hosts running mons, hosts running osds, pool definitions, authentication information and other stuff). If you lose your mons, you will have a hard time recovering your cluster. You need an odd number of mons for a sane quorum setup, so three mons are advisable. Do not run a production cluster with one or two mons. The mons also need a small chunk of fast, reliable storage, so use enterprise grade SSDs for this. Some people also advise not to run mons colocated with osds on the same host; we run them colocated, on a software RAID 1 over two NVMe devices (that are also used for our OSDs). Depending on your available hardware you might want to put at least one mon on a small dedicated host.
Yes, this is one area where the deployment plan is to start by colocating the mons and improve the situation over time, moving to dedicated mons as we do hardware refreshes. The mons would likely share the SSDs with the OSD journal at first.
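(Another note-to-self: once the mons are up, quorum can be checked with something like the following.)

    # show the monitor map and which mons are currently in quorum
    ceph mon stat
    # more detail, including the election epoch and quorum members
    ceph quorum_status --format json-pretty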
If you want to use CephFS, you also need at least one metadata server ("mds"). For high availability a second standby instance is recommended. For large filesystems with a large number of files (or a large number of metadata operations per second), you can also have several active mds servers, as long as you have at least one standby instance. The mds servers consume a large amount of RAM depending on their configuration and the number of files open simultaneously. Our mds servers use about 7 GB RAM with ~2 million cached inodes and ~1 million capabilities (ownership/locking information). Again, it might not be advisable to run mds servers colocated with osds.
Indeed, especially in our initial configuration the memory requirements of the OSDs are so high that I can't imagine we'd be able to run the MDSs on the same hosts.
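(For sizing those dedicated MDS hosts, the cache, and therefore most of the RAM use, looks to be bounded by a single option; the 8 GB value below is just a guess for illustration.)

    [mds]
    # upper bound on the MDS cache; the actual process footprint will be somewhat higher
    mds cache memory limit = 8589934592

    # CLI check for active/standby daemons and cached inode counts at runtime
    ceph fs status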
A low latency network is highly recommended between ceph cluster hosts and the ceph clients. Use 10 GBit or better; 1 GBit works, but the performance will be bad. Also consider redundancy (e.g. LACP bonding of links). The ceph cluster can use a public network for client access and an internal network for the replication between osd hosts. Unless you have special requirements I would recommend not using a separate internal network due to the higher configuration effort. All ceph hosts must be able to contact each other, and all ceph clients must be able to contact the ceph hosts. As an example, reading a file from a CephFS filesystem requires contacting a mon to retrieve the current mds, osd and pool information (once during mount), contacting the mds to retrieve the metadata for the file, and finally getting the data from the osd hosts.
The file servers and their client machines are on a shared 10G network already, so I think we're good there.
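(If we ever do revisit a separate replication network, it appears to be just two ceph.conf settings; the subnets below are placeholders.)

    [global]
    # network the clients use to reach mons, osds and mds
    public network = 10.0.0.0/24
    # optional separate network for OSD replication and recovery traffic
    cluster network = 10.0.1.0/24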
CephFS also does not perform any mds-side authorization based on unix permissions (AFAIK). Access to the mds (and thus the filesystem) is controlled by a shared ceph user secret. You do not have the ability to use kerberos or server-side unix group permission checks. And you need to trust the clients, since you need to store the ceph user secret on the client.
I think we're probably okay here. Reads and writes are split across separate machines, and the ones that read have NFS mounts set read-only. We don't currently have a requirement to prevent some users on a read host from reading certain data. But, let me speculate for a moment on some hypothetical future where we've got different access requirements for different data sets. If we can't restrict access to files by unix user/group permissions, would it make sense to run multiple clusters on the same hardware, having some OSDs on a host participate in one cluster while other OSDs participate in a second, and share them out as separate CephFS mounts? Access could be controlled above the mount point in the filesystem that way.
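(While digging around on this, it looks like individual client keys can at least be scoped read-only, or to a sub-directory, without separate clusters; the filesystem and client names below are placeholders, and this is just my reading of the docs.)

    # key that can only read the whole filesystem
    ceph fs authorize cephfs client.reader / r
    # key restricted to read/write within one sub-tree
    ceph fs authorize cephfs client.teama /teama rw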
As mentioned above, you need at least three hosts in an initial setup and probably some hardware upgrades (RAM, SSDs). Do not be misled by the possibility of setting up a single host cluster; I wouldn't consider such a setup even temporarily for migration purposes. It's an invitation for Murphy to strike...
If you cannot free three hosts completely, you can also run both setups side by side. Start with a small number of disks you can remove from the raid setups on each host, convert them to ceph osds, migrate data between the filesystems, and proceed with more disks until all data is migrated. One important aspect you need to consider is the fact that you cannot change the number of coding/data chunks in an EC pool. If you want to use an EC pool for the filesystem, you need to create it with the number of coding/data chunks you want to have in the final setup.
This sounds like a major roadblock... or at least a delay. Shrinking volumes in ZFS is not (yet?) possible except by destroying the filesystem and rebuilding. I'm pretty sure the MD/LVM2 RAID has the same limitation. So, in order to run a migration side by side I'd need to be able to empty out an incredibly large volume of data anyway. I haven't worked through the whole procedure yet, but my gut feeling is that I'd still need to empty out the equivalent of two file servers to make it work. I can get one empty during our regular annual disk refresh, but two would require eliminating cross-chassis file duplication "temporarily" while data is moved around. Considering that it takes weeks to rebalance that much data, this wouldn't be a popular idea.
I can picture being able to work around this if we replace any single file server with a set of smaller file servers, so perhaps we can make it work in a future hardware refresh.
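(On the fixed chunk counts: if we do end up on an EC pool, the profile would have to be created with the final k/m from day one, roughly like this. The profile name, k/m values and PG count are placeholders.)

    # define the final erasure-code profile before creating the pool
    ceph osd erasure-code-profile set ec-final k=4 m=2 crush-failure-domain=host
    # create the CephFS data pool against that profile
    ceph osd pool create cephfs_data 256 256 erasure ec-final
    # bluestore-only: CephFS needs overwrites enabled on an EC data pool
    ceph osd pool set cephfs_data allow_ec_overwrites true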
I'm skipping over the rest of your email because I don't see anything there that's of concern.
Thanks a ton for your reply and the time you put into it! That was incredibly helpful in planning a safe path forward.