Re: Hardware needs for MDS for HPC/OpenStack workloads?

On 2020-10-22 14:34, Matthew Vernon wrote:
> Hi,
> 
> We're considering the merits of enabling CephFS for our main Ceph
> cluster (which provides object storage for OpenStack), and one of the
> obvious questions is what sort of hardware we would need for the MDSs
> (and how many!).

Is it a many-parallel-large-writes workload without a lot of filesystem
manipulation (file creation/deletion, attribute updates)? Then you might
only need two MDSs for HA (active/standby). But when CephFS is used as a
regular filesystem with many clients and a lot of small IO, you might
run out of the performance of a single MDS. Add (many) more active MDSs
as you see fit. Keep in mind that multiple active MDSs make things a bit
more complex (each active daemon gets its own rank), and that before an
upgrade you have to scale back down to a single active MDS. If you know
your workload well enough, you can also pin directories to a specific
MDS rank.
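As a sketch of the knobs mentioned above (the filesystem name `cephfs` and the mount point are placeholders for your own):

```shell
# Allow a second active MDS (ranks 0 and 1)
ceph fs set cephfs max_mds 2

# Before an upgrade, scale back down to a single active MDS
ceph fs set cephfs max_mds 1

# Pin a directory tree to rank 0 so that one MDS serves it exclusively
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projectA
```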

> 
> These would be for our users scientific workloads, so they would need to
> provide reasonably high performance. For reference, we have 3060 6TB
> OSDs across 51 OSD hosts, and 6 dedicated RGW nodes.

It really depends on the workload. If there are a lot of file/directory
operations, the MDS needs to keep track of all of them and needs enough
RAM to cache the metadata (inodes/dentries). The more files and
directories, the more RAM you need. We don't have petabytes of storage
(only 39 TB for CephFS), but our MDSs have 256 GB of RAM for cache
because of all the little files and many directories we have. Prefer a
few fast cores over many slower cores.
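For what it's worth, the MDS cache size is controlled by `mds_cache_memory_limit` (the default is 1 GiB); a sketch, with an illustrative value rather than a recommendation:

```shell
# Raise the MDS cache limit to 64 GiB; the value is in bytes.
# Leave headroom: the MDS process can use noticeably more than this limit.
ceph config set mds mds_cache_memory_limit 68719476736
```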


> 
> The minimum specs are very modest (2-3GB RAM, a tiny amount of disk,
> similar networking to the OSD nodes), but I'm not sure how much going
> beyond that is likely to be useful in production.

MDSs don't generate a lot of traffic. Clients write directly to the OSDs
once they have acquired capabilities (caps) from the MDS.

> 
> I've also seen it suggested that an SSD-only pool is sensible for the
> CephFS metadata pool; how big is that likely to get?

Yes. CephFS, like RGW (bucket indexes), stores a lot of data in OMAP,
and the resulting RocksDB databases tend to get quite large, especially
when storing many small files and lots of directories. So if that
happens to be your workload, make sure you have plenty of flash devices
for it. We once put all of cephfs_metadata on 30 NVMe drives ... and
that was not a good thing. Spread that data out over as many SSDs /
NVMes as you can. Do your HDDs have their WAL/DB on flash? The metadata
pool itself does not take up a lot of space, but Mimic does not account
for the space it occupies as accurately as newer releases do. I'd guess
it's on the order of 5% of the CephFS data size, but again, this might
be wildly different on other deployments.
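To put the metadata pool on flash you can pin it to SSD-class OSDs via a CRUSH rule; a sketch (the rule name `metadata-ssd` and pool name `cephfs_metadata` are assumptions), with the ~5% sizing guess worked out for our 39 TB filesystem:

```shell
# Create a replicated CRUSH rule restricted to SSD-class devices
ceph osd crush rule create-replicated metadata-ssd default host ssd

# Move the CephFS metadata pool onto it
ceph osd pool set cephfs_metadata crush_rule metadata-ssd

# Back-of-the-envelope sizing from the ~5% guess above, for a 39 TB fs
awk 'BEGIN { printf "%.2f TB\n", 39 * 0.05 }'
```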

> 
> I'd be grateful for any pointers :)

I would buy CPUs with a high clock speed and ~4-8 cores. RAM as needed,
but I'd guess 32 GB is the minimum.

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


