Re: Hardware needs for MDS for HPC/OpenStack workloads?

Regarding MDS pinning: we have our home directories split into u{0..9}
for legacy reasons, and while adding more MDSes helped a little,
pinning certain u? directories to specific MDSes helped greatly. The
automatic migration of subtrees between MDSes killed performance. This
is an unusually good fit for pinning, as we have 10 practically
identical directories, but still.
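
For anyone wanting to try the same: the pin itself is just an extended
attribute on the directory, set from a client mount. A sketch of what
we do (paths and rank numbers are only illustrative of our layout):

  setfattr -n ceph.dir.pin -v 0 /cephfs/home/u0
  setfattr -n ceph.dir.pin -v 1 /cephfs/home/u1

and so on for u2..u9. Setting the value to -1 removes the pin and hands
the subtree back to the balancer.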

On Fri, Oct 23, 2020 at 2:04 AM Stefan Kooman <stefan@xxxxxx> wrote:
>
> On 2020-10-22 14:34, Matthew Vernon wrote:
> > Hi,
> >
> > We're considering the merits of enabling CephFS for our main Ceph
> > cluster (which provides object storage for OpenStack), and one of the
> > obvious questions is what sort of hardware we would need for the MDSs
> > (and how many!).
>
> Is it a workload of many parallel large writes without a lot of fs
> manipulation (file creation / deletion, attribute updates)? Then you
> might only need 2 MDSes for HA (active-standby). But when it is used
> as a regular fs with many clients and a lot of small IO, you might
> outgrow the performance of a single MDS. Add (many) more as you see
> fit. Keep in mind that multiple active MDSes make things a bit more
> complex (each active MDS gets its own rank), and that for upgrades you
> have to scale back down to a single active MDS first. You can pin
> directories to a single MDS if you know your workload well enough.
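>
> For example, to add a second active MDS and to scale back down before
> an upgrade (assuming the filesystem is named "cephfs"):
>
>   ceph fs set cephfs max_mds 2
>   ceph fs set cephfs max_mds 1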
>
> >
> > These would be for our users' scientific workloads, so they would need to
> > provide reasonably high performance. For reference, we have 3060 6TB
> > OSDs across 51 OSD hosts, and 6 dedicated RGW nodes.
>
> It really depends on the workload. If there are a lot of file /
> directory operations, the MDS needs to keep track of all of that and
> needs to be able to cache it as well (inodes / dentries). The more
> files and dirs, the more RAM you need. We don't have PBs of storage
> (only 39 TB for CephFS), but our MDSes have 256 GB of RAM to cache all
> the little files and many dirs we have. Prefer a few faster cores over
> many slower cores.
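>
> Most of that RAM only helps if the MDS cache limit is raised to match.
> A sketch (value is in bytes; 64 GiB here is just an illustration, size
> it to your own inode / dentry counts):
>
>   ceph config set mds mds_cache_memory_limit 68719476736
>
> You can check how full the cache is via the admin socket:
>
>   ceph daemon mds.<name> cache status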
>
>
> >
> > The minimum specs are very modest (2-3GB RAM, a tiny amount of disk,
> > similar networking to the OSD nodes), but I'm not sure how much going
> > beyond that is likely to be useful in production.
>
> MDSes don't generate a lot of traffic. Clients read and write directly
> to the OSDs once they have acquired capabilities (caps) from the MDS.
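>
> You can see how many caps each client currently holds via the MDS
> admin socket (the session listing includes a num_caps field per
> client):
>
>   ceph daemon mds.<name> session ls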
>
> >
> > I've also seen it suggested that an SSD-only pool is sensible for the
> > CephFS metadata pool; how big is that likely to get?
>
> Yes, but CephFS, like the RGW index, stores a lot of data in OMAP, and
> the RocksDB databases tend to get quite large, especially when storing
> many small files and lots of dirs. If that happens to be your
> workload, make sure you have plenty of SSDs / NVMe devices for it. We
> once put all of cephfs_metadata on 30 NVMe ... and that was not a good
> thing. Spread that data out over as many SSDs / NVMe devices as you
> can. Do your HDDs have their WAL / DB on flash? cephfs_metadata does
> not take up a lot of space, and Mimic does not account for occupied
> space as well as newer releases do, but I would guess it is on the
> order of 5% of the CephFS size. Again, this might be wildly different
> on other deployments.
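>
> If you want the metadata pool on flash only, one way is a CRUSH rule
> restricted to the ssd device class (the rule name below is just an
> example, and this assumes your OSDs have device classes set):
>
>   ceph osd crush rule create-replicated metadata-ssd default host ssd
>   ceph osd pool set cephfs_metadata crush_rule metadata-ssd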
>
> >
> > I'd be grateful for any pointers :)
>
> I would buy a CPU with a high clock speed and ~4-8 cores. RAM as
> needed, but I would guess 32 GB is the minimum.
>
> Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


