Re: Full Flash NVMe Cluster recommendation


 



Bluestore will use about 4 cores, but in my experience, the maximum
utilization I've seen has been something like: 100%, 100%, 50%, 50%

So those first 2 cores are the bottleneck for pure OSD IOPS. This sort
of pattern isn't uncommon in multithreaded programs. This was on HDD
OSDs with DB/WAL on NVMe, as well as some small metadata OSDs on pure
NVMe. SSD OSDs default to 2 threads per shard, and HDD to 1, but we
had to set HDD to 2 as well when we enabled NVMe WAL/DB. Otherwise the
OSDs ran out of CPU and failed to heartbeat when under load. I believe
that if we had 50% faster cores, we might not have needed to do this.
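
For reference, the knob we changed looks roughly like this (option names
as of Luminous/Nautilus-era Ceph; check your release's docs):

    # [osd] section of ceph.conf on the OSD hosts
    # HDD OSDs default to 1 op thread per shard; with the WAL/DB on NVMe
    # that wasn't enough, so we matched the SSD default of 2.
    osd_op_num_threads_per_shard_hdd = 2

or at runtime with "ceph config set osd osd_op_num_threads_per_shard_hdd 2".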

On SSDs/NVMe you can compensate for slower cores with more OSDs, but
of course only for parallel operations. Anything that is
serial+synchronous, not so much. I would expect something like 4 OSDs
per NVMe, 4 cores per OSD. That's already 16 cores per node just for
OSDs.
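
Since ceph-ansible is in the picture: splitting the card should just be
osds_per_device in the group vars (variable name as of the recent stable
branches; verify against yours), which drives ceph-volume's lvm batch mode:

    # group_vars/osds.yml
    osds_per_device: 4

    # roughly equivalent to running, per node:
    #   ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1

The device path is just an example; adjust to however the card shows up.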

Our bottleneck in practice is the Ceph MDS, which seems to use exactly
2 cores and has no setting to change this. As far as I can tell, if we
had 50% faster cores just for the MDS, I would expect roughly +50%
performance in terms of metadata ops/second. Each filesystem has its
own rank-0 MDS, so this load will be split across daemons. The MDS can
also use a ton of RAM (32GB) if the clients have a working set of 1
million+ files. Multi-MDS exists to further split the load, but it is
quite new and I would not trust it. CephFS in general is likely where
you will have the most issues, as it is both new and complex compared
to a simple object store. Having an MDS in standby-replay mode keeps
its RAM cache synced with the active MDS, so you get far faster
failover (O(seconds) rather than O(minutes) with a few million file
caps), but you use the same RAM again.
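
If it helps, the relevant knobs look roughly like this (the filesystem
name is a placeholder, and note that mds_cache_memory_limit bounds the
cache, not the daemon's total RSS, which tends to sit somewhat above it):

    # enable standby-replay for one filesystem (Nautilus and later)
    ceph fs set <fs_name> allow_standby_replay true

    # give the MDS cache ~16 GiB within a ~32 GiB per-daemon RAM budget
    ceph config set mds mds_cache_memory_limit 17179869184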

So, IMHO, you will want at least:
CPU:
16 cores per 1-card NVMe OSD node. 2 cores per filesystem (maybe 1 if
you don't expect a lot of simultaneous load?)

RAM:
The Bluestore default is 4GB per OSD, so 16GB per node (that's
osd_memory_target; see the sketch after this list).
~32GB of RAM per active and standby-replay MDS if you expect file
counts in the millions, so 64GB per filesystem.

128GB of RAM per node ought to do, if you have fewer than 14 filesystems?
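
The 4GB above is osd_memory_target's default, for what it's worth; a
sketch of raising it if the nodes end up with spare RAM (it's a target,
not a hard cap, so leave some slack):

    # default is 4 GiB per OSD; bump to e.g. 8 GiB if there's headroom
    ceph config set osd osd_memory_target 8589934592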

YMMV.

On Fri, Nov 15, 2019 at 11:17 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> I’ve been trying unsuccessfully to convince some folks of the need for fast cores; there’s the idea that the effect would be slight.  Do you have any numbers?  I’ve also read a claim that each BlueStore will use 3-4 cores.
> They’re listening to me though about splitting the card into multiple OSDs.
>
> > On Nov 15, 2019, at 7:38 AM, Nathan Fish <lordcirth@xxxxxxxxx> wrote:
> >
> > In order to get optimal performance out of NVMe, you will want very
> > fast cores, and you will probably have to split each NVMe card into
> > 2-4 OSD partitions in order to throw enough cores at it.
> >
> > On Fri, Nov 15, 2019 at 10:24 AM Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
> >>
> >> Hello,
> >>
> >> I'm going to deploy a new cluster soon based on 6.4TB NVME PCI-E Cards, I will have only 1 NVME card per node and 38 nodes.
> >>
> >> The use case is to offer cephfs volumes for a k8s platform, I plan to use an EC-POOL 8+3 for the cephfs_data pool.
> >>
> >> Do you have recommendations for the setup or mistakes to avoid? I use ceph-ansible to deploy all my clusters.
> >>
> >> Best regards,
> >>
> >> --
> >> Yoann Moulin
> >> EPFL IC-IT
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



