Hello Nathan,

>>>> I'm going to deploy a new cluster soon based on 6.4TB NVME PCI-E Cards, I will have only 1 NVME card per node and 38 nodes.
>>>>
>>>> The use case is to offer cephfs volumes for a k8s platform, I plan to use an EC-POOL 8+3 for the cephfs_data pool.
>>>>
>>>> Do you have recommendations for the setup or mistakes to avoid? I use ceph-ansible to deploy all my clusters.
>>>
>>> In order to get optimal performance out of NVMe, you will want very
>>> fast cores, and you will probably have to split each NVMe card into
>>> 2-4 OSD partitions in order to throw enough cores at it.

That's a good idea! If I have enough time, I'll try to do some benchmarks with 2 and 4 OSD partitions.

>> I've been trying unsuccessfully to convince some folks of the need for fast cores; there's the idea that the effect would be slight. Do
>> you have any numbers? I've also read a claim that each BlueStore will use 3-4 cores. They're listening to me though about splitting the
>> card into multiple OSDs.
>
> Bluestore will use about 4 cores, but in my experience, the maximum
> utilization I've seen has been something like: 100%, 100%, 50%, 50%
>
> So those first 2 cores are the bottleneck for pure OSD IOPS. This sort
> of pattern isn't uncommon in multithreaded programs. This was on HDD
> OSDs with DB/WAL on NVMe, as well as some small metadata OSDs on pure
> NVMe. SSD OSDs default to 2 threads per shard, and HDD to 1, but we
> had to set HDD to 2 as well when we enabled NVMe WAL/DB. Otherwise the
> OSDs ran out of CPU and failed to heartbeat when under load. I believe
> that if we had 50% faster cores, we might not have needed to do this.
>
> On SSDs/NVMe you can compensate for slower cores with more OSDs, but
> of course only for parallel operations. Anything that is
> serial+synchronous, not so much. I would expect something like 4 OSDs
> per NVMe, 4 cores per OSD. That's already 16 cores per node just for
> OSDs.
>
> Our bottleneck in practice is the Ceph MDS, which seems to use exactly
> 2 cores and has no setting to change this. As far as I can tell, if we
> had 50% faster cores just for the MDS, I would expect roughly +50%
> performance in terms of metadata ops/second. Each filesystem has its
> own rank-0 MDS, so this load will be split across daemons. The MDS can
> also use a ton of RAM (32GB) if the clients have a working set of
> 1 million+ files. Multi-MDS exists to further split the load, but is
> quite new and I would not trust it. CephFS in general is likely where
> you will have the most issues, as it is both new and complex compared
> to a simple object store. Having an MDS in standby-replay mode keeps
> its RAM cache synced with the active, so you get far faster failover
> (O(seconds) rather than O(minutes) with a few million file caps) but
> you use the same RAM again.
>
> So, IMHO, you will want at least:
>
> CPU:
> 16 cores per 1-card NVMe OSD node. 2 cores per filesystem (maybe 1 if
> you don't expect a lot of simultaneous load?)
>
> RAM:
> The Bluestore default is 4GB per OSD, so 16GB per node.
> ~32GB of RAM per active and standby-replay MDS if you expect file
> counts in the millions, so 64GB per filesystem.
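To make that concrete, here is roughly what I plan to try with ceph-ansible. This is only a sketch: the 4-way split, device path, pool and filesystem names, PG counts and memory value below are my own assumptions, nothing is tested yet.

  # group_vars/osds.yml -- let ceph-volume lvm batch split the PM1725b into 4 OSDs
  devices:
    - /dev/nvme0n1
  osds_per_device: 4

  # group_vars/all.yml -- cap OSD memory, since the nodes are shared with the k8s compute workload
  ceph_conf_overrides:
    osd:
      osd_memory_target: 4294967296   # 4 GiB per OSD (the BlueStore default)

And for the EC 8+3 data pool, something like this, keeping a small replicated pool as the default data pool and adding the EC pool on top of it (overwrites have to be enabled before CephFS can write to an EC pool):

  ceph osd erasure-code-profile set ec83 k=8 m=3 crush-failure-domain=host
  ceph osd pool create cephfs_data_ec 1024 1024 erasure ec83
  ceph osd pool set cephfs_data_ec allow_ec_overwrites true
  ceph fs add_data_pool cephfs cephfs_data_ec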
The context is:

3 Intel Server 1U for MONs/MDSs/MGRs services + K8s daemons:
  CPU     : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24c/48t)
  Memory  : 64GB
  Disk OS : 2x Intel SSD DC S3520 240GB

38 Dell C4140 1U for OSD nodes:
  CPU     : 2 x Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (28c/56t)
  Memory  : 384GB
  GPU     : 4 Nvidia V100 32GB NVLink
  Disk OS : M.2 240G
  NVME    : Dell 6.4TB NVME PCI-E Drive (Samsung PM1725b), only 1 slot available

Each server is used in a k8s cluster to give access to GPUs and CPUs for X-learning labs. Ceph has to share the CPU and memory with the compute K8s cluster.

> 128GB of RAM per node ought to do, if you have less than 14 filesystems?

I plan to have only 1 filesystem.

Thanks for all this useful information.

Best regards,

--
Yoann Moulin
EPFL IC-IT
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx