Hello Nathan,

>>>> I'm going to deploy a new cluster soon based on 6.4TB NVME PCI-E Cards, I will have only 1 NVME card per node and 38 nodes.
>>>>
>>>> The use case is to offer cephfs volumes for a k8s platform, I plan to use an EC-POOL 8+3 for the cephfs_data pool.
>>>>
>>>> Do you have recommendations for the setup or mistakes to avoid? I use ceph-ansible to deploy all my clusters.
>>>
>>> In order to get optimal performance out of NVMe, you will want very
>>> fast cores, and you will probably have to split each NVMe card into
>>> 2-4 OSD partitions in order to throw enough cores at it.

That's a good idea! If I have enough time, I'll try to do some benchmarks with 2 and 4 OSD partitions.

>> I've been trying unsuccessfully to convince some folks of the need for fast cores; there's the idea that the effect would be slight. Do
>> you have any numbers? I've also read a claim that each BlueStore will use 3-4 cores. They're listening to me though about splitting the
>> card into multiple OSDs.
>
> Bluestore will use about 4 cores, but in my experience, the maximum
> utilization I've seen has been something like: 100%, 100%, 50%, 50%
>
> So those first 2 cores are the bottleneck for pure OSD IOPS. This sort
> of pattern isn't uncommon in multithreaded programs. This was on HDD
> OSDs with DB/WAL on NVMe, as well as some small metadata OSDs on pure
> NVMe. SSD OSDs default to 2 threads per shard, and HDD to 1, but we
> had to set HDD to 2 as well when we enabled NVMe WAL/DB. Otherwise the
> OSDs ran out of CPU and failed to heartbeat when under load. I believe
> that if we had 50% faster cores, we might not have needed to do this.
>
> On SSDs/NVMe you can compensate for slower cores with more OSDs, but
> of course only for parallel operations. Anything that is
> serial+synchronous, not so much. I would expect something like 4 OSDs
> per NVMe, 4 cores per OSD. That's already 16 cores per node just for
> OSDs.
>
> Our bottleneck in practice is the Ceph MDS, which seems to use exactly
> 2 cores and has no setting to change this. As far as I can tell, if we
> had 50% faster cores just for the MDS, I would expect roughly +50%
> performance in terms of metadata ops/second. Each filesystem has its
> own rank-0 MDS, so this load will be split across daemons. The MDS can
> also use a ton of RAM (32GB) if the clients have a working set of
> 1 million+ files. Multi-MDS exists to further split the load, but is
> quite new and I would not trust it. CephFS in general is likely where
> you will have the most issues, as it is both new and complex compared
> to a simple object store. Having an MDS in standby-replay mode keeps
> its RAM cache synced with the active, so you get far faster failover
> (O(seconds) rather than O(minutes) with a few million file caps) but
> you use the same RAM again.
>
> So, IMHO, you will want at least:
>
> CPU:
> 16 cores per 1-card NVMe OSD node. 2 cores per filesystem (maybe 1 if
> you don't expect a lot of simultaneous load?)
>
> RAM:
> The Bluestore default is 4GB per OSD, so 16GB per node.
> ~32GB of RAM per active and standby-replay MDS if you expect file
> counts in the millions, so 64GB per filesystem.
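To make that concrete, here is roughly what I plan to try with ceph-ansible. This is only a sketch: the 4-way split, device path, pool and filesystem names, PG counts and memory value below are my own assumptions, nothing is tested yet.

  # group_vars/osds.yml -- let ceph-volume lvm batch split the PM1725b into 4 OSDs
  devices:
    - /dev/nvme0n1
  osds_per_device: 4

  # group_vars/all.yml -- cap OSD memory, since the nodes are shared with the k8s compute workload
  ceph_conf_overrides:
    osd:
      osd_memory_target: 4294967296   # 4 GiB per OSD (the BlueStore default)

And for the EC 8+3 data pool, something like this, keeping a small replicated pool as the default data pool and adding the EC pool on top of it (overwrites have to be enabled before CephFS can write to an EC pool):

  ceph osd erasure-code-profile set ec83 k=8 m=3 crush-failure-domain=host
  ceph osd pool create cephfs_data_ec 1024 1024 erasure ec83
  ceph osd pool set cephfs_data_ec allow_ec_overwrites true
  ceph fs add_data_pool cephfs cephfs_data_ec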
The context is:

3 Intel Server 1U for MONs/MDSs/MGRs services + K8s daemons:
  CPU     : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24c/48t)
  Memory  : 64GB
  Disk OS : 2x Intel SSD DC S3520 240GB

38 Dell C4140 1U for OSD nodes:
  CPU     : 2 x Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (28c/56t)
  Memory  : 384GB
  GPU     : 4 Nvidia V100 32GB NVLink
  Disk OS : M.2 240G
  NVME    : Dell 6.4TB NVME PCI-E Drive (Samsung PM1725b), only 1 slot available

Each server is used in a k8s cluster to give access to GPUs and CPUs for X-learning labs. Ceph has to share the CPU and memory with the compute K8s cluster.

> 128GB of RAM per node ought to do, if you have less than 14 filesystems?

I plan to have only 1 filesystem.

Thanks for all this useful information.

Best regards,

--
Yoann Moulin
EPFL IC-IT
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx