Re: Separate metadata pool with 3x MDS nodes

> 
> I'm designing a new Ceph cluster from scratch and I want to increase CephFS
> speed and decrease latency.
> I usually build with WAL+DB on NVMe and SAS/SATA SSDs.

Just go with pure-NVMe servers.  NVMe SSDs shouldn't cost much, if anything, more than the few remaining SATA or especially SAS SSDs, and you don't have to pay for an anachronistic HBA.  Your OSD lifecycle is way simpler too.  And five years from now you won't have to search eBay for additional drives.

> I have 5 racks

Nice.

> and the 3rd "middle" rack is my storage and management rack.

Consider striping all of your services over all 5 racks for fault tolerance.  That also balances out your power and network usage.
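
If you do spread OSD hosts across all five racks, a CRUSH rule with rack as the failure domain keeps each replica in a different rack.  A rough sketch of the idea -- the bucket, host, and pool names here (rack1, osd-host-01, cephfs_data) are placeholders for whatever you actually use:

  # Create a rack bucket, hang it off the default root, and move a host into it
  ceph osd crush add-bucket rack1 rack
  ceph osd crush move rack1 root=default
  ceph osd crush move osd-host-01 rack=rack1

  # Replicated rule that puts each copy in a separate rack, then point the pool at it
  ceph osd crush rule create-replicated rep-by-rack default rack
  ceph osd pool set cephfs_data crush_rule rep-by-rack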


> - At RACK-3 I'm going to locate 8x 1U OSD servers (Spec: 2x E5-2690V4, 256GB,
> 4x 25G, 2x 1.6TB PCIe NVMe "MZ-PLK3T20", 8x 4TB SATA SSD)

You're limited to older / used servers?  4x 25GE is dramatic overkill with these resources.  Put your available dollars toward better drives and servers if you can.  Are these your bulk data OSDs?

Is this all pre-existing gear?  The MZ-PLK3T20 is a high-durability AIC from 7 years ago -- and AICs are all but history in the NVMe world.  A U.2 / E1.S / E3.S system would be more future-proof if you can swing it.

> 
> - My CephFS kernel clients are 40x GPU nodes located in RACK-1,2,4,5
> 
> With my current workflow, all the clients:
> 1- go through the rack data switch,
> 2- jump to the main VPC switch via 2x 100G,
> 3- talk to the MDS servers,
> 4- go back to the client with the answer,
> 5- to access data, follow the same hops and visit the OSDs every time.

> 
> If I deploy a separate metadata pool using 4x MDS servers at the top of RACK-1,2,4,5 (Spec: 2x E5-2690V4, 128GB, 2x 10G (public), 2x 25G (cluster),
> 2x 960GB U.2 NVMe "MZ-PLK3T20")

Do you mean the MZ-WLK3T20?  You don't need a separate cluster / replication network, especially with small numbers of slow OSDs.
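
On the network point: a single public network is generally all you need; if you simply never define cluster_network, replication traffic shares the same interfaces.  A minimal sketch, with 10.1.0.0/24 standing in for your real subnet:

  # Only the public network is set; with no cluster_network there is no separate replication net
  ceph config set global public_network 10.1.0.0/24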

> Then all the clients will make requests directly to the in-rack, one-hop-away MDS

Are your clients and ranks pinned such that each rack of clients will necessarily talk to the MDS in that rack?  Don't assume that Ceph will do that without action on your part.
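
CephFS won't steer a client to the nearest rank by itself; subtree pinning is per-directory, so per-rack locality only works if your directory layout lines up with your racks.  A minimal sketch, assuming a filesystem named cephfs and per-rack top-level directories (both are assumptions on my part):

  # One active rank per rack of clients
  ceph fs set cephfs max_mds 4

  # Pin each per-rack directory to a specific rank (paths are hypothetical)
  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/rack1
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/rack2

Note that with only 4 MDS daemons and max_mds 4 you'd have no standby, so you'd probably want a fifth daemon somewhere, or max_mds 3.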

> servers, and if the request is metadata-only, the MDS node doesn't need to redirect it to the OSD nodes.

Others with more CephFS experience may weigh in, but you might do well to have a larger number of SSDs for the CephFS metadata pool.  In any case, make sure that you have a decent number of PGs in that pool, i.e. more than the autoscaler will give it by default.
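
For the metadata pool itself, a device-class CRUSH rule plus an explicit PG floor covers both points.  A sketch, assuming the pool is called cephfs_metadata and your fast drives report the nvme device class (adjust names to match your cluster):

  # Keep metadata on the NVMe class, replicated across racks
  ceph osd crush rule create-replicated meta-nvme default rack nvme
  ceph osd pool set cephfs_metadata crush_rule meta-nvme

  # Give the pool more PGs than the autoscaler would pick on its own;
  # 128 is just an example, and pg_num_min keeps the autoscaler from shrinking it
  ceph osd pool set cephfs_metadata pg_num 128
  ceph osd pool set cephfs_metadata pg_num_min 128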

> Also, locating MDS servers with a separate metadata pool across all the
> racks will reduce the high load on the main VPC switch in RACK-3.
> 
> If I'm not missing anything, then only the recovery workload will suffer with
> this topology.

I wouldn't worry so much about network hops.

> 
> What do you think?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


