Hello folks!

I'm designing a new Ceph cluster from scratch and I want to increase CephFS speed and decrease latency. Usually I build my OSD nodes with WAL+DB on NVMe in front of SAS/SATA SSDs, and I deploy the MDS and MON daemons on those same servers. This time an unusual idea came to mind, and on paper (with my limited knowledge) I think it has great potential and should perform better.

I have 5 racks, and the 3rd "middle" rack is my storage and management rack:

- In RACK-3 I'm going to locate 8x 1U OSD servers (spec: 2x E5-2690v4, 256GB RAM, 4x 25G, 2x 1.6TB PCIe NVMe "MZ-PLK3T20", 8x 4TB SATA SSD).
- My CephFS kernel clients are 40x GPU nodes located in RACK-1, 2, 4 and 5.

With my current workflow, every client request:

1- visits the rack data switch,
2- jumps to the main VPC switch via 2x 100G,
3- talks to the MDS servers,
4- goes back to the client with the answer,
5- and to access data it follows the same hops and visits the OSDs every time.

If I instead deploy a separate metadata pool served by 4x MDS servers at the top of RACK-1, 2, 4 and 5 (spec: 2x E5-2690v4, 128GB RAM, 2x 10G public, 2x 25G cluster, 2x 960GB U.2 NVMe "MZ-PLK3T20"), then every client reaches an in-rack MDS that is only one hop away. If the request is metadata-only, the MDS can usually answer from its cache without touching the OSD nodes in RACK-3 at all, and since the metadata pool OSDs would live on the NVMe drives of those same in-rack servers, even the MDS journal writes stay inside the rack (rough command sketches at the end of this mail).

Spreading the MDS servers with a separate metadata pool across all the racks should also take a lot of load off the main VPC switch in RACK-3.

If I'm not missing anything, only the recovery workload will suffer with this topology. What do you think?
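For completeness, the "usual build" I mention above is just the classic ceph-volume layout, roughly like this per OSD node (a minimal sketch; device paths are placeholders for the hardware above, and WAL colocates on the DB device when only --db-devices is given):

    # 8 SATA SSDs as OSDs, DB+WAL carved out of the 2 NVMe devices.
    ceph-volume lvm batch --bluestore \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi \
        --db-devices /dev/nvme0n1 /dev/nvme1n1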
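To make the metadata pool part concrete, this is roughly what I have in mind, assuming the 2x 960GB U.2 drives on each MDS node are deployed as normal OSDs carrying the device class "nvme" (rule names, pool names and PG counts are placeholders):

    # CRUSH rule that keeps metadata replicas on NVMe OSDs only, one per host.
    ceph osd crush rule create-replicated meta-nvme default host nvme

    # CRUSH rule for the bulk data on the RACK-3 SATA SSDs.
    ceph osd crush rule create-replicated data-ssd default host ssd

    # Dedicated pools and the filesystem on top of them.
    ceph osd pool create cephfs_metadata 64 64 replicated meta-nvme
    ceph osd pool create cephfs_data 1024 1024 replicated data-ssd
    ceph fs new cephfs cephfs_metadata cephfs_data

One caveat I'm aware of: with failure domain "host" and only one NVMe/MDS node per rack, a size-3 metadata pool always spans three racks, so metadata replica writes still cross the VPC switch over the cluster network; it's the client-facing read/cache path that stays in-rack.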
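And on the MDS side, with one MDS per client rack, something like this (filesystem name and directory paths are placeholders):

    # One active rank per client rack; with only 4 MDS daemons this
    # leaves no standby, so 3 active + 1 standby may be the safer split.
    ceph fs set cephfs max_mds 4

    # Pin each rack's working directory to "its" rank, so the kernel
    # clients in that rack normally talk to the in-rack daemon.
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/rack1
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/rack2

Without pinning, clients talk to whichever rank is authoritative for a given subtree, so the one-hop benefit only holds if each rack's jobs mostly stay under their own directory.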