Re: Separate metadata pool in 3x MDS node

Özkan Göksu <ozkangksu@xxxxxxxxx> · Mon, 26 Feb 2024 18:25:32 +0300

Hello Anthony,

The hardware is built from second-hand parts and has no U.2 slots; servers
with U.2 bays cost 3x-4x more. I meant the PCI-E "MZ-PLK3T20".
I have to buy SFP cards anyway, and 25G costs only about $30 more than 10G,
so why not.
Yes, I'm thinking of pinning (clients > rack MDS).
I don't have problems with building, and I don't use the PG autoscaler.
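
For reference, a minimal sketch of what "pinned" could look like in
practice. CephFS pins a directory subtree to an MDS rank through the
ceph.dir.pin extended attribute, so the sketch assumes each client rack
works in its own top-level directory. The mount point, the directory names
and the rack-to-rank mapping are all hypothetical, and which MDS daemon
holds a given rank is decided by the monitors, so "rank 0 lives in rack 1"
is an assumption that would have to be checked on the real cluster.

```python
import os

# Hypothetical layout: one top-level CephFS directory per client rack and
# one active MDS rank per rack. None of these names come from the thread.
RACK_TO_RANK = {"rack1": 0, "rack2": 1, "rack4": 2, "rack5": 3}


def pin_plan(mount_root, rack_to_rank):
    """Return (directory, rank) pairs, one per rack, sorted by rank."""
    pairs = [
        (os.path.join(mount_root, rack), rank)
        for rack, rank in rack_to_rank.items()
    ]
    return sorted(pairs, key=lambda pair: pair[1])


def apply_pins(plan):
    """Set ceph.dir.pin on each directory (needs a mounted CephFS, Linux)."""
    for directory, rank in plan:
        os.setxattr(directory, "ceph.dir.pin", str(rank).encode())


if __name__ == "__main__":
    # Only print the plan; call apply_pins(plan) on a real mount.
    for directory, rank in pin_plan("/mnt/cephfs", RACK_TO_RANK):
        print(f"{directory} -> MDS rank {rank}")
```

Pinning follows directories, not clients, so this only keeps a rack's
clients on "their" MDS if each rack really works on a different tree.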

Hello David,

My system is all internal and I only use one /20 subnet at layer 2.
Yes, I'm thinking of distributing the metadata pool across racks 1, 2, 4
and 5, because my clients search a lot and I just want to shorten the path
for metadata requests.
I have redundant rack PDUs, so power is not a problem, and I only have a
VPC (2x N9K switches) in the main rack 3. That's why I keep everything
related to data and management in rack 3 as usual.
Normally I always use WAL+DB on NVMe with SATA OSDs. The only thing I
wonder is whether a separate metadata pool on NVMe located in the client
racks would give any benefit.
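
To make that question concrete, here is a rough back-of-the-envelope model
in Python. It counts one-way switch traversals for a single metadata
request and nothing else. Everything in it is an assumption, not a
measurement: rack 3 hosts hang directly off the VPC pair, the metadata pool
is spread evenly over the four client racks so the OSD the MDS needs is
in-rack 1 time in 4, and the MDS only touches an OSD on a cache miss.

```python
def expected_switch_hops(cache_hit_ratio, client_to_mds, mds_to_osd):
    """Expected one-way switch traversals for one metadata request.

    The client always reaches the MDS; the MDS only goes on to a metadata
    OSD when the request misses its in-memory cache.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return client_to_mds + (1.0 - cache_hit_ratio) * mds_to_osd


def current_layout(cache_hit_ratio):
    # Client -> rack switch -> VPC -> MDS is 2 switches.
    # MDS and OSDs are both in rack 3 behind the VPC pair: 1 switch.
    return expected_switch_hops(cache_hit_ratio, client_to_mds=2, mds_to_osd=1)


def proposed_layout(cache_hit_ratio, metadata_racks=4):
    # Client -> rack switch -> in-rack MDS is 1 switch.
    p_local = 1.0 / metadata_racks
    # OSD in the same rack: 1 switch.
    # OSD in another rack: rack switch -> VPC -> rack switch = 3 switches.
    mds_to_osd = p_local * 1 + (1.0 - p_local) * 3
    return expected_switch_hops(cache_hit_ratio, client_to_mds=1, mds_to_osd=mds_to_osd)


if __name__ == "__main__":
    for hit in (0.0, 0.5, 1.0):
        print(
            f"hit={hit:.1f} "
            f"current={current_layout(hit):.2f} "
            f"proposed={proposed_layout(hit):.2f}"
        )
```

Under these assumptions the in-rack MDS only comes out ahead when the MDS
cache absorbs a good share of the requests (above a hit ratio of about one
third in this toy model). The model ignores journal writes and replication
traffic between metadata OSDs in different racks, which both cross the VPC.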

Regards.

On Sun, 25 Feb 2024 at 00:07, David C. <david.casier@xxxxxxxx> wrote:

> Hello,
>
> Does each rack work on a different tree, or is everything parallelized?
> Would the metadata pools be distributed over racks 1, 2, 4 and 5?
> If they are distributed, then even if the MDS being addressed is on the
> same switch as the client, that MDS will still have to read from and write
> to (NVMe) OSDs in the other racks (among 1, 2, 4 and 5).
>
> In any case, the exercise is interesting.
>
>
>
> On Sat, 24 Feb 2024 at 19:56, Özkan Göksu <ozkangksu@xxxxxxxxx> wrote:
>
>> Hello folks!
>>
>> I'm designing a new Ceph storage cluster from scratch and I want to
>> increase CephFS speed and decrease latency.
>> Usually I build with WAL+DB on NVMe and SAS/SATA SSDs, and I deploy the
>> MDS and MON daemons on the same servers.
>> This time a weird idea came to my mind; with my limited knowledge, I
>> think it has great potential and will perform better, at least on paper.
>>
>> I have 5 racks, and the 3rd ("middle") rack is my storage and management
>> rack.
>>
>> - In RACK-3 I'm going to place 8x 1U OSD servers (spec: 2x E5-2690V4,
>> 256GB, 4x 25G, 2x 1.6TB PCI-E NVMe "MZ-PLK3T20", 8x 4TB SATA SSD)
>>
>> - My CephFS kernel clients are 40x GPU nodes located in RACK-1,2,4,5
>>
>> With my current workflow, all the clients:
>> 1- visit the rack data switch,
>> 2- jump to the main VPC switch via 2x100G,
>> 3- talk with the MDS servers,
>> 4- go back to the client with the answer,
>> 5- to access data, follow the same hops and visit the OSDs every time.
>>
>> If I deploy a separate metadata pool using 4x MDS servers at the top of
>> RACK-1,2,4,5 (spec: 2x E5-2690V4, 128GB, 2x 10G (public), 2x 25G
>> (cluster), 2x 960GB U.2 NVMe "MZ-PLK3T20"),
>> then all the clients will send their requests directly to the in-rack MDS
>> servers, 1 hop away, and if the request is metadata-only, the MDS node
>> doesn't need to redirect the request to the OSD nodes.
>> Also, placing MDS servers with a separate metadata pool across all the
>> racks will reduce the high load on the main VPC switch in RACK-3.
>>
>> If I'm not missing anything, then only the recovery workload will suffer
>> with this topology.
>>
>> What do you think?
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>