Hi, I'm not yet familiar with Jewel, so take this with a grain of salt.

On 18/05/2016 16:36, Benjeman Meekhof wrote:
> We're in the process of tuning a cluster that currently consists of 3
> dense nodes with more to be added. The storage nodes have spec:
> - Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
> - 384 GB RAM
> - 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
>   LSI 9207-8e SAS 6Gbps

I'm not sure 20 cores are enough for 60 OSDs on Jewel. With Firefly I
think your performance would be limited by the CPUs, but Jewel is faster
AFAIK. That said, you could set up the 60 disks as RAID arrays to limit
the number of OSDs. This can be tricky, but some people have reported
doing it successfully (IIRC using RAID5, in order to limit both the
number of OSDs and the rebalancing events when a disk fails).

> - XFS filesystem on OSD data devs
> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
>   raid-1 device)

Your disks are rated at a maximum of ~200MB/s, so even with a
conservative estimate of 100-150MB/s per disk, journaling 30 disks
needs a write bandwidth of 3GB/s to 4.5GB/s on each NVMe. Your NVMe
devices will also die twice as fast, as they take twice the amount of
writes in RAID1.

The alternative - using the NVMe devices directly for journals - will
get better performance and fewer failures. The only drawback is that an
NVMe failing entirely (I'm not familiar with NVMe, but with SSDs you
often get write errors affecting a single OSD before a whole-device
failure) will bring down 15 OSDs at once. Note that replacing an NVMe
usually means stopping the whole node when not using hotplug PCIe, so
not losing the journals when one fails may not gain you as much as
anticipated if the cluster must rebalance anyway during the maintenance
operation where you replace the faulty NVMe (and might perform other
upgrades/swaps that were waiting).

> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb)

Seems adequate, although more bandwidth could be of some benefit.
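To put rough numbers on the journal and network sizing above, here is a
quick back-of-envelope sketch (the per-disk write rates, 30 disks per
journal device, 4 x 25Gb links and replication size 3 are the figures
from this thread, not measurements):

```python
# Journal side: each RAID-1 member absorbs the full write stream, so
# RAID-1 doubles total NVMe wear without raising per-device throughput.
disk_mbps_low, disk_mbps_high = 100, 150   # conservative per-disk estimate
disks_per_journal = 30                     # OSD journals per RAID-1 device

low_gbs = disks_per_journal * disk_mbps_low / 1000    # 3.0 GB/s
high_gbs = disks_per_journal * disk_mbps_high / 1000  # 4.5 GB/s
print(f"journal write bandwidth needed: {low_gbs:.1f}-{high_gbs:.1f} GB/s")

# Network side: 4 x 25 Gb/s = 100 Gb/s, i.e. ~12.5 GB/s each direction.
net_gbs = 4 * 25 / 8
# With replication size 3, each GB a client writes to a node triggers
# ~2 GB of outbound replica traffic, so sustained client writes on a
# single hot node top out at roughly half the link bandwidth.
print(f"link bandwidth: {net_gbs:.1f} GB/s, "
      f"max client write on a hot node: ~{net_gbs / 2:.2f} GB/s")
```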
This is a total of ~12GB/s full duplex. If Ceph is able to use the whole
disk bandwidth you will saturate this: if you get a hotspot on one node,
with a client capable of writing at 12GB/s to it and a replication size
of 3, you will only get half of this (as twice that amount will be sent
out to the replicas). So ideally you would have room for twice the
client bandwidth on the cluster network. In my experience this isn't a
problem (hotspots like this almost never happen, as client write traffic
is mostly distributed evenly across nodes), but having the headroom
avoids the risk of atypical access patterns becoming a problem, so it
seems like a good thing if it doesn't cost too much.

Note that if your total NVMe write bandwidth is more than the total disk
bandwidth, the journals act as buffers capable of absorbing short write
bursts (only if there are no reads of recent writes, which should almost
never happen for RBD but might for other uses), so halving the usable
NVMe bandwidth with RAID1 could limit your ability to handle these
bursts.

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com