Hi Lionel,

These are all very good points we should consider, thanks for the analysis.
Just a couple of clarifications:

- The NVMe devices in this system are actually slotted in hot-plug front
bays, so a failed one can be swapped online. However, I do see your point
about this otherwise being a non-optimal config.

- Our 20 physical cores present 40 hyper-threaded cores to the system,
which we are hoping is adequate for 60 OSDs without RAID devices. My
experience in other contexts leads me to believe a hyper-threaded core
performs pretty much the same as a physical core (perhaps with some
exceptions depending on the specific workload).

regards,
Ben

On Wed, May 18, 2016 at 12:02 PM, Lionel Bouton <lionel+ceph@xxxxxxxxxxx> wrote:
> Hi,
>
> I'm not yet familiar with Jewel, so take this with a grain of salt.
>
> On 18/05/2016 16:36, Benjeman Meekhof wrote:
>> We're in the process of tuning a cluster that currently consists of 3
>> dense nodes, with more to be added. The storage nodes have this spec:
>> - Dell R730xd, 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
>> - 384 GB RAM
>> - 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
>> LSI 9207-8e SAS 6Gbps
>
> I'm not sure 20 cores will be enough for 60 OSDs on Jewel. With Firefly
> I think your performance would be limited by the CPUs, but Jewel is
> faster AFAIK.
> That said, you could set up the 60 disks as RAID arrays to limit the
> number of OSDs. This can be tricky, but some people have reported doing
> so successfully (IIRC using RAID5 in order to limit both the number of
> OSDs and the rebalancing events when a disk fails).
>
>> - XFS filesystem on OSD data devs
>> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
>> raid-1 device)
>
> Your disks are rated at a maximum of ~200MB/s, so even with a
> conservative estimate of 100-150MB/s, for 30 disks you'd need a write
> bandwidth of 3GB/s to 4.5GB/s on each NVMe. Your NVMe will also die
> twice as fast, as they take twice the amount of writes in RAID1. The
> alternative - using the NVMe directly for journals - will get better
> performance and see fewer failures. The only drawback is that an NVMe
> failing entirely (I'm not familiar with NVMe, but with SSDs you often
> get write errors affecting a single OSD before a whole-device failure)
> will bring down 15 OSDs at once.
> Note that replacing an NVMe usually means stopping the whole node when
> not using hot-plug PCIe, so not losing the journals when one fails may
> not gain you as much as anticipated if the cluster must rebalance
> anyway during the maintenance window where you replace the faulty NVMe
> (and might perform other upgrades/swaps that were waiting).
>
>> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb)
>
> Seems adequate, although more bandwidth could be of some benefit.
>
> This is a total of ~12GB/s full duplex. If Ceph is able to use the
> whole disk bandwidth you will saturate this: if you get a hot spot on
> one node with a client capable of writing at 12GB/s to it and have a
> replication size of 3, you will only get half of this (as twice this
> amount will be sent to the replicas). So ideally you would have room
> for twice the client bandwidth on the cluster network. In my
> experience this isn't a problem (hot spots like this almost never
> happen, as client write traffic is mostly distributed evenly over the
> nodes), but having the headroom avoids the risk of atypical access
> patterns becoming a problem, so it seems like a good thing if it
> doesn't cost too much.
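>
> A quick sketch of that arithmetic, purely illustrative: it assumes a
> replication size of 3 and that all 4 x 25Gb ports are usable for the
> hot node's traffic, everything else is just the figures quoted above.
>
> link_bw_gbps = 4 * 25                # 100 Gb/s each way (full duplex)
> node_bw_gbytes = link_bw_gbps / 8.0  # ~12.5 GB/s each way
> replication = 3
> # in the single-hot-node scenario, every GB accepted from clients is
> # sent (replication - 1) more times to the other replicas, so outbound
> # traffic caps the client write rate
> max_client_write = node_bw_gbytes / (replication - 1)
> print(max_client_write)              # ~6.2 GB/s, about half the link bw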
>
> Note that if your total NVMe write bandwidth is more than the total
> disk bandwidth, the journals act as buffers capable of absorbing short
> write bursts (only if there are no reads of recently written data,
> which should almost never happen for RBD but might for other uses), so
> halving that bandwidth with RAID1 could limit your ability to handle
> such bursts.
>
> Best regards,
>
> Lionel
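P.S. For the archives, here is the quick back-of-the-envelope I put
together from your numbers to compare the two journal layouts. It is
only a rough sketch: the 1.5GB/s sustained write rate per NVMe is an
assumed placeholder rather than a datasheet figure for our drives, and
the HDD rates are the conservative 100-150MB/s range from your mail.

# Rough journal write-load sketch for 60 OSDs and 4 x 400GB NVMe.
# nvme_write is an assumed placeholder, not a measured or spec number.
hdd_low, hdd_high = 0.10, 0.15               # GB/s per spinner, conservative
nvme_write = 1.5                             # GB/s per NVMe (assumption)
n_osds, n_nvme = 60, 4

# RAID1 layout: 2 mirrors with 30 journals each; a mirror can only
# absorb writes at the speed of a single NVMe.
raid1_load = (30 * hdd_low, 30 * hdd_high)       # 3.0-4.5 GB/s per mirror
raid1_usable = (n_nvme // 2) * nvme_write        # ~3.0 GB/s total

# Direct layout: 15 journals per NVMe, each byte written only once.
direct_load = (15 * hdd_low, 15 * hdd_high)      # 1.5-2.25 GB/s per NVMe
direct_usable = n_nvme * nvme_write              # ~6.0 GB/s total

disk_total = (n_osds * hdd_low, n_osds * hdd_high)   # 6-9 GB/s aggregate

print("per-mirror journal load (RAID1): %.1f-%.1f GB/s" % raid1_load)
print("per-NVMe journal load (direct):  %.2f-%.2f GB/s" % direct_load)
print("usable NVMe write bw, RAID1 vs direct: %.1f vs %.1f GB/s"
      % (raid1_usable, direct_usable))
print("aggregate HDD write bw: %.0f-%.0f GB/s" % disk_total)

Under those assumptions the RAID1 layout is already short of per-mirror
bandwidth before any burst buffering, while going direct roughly halves
the per-device load, which matches your analysis.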