Hi Christian,

Thanks for your insights. To answer your question, the NVMe devices
appear to be some variety of Samsung:

Model: Dell Express Flash NVMe 400GB
Manufacturer: SAMSUNG
Product ID: a820

regards,
Ben

On Wed, May 18, 2016 at 10:01 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Wed, 18 May 2016 12:32:25 -0400 Benjeman Meekhof wrote:
>
>> Hi Lionel,
>>
>> These are all very good points we should consider, thanks for the
>> analysis. Just a couple of clarifications:
>>
>> - NVMe in this system are actually slotted in hot-plug front bays, so a
>> failure can be swapped online. However, I do see your point about this
>> otherwise being a non-optimal config.
>>
> What NVMes are these exactly? DC P3700?
> With Intel you can pretty much rely on them not to die before their time
> is up, so monitor wearout levels religiously and automatically (Nagios
> etc.).
> At a low node count like yours it is understandable to not want to lose
> 15 OSDs because an NVMe failed, but your performance and cost are both
> not ideal, as Lionel said.
>
> I guess you're happy with what you have, but as I mentioned in this
> thread also about RAIDed OSDs, there is a chassis that does basically
> what you have now while saving 1U:
> https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm
>
> This can also optionally have 6 NVMes, hot-swappable.
>
>> - Our 20 physical cores come out to be 40 HT cores to the system, which
>> we are hoping is adequate for 60 OSDs without RAID devices. My
>> experiences in other contexts lead me to believe a hyper-threaded core
>> is pretty much the same as a physical core (perhaps with some exceptions
>> depending on specific cases).
>>
> It all depends; if you had no SSD journals at all I'd say you could
> scrape by, barely.
> With NVMes for journals, especially if you should decide to use them
> individually with 15 OSDs per NVMe, I'd expect CPU to become the
> bottleneck when dealing with a high number of small IOPS.
>
> Regards,
>
> Christian
>
>> regards,
>> Ben
>>
>> On Wed, May 18, 2016 at 12:02 PM, Lionel Bouton
>> <lionel+ceph@xxxxxxxxxxx> wrote:
>> > Hi,
>> >
>> > I'm not yet familiar with Jewel, so take this with a grain of salt.
>> >
>> > On 18/05/2016 16:36, Benjeman Meekhof wrote:
>> >> We're in the process of tuning a cluster that currently consists of 3
>> >> dense nodes, with more to be added. The storage nodes have this spec:
>> >> - Dell R730xd, 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
>> >> - 384 GB RAM
>> >> - 60 x 8TB HGST HUH728080AL5204 in an MD3060e enclosure attached via
>> >> 2 x LSI 9207-8e SAS 6Gbps
>> >
>> > I'm not sure if 20 cores is enough for 60 OSDs on Jewel. With Firefly
>> > I think your performance would be limited by the CPUs, but Jewel is
>> > faster AFAIK.
>> > That said, you could set up the 60 disks as RAID arrays to limit the
>> > number of OSDs. This can be tricky, but some people have reported
>> > doing so successfully (IIRC using RAID5 in order to limit both the
>> > number of OSDs and the rebalancing events when a disk fails).
>> >
>> >> - XFS filesystem on OSD data devices
>> >> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
>> >> RAID-1 device)
>> >
>> > Your disks are rated at a maximum of ~200MB/s, so even with a
>> > conservative estimate of 100-150MB/s, for 30 disks you'd need a write
>> > bandwidth of 3GB/s to 4.5GB/s on each NVMe. Your NVMes will die twice
>> > as fast, as they will take twice the amount of writes in RAID1. The
>> > alternative - using the NVMes directly for journals - will get better
>> > performance and fewer failures. The only drawback is that an NVMe
>> > failing entirely (I'm not familiar with NVMe, but with SSDs you often
>> > get write errors affecting a single OSD before a whole-device failure)
>> > will bring down 15 OSDs at once. Note that replacing an NVMe usually
>> > means stopping the whole node when not using hot-plug PCIe, so not
>> > losing the journals when one fails may not gain you as much as
>> > anticipated if the cluster must rebalance anyway during the
>> > maintenance operation where you replace the faulty NVMe (and might
>> > perform other upgrades/swaps that were waiting).
>> >
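For reference, a rough back-of-the-envelope sketch of the journal-bandwidth
arithmetic above. The 60-disk / 4-NVMe layout comes from the thread; the
100-150MB/s per-HDD sustained write figure is the same conservative
assumption Lionel uses, not a measurement:

# Rough journal-bandwidth sketch for the two layouts discussed above:
# 2 x RAID-1 NVMe pairs vs. 4 NVMes used directly for journals.
hdds = 60                    # data disks per node
nvmes = 4                    # 400GB NVMe journal devices per node
hdd_write_mb_s = (100, 150)  # assumed conservative sustained write per HDD

for per_disk in hdd_write_mb_s:
    node_total = hdds * per_disk                # MB/s of journal traffic per node
    raid1_per_nvme = node_total / (nvmes // 2)  # each RAID-1 member sees 30 disks' writes
    direct_per_nvme = node_total / nvmes        # used directly, each NVMe sees 15 disks' writes
    print(f"{per_disk} MB/s per HDD: {raid1_per_nvme / 1000:.2f} GB/s per NVMe (RAID-1), "
          f"{direct_per_nvme / 1000:.2f} GB/s per NVMe (direct)")

With the 2 x RAID-1 layout each NVMe absorbs the journal stream of 30 disks
(3-4.5GB/s) instead of 15 (1.5-2.25GB/s), which is where the "twice the
writes" point comes from.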
>> >> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb)
>> >
>> > Seems adequate, although more bandwidth could be of some benefit.
>> >
>> > This is a total of ~12GB/s full duplex. If Ceph is able to use the
>> > whole disk bandwidth you will saturate this: if you get a hotspot on
>> > one node with a client capable of writing at 12GB/s to it and have a
>> > replication size of 3, you will get only half of this (as twice this
>> > amount will be sent out to the replicas). So ideally you would have
>> > room for twice the client bandwidth on the cluster network. In my
>> > experience this isn't a problem (hot spots like this almost never
>> > happen, as client write traffic is mostly distributed evenly over the
>> > nodes), but having the headroom avoids the risk of atypical access
>> > patterns becoming a problem, so it seems like a good thing if it
>> > doesn't cost too much.
>> > Note that if your total NVMe write bandwidth is more than the total
>> > disk bandwidth, the journals act as buffers capable of handling short
>> > write bursts (only if there's no read on recent writes, which should
>> > almost never happen for RBD but might for other uses), so halving that
>> > bandwidth with RAID1 could limit your ability to handle these bursts.
>> >
>> > Best regards,
>> >
>> > Lionel
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
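On Christian's advice above to monitor wearout levels automatically (Nagios
etc.), a minimal sketch of such a check follows. It assumes nvme-cli is
installed and that its JSON smart-log output exposes a percentage_used field
(field names can vary between nvme-cli versions); the device path and
thresholds are placeholders to adapt, so treat it as illustrative rather
than a drop-in plugin.

#!/usr/bin/env python3
# Minimal Nagios-style NVMe wearout check (illustrative sketch).
# Assumes nvme-cli is installed and that its JSON smart-log output
# contains a "percentage_used" field; adjust to your nvme-cli version.
import json
import subprocess
import sys

DEVICE = "/dev/nvme0"  # placeholder device path, adjust per host
WARN, CRIT = 70, 90    # wearout thresholds in percent

try:
    out = subprocess.check_output(
        ["nvme", "smart-log", DEVICE, "--output-format=json"])
    used = int(json.loads(out)["percentage_used"])
except Exception as exc:
    print(f"UNKNOWN: could not read smart-log from {DEVICE}: {exc}")
    sys.exit(3)

if used >= CRIT:
    print(f"CRITICAL: {DEVICE} wearout at {used}%")
    sys.exit(2)
if used >= WARN:
    print(f"WARNING: {DEVICE} wearout at {used}%")
    sys.exit(1)
print(f"OK: {DEVICE} wearout at {used}%")
sys.exit(0)

Run one instance per NVMe device from your monitoring system; the exit
codes follow the usual OK/WARNING/CRITICAL/UNKNOWN convention (0/1/2/3).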