Re: dense storage nodes

Hello,

On Wed, 18 May 2016 12:32:25 -0400 Benjeman Meekhof wrote:

> Hi Lionel,
> 
> These are all very good points we should consider, thanks for the
> analysis.  Just a couple clarifications:
> 
> - NVMe in this system are actually slotted in hot-plug front bays so a
> failure can be swapped online.  However I do see your point about this
> otherwise being a non-optimal config.
> 
What NVMes are these exactly? DC P3700?
With Intel you can pretty much rely on them not to die before their time
is up, so monitor wearout levels religiously and automatically (nagios
etc).
At a low node count like yours it is understandable not to want to lose
15 OSDs because an NVMe failed, but as Lionel said, neither your
performance nor your cost is ideal.
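
To make the "monitor wearout religiously and automatically" point concrete,
here is a minimal sketch of a Nagios-style check that parses the
`percentage_used` field from `nvme smart-log` text output. The field name
matches nvme-cli's output; the warn/crit thresholds are my own assumptions,
so tune them for your fleet:

```python
# Sketch: turn `nvme smart-log /dev/nvme0` text into a nagios-style status.
# Thresholds (70%/90% wear) are assumptions, not recommendations.
import re

def wearout_status(smart_log: str, warn: int = 70, crit: int = 90) -> str:
    """Return OK/WARNING/CRITICAL/UNKNOWN from nvme smart-log text."""
    m = re.search(r"percentage_used\s*:\s*(\d+)%", smart_log)
    if not m:
        return "UNKNOWN"
    used = int(m.group(1))
    if used >= crit:
        return "CRITICAL"
    if used >= warn:
        return "WARNING"
    return "OK"

# Example against a sample smart-log line:
print(wearout_status("percentage_used : 12%"))  # OK
```

In practice you would feed this the output of `nvme smart-log` (or the
equivalent smartctl attribute) from cron or your monitoring agent.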

I guess you're happy with what you have, but as I mentioned earlier in
this thread about RAIDed OSDs, there is a chassis that does basically
what you have while saving 1U:
https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm

It can also optionally take 6 hot-swappable NVMes.

> - Our 20 physical cores come out to be 40 HT cores to the system which
> we are hoping is adequate to do 60 OSD without raid devices.  My
> experiences in other contexts lead me to believe a hyper-threaded core
> is pretty well the same as a phys core (perhaps with some exceptions
> depending on specific cases).
> 
It all depends. If you had no SSD journals at all, I'd say you could
scrape by, barely.
With NVMes for journals, especially if you decide to use them
individually with 15 OSDs per NVMe, I'd expect CPU to become the
bottleneck when dealing with a high number of small IOPS.
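
A quick back-of-envelope check (my rule of thumb, not a benchmark: roughly
one core or thread per OSD is a common sizing target for small-I/O
workloads):

```python
# How thin are the CPU threads spread across the OSDs in this build?
phys_cores = 20            # 2 x E5-2650 v3
ht_threads = 2 * phys_cores  # 40 with hyper-threading
osds = 60

threads_per_osd = ht_threads / osds
print(f"{threads_per_osd:.2f} HT threads per OSD")  # 0.67
```

At two-thirds of a thread per OSD you are below the usual one-per-OSD
target before any journal or network interrupt load is counted.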

Regards,

Christian
> regards,
> Ben
> 
> On Wed, May 18, 2016 at 12:02 PM, Lionel Bouton
> <lionel+ceph@xxxxxxxxxxx> wrote:
> > Hi,
> >
> > I'm not yet familiar with Jewel, so take this with a grain of salt.
> >
> >> On 18/05/2016 16:36, Benjeman Meekhof wrote:
> >> We're in process of tuning a cluster that currently consists of 3
> >> dense nodes with more to be added.  The storage nodes have spec:
> >> - Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
> >> - 384 GB RAM
> >> - 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
> >> LSI 9207-8e SAS 6Gbps
> >
> > I'm not sure if 20 cores is enough for 60 OSDs on Jewel. With Firefly I
> > think your performance would be limited by the CPUs but Jewel is faster
> > AFAIK.
> > That said you could setup the 60 disks as RAID arrays to limit the
> > number of OSDs. This can be tricky but some people have reported doing
> > so successfully (IIRC using RAID5 in order to limit both the number of
> > OSDs and the rebalancing events when a disk fails).
> >
> >> - XFS filesystem on OSD data devs
> >> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
> >> raid-1 device)
> >
> > Your disks are rated at a maximum of ~200 MB/s, so even with a
> > conservative 100-150 MB/s estimate, for 30 disks you'd need a write
> > bandwidth of 3 GB/s to 4.5 GB/s on each NVMe. Your NVMes will die
> > twice as fast, as they will take twice the amount of writes in RAID1.
> > The alternative - using the NVMes directly for journals - will get
> > better performance and fewer failures. The only drawback is that an
> > NVMe failing entirely (I'm not familiar with NVMe, but with SSDs you
> > often get write errors affecting a single OSD before a whole device
> > failure) will bring down 15 OSDs at once. Note that replacing an NVMe
> > usually means stopping the whole node when not using hotplug PCIe, so
> > not losing the journals when one fails may not gain you as much as
> > anticipated if the cluster must rebalance anyway during the
> > maintenance operation where you replace the faulty NVMe (and might
> > perform other upgrades/swaps that were waiting).
> >
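The journal bandwidth estimate quoted above works out as follows (a quick
sketch using Lionel's own conservative per-disk figures):

```python
# 30 spinners funnel their journal writes through one mdraid-1 NVMe pair.
disks_per_journal = 30
for per_disk_mb in (100, 150):  # conservative MB/s per 8TB HGST disk
    total_gb = disks_per_journal * per_disk_mb / 1000
    print(f"{total_gb} GB/s")  # 3.0 GB/s, then 4.5 GB/s
```

A single 400GB NVMe cannot sustain anywhere near that, which is the core
of the objection to the RAID1 journal layout.
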
> >> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb
> >
> > Seems adequate although more bandwidth could be of some benefit.
> >
> > This is a total of ~12GB/s full duplex. If Ceph is able to use the
> > whole disk bandwidth you will saturate this: if you get a hotspot on
> > one node with a client capable of writing at 12GB/s on it and have a
> > replication size of 3, you will get only half of this (as twice this
> > amount will be sent on replicas). So ideally you would have room for
> > twice the client bandwidth on the cluster network. In my experience
> > this isn't a problem (hot spots like this almost never happen as
> > client write traffic is mostly distributed evenly on nodes) but having
> > the headroom avoids the risk of atypical access patterns becoming a
> > problem so it seems like a good thing if it doesn't cost too much.
> > Note that if your total NVMe write bandwidth is more than the total
> > disk bandwidth they act as buffers capable of handling short write
> > bursts (only if there's no read on recent writes which should almost
> > never happen for RBD but might for other uses) so you could limit your
> > ability to handle these.
> >
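Lionel's "half of this" figure can be checked with a little arithmetic (a
sketch; the link count and replication size are taken from the thread):

```python
links, gbit_per_link = 4, 25
total_gbps = links * gbit_per_link   # 100 Gb/s aggregate
total_GBps = total_gbps / 8          # 12.5 GB/s, full duplex

replication = 3
# Each client byte written is re-sent to (replication - 1) replicas, so
# replica traffic is twice the client traffic on the same links.
client_GBps = total_GBps / (replication - 1)
print(client_GBps)  # 6.25
```

So of the ~12 GB/s of raw link capacity, only about half is available to
clients on a node handling size-3 replication traffic.
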
> > Best regards,
> >
> > Lionel
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/