Re: dense storage nodes

Christian Balzer <chibi@xxxxxxx> · Sat, 21 May 2016 13:47:26 +0900

On Thu, 19 May 2016 10:26:37 -0400 Benjeman Meekhof wrote:

> Hi Christian,
> 
> Thanks for your insights.  To answer your question the NVMe devices
> appear to be some variety of Samsung:
> 
> Model: Dell Express Flash NVMe 400GB
> Manufacturer: SAMSUNG
> Product ID: a820
> 

Alright, these appear to be 7 DWPD devices (hidden in the PDF text, not
the feature table, grumble), so you should be fine unless your use case is
insanely write heavy. 
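
For a rough sense of what 7 DWPD buys you, here is a minimal Python sketch.
The capacity, DWPD figure and journal counts are the ones from this thread;
the function names are made up for illustration. It assumes decimal units
(1 GB = 1000 MB) and writes spread evenly over the day, which real
workloads won't do.

```python
# Endurance budget sketch. Figures come from this thread; names are illustrative.
CAPACITY_GB = 400                 # per NVMe device
DWPD = 7                          # drive writes per day, per the spec sheet
SECONDS_PER_DAY = 24 * 60 * 60


def daily_write_budget_gb(capacity_gb, dwpd):
    """GB that may be written per day without exceeding the DWPD rating."""
    return capacity_gb * dwpd


def sustained_mb_per_s(capacity_gb, dwpd):
    """The same budget expressed as an average write rate in MB/s."""
    return daily_write_budget_gb(capacity_gb, dwpd) * 1000 / SECONDS_PER_DAY


def per_journal_mb_per_s(capacity_gb, dwpd, journals_per_device):
    """Average write rate each journal may use if the budget is split evenly."""
    return sustained_mb_per_s(capacity_gb, dwpd) / journals_per_device


budget_gb = daily_write_budget_gb(CAPACITY_GB, DWPD)
average_rate = sustained_mb_per_s(CAPACITY_GB, DWPD)
# In RAID1 every member sees the writes of all 30 journals on that mirror.
raid1_share = per_journal_mb_per_s(CAPACITY_GB, DWPD, 30)
# Used individually, each NVMe carries 15 journals.
single_share = per_journal_mb_per_s(CAPACITY_GB, DWPD, 15)

print(f"daily budget per device : {budget_gb} GB")
print(f"average write rate      : {average_rate:.1f} MB/s")
print(f"per journal, RAID1 (30) : {raid1_share:.2f} MB/s")
print(f"per journal, single (15): {single_share:.2f} MB/s")
```

That is about 2.8TB per device per day, or roughly 32MB/s averaged around
the clock.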

Since they're Samsung sourced you probably have these SMART attributes:

177 Wear_Leveling_Count 
179 Used_Rsvd_Blk_Cnt_Tot

to watch.
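
If you want to automate that, a minimal sketch of picking those two
attributes out of a smartctl-style attribute table could look like the
following. The sample text is made up for illustration, not output from
these devices, and whether they expose this ATA-style table at all is an
assumption. The function names and the margin are likewise hypothetical.

```python
# Sketch: extract the two watched attributes from an ATA-style SMART table.
# SAMPLE is invented for illustration; it is not real output from these NVMes.
WATCHED = {177: "Wear_Leveling_Count", 179: "Used_Rsvd_Blk_Cnt_Tot"}

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
177 Wear_Leveling_Count     0x0013   098   098   005    Pre-fail  Always       -       42
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
"""


def parse_smart_attributes(text):
    """Return {id: {name, value, thresh, raw}} for the watched attribute IDs."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in WATCHED:
            result[attr_id] = {
                "name": fields[1],
                "value": int(fields[3]),    # normalized value, counts down
                "thresh": int(fields[5]),   # vendor failure threshold
                "raw": fields[9],
            }
    return result


def worn(attrs, margin=10):
    """Names of attributes whose normalized value is within margin of the threshold."""
    return [a["name"] for a in attrs.values() if a["value"] <= a["thresh"] + margin]


attrs = parse_smart_attributes(SAMPLE)
print(attrs)
print("needs attention:", worn(attrs))
```

In practice you would feed it the real tool output and hook the result
into whatever monitoring you already run.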

At 1.4GB/s writes per that spec sheet you have at most 2.8GB/s write
capacity to your journals in your current RAID1 setup.

Contrast that with your network speed of 5GB/s (or 10GB/s if active-active
bonded links).
4 individual NVMes with 5.6GB/s write capacity are obviously a much better
fit for you.
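
The comparison above, spelled out as a small Python sketch. The 1.4GB/s
per-device figure and the 5GB/s and 10GB/s network figures are the ones
quoted in this mail; the function names are made up for illustration, and
it assumes each RAID1 pair delivers the write bandwidth of a single member.

```python
# Sketch: journal write capacity vs. network ingest, using figures from this mail.
NVME_WRITE_GBS = 1.4      # sequential write per device, per the spec sheet
NVME_COUNT = 4


def raid1_pairs_capacity(devices, per_device_gbs):
    """Mirrors write every byte to both members, so a pair gives one device's worth."""
    return (devices // 2) * per_device_gbs


def individual_capacity(devices, per_device_gbs):
    """Devices used on their own add up their write bandwidth."""
    return devices * per_device_gbs


def bottleneck(network_gbs, journal_gbs):
    """The slower of the two limits how fast writes can land on the node."""
    return min(network_gbs, journal_gbs)


raid1_gbs = raid1_pairs_capacity(NVME_COUNT, NVME_WRITE_GBS)
indiv_gbs = individual_capacity(NVME_COUNT, NVME_WRITE_GBS)

for network_gbs in (5.0, 10.0):
    print(f"network {network_gbs:4.1f} GB/s: "
          f"RAID1 limit {bottleneck(network_gbs, raid1_gbs):.1f} GB/s, "
          f"individual limit {bottleneck(network_gbs, indiv_gbs):.1f} GB/s")
```

With RAID1 the journals are the limit in both network cases; used
individually they only become the limit with the bonded links.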

Regards,

Christian

> regards,
> Ben
> 
> On Wed, May 18, 2016 at 10:01 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Wed, 18 May 2016 12:32:25 -0400 Benjeman Meekhof wrote:
> >
> >> Hi Lionel,
> >>
> >> These are all very good points we should consider, thanks for the
> >> analysis.  Just a couple clarifications:
> >>
> >> - NVMe in this system are actually slotted in hot-plug front bays so a
> >> failure can be swapped online.  However I do see your point about this
> >> otherwise being a non-optimal config.
> >>
> > What NVMes are these exactly? DC P3700?
> > With Intel you can pretty much rely on them not to die before their
> > time is up, so monitor wearout levels religiously and automatically
> > (nagios etc).
> > At a low node count like yours it is understandable to not want to
> > lose 15 OSDs because an NVMe failed, but your performance and cost are
> > both not ideal as Lionel said.
> >
> > I guess you're happy with what you have, but as I mentioned in this
> > thread also about RAIDed OSDs, there is a chassis that does basically
> > what you have while saving 1U:
> > https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm
> >
> > It can also optionally take 6 hot-swappable NVMes.
> >
> >> - Our 20 physical cores come out to be 40 HT cores to the system which
> >> we are hoping is adequate to do 60 OSD without raid devices.  My
> >> experiences in other contexts lead me to believe a hyper-threaded core
> >> is pretty well the same as a phys core (perhaps with some exceptions
> >> depending on specific cases).
> >>
> > It all depends, if you had no SSD journals at all I'd say you could
> > scrape by, barely.
> > With NVMes for journals, especially if you should decide to use them
> > individually with 15 OSDs per NVMe, I'd expect CPU to become the
> > bottleneck when dealing with a high number of small IOPS.
> >
> > Regards,
> >
> > Christian
> >> regards,
> >> Ben
> >>
> >> On Wed, May 18, 2016 at 12:02 PM, Lionel Bouton
> >> <lionel+ceph@xxxxxxxxxxx> wrote:
> >> > Hi,
> >> >
> >> > I'm not yet familiar with Jewel, so take this with a grain of salt.
> >> >
> >> > On 18/05/2016 16:36, Benjeman Meekhof wrote:
> >> >> We're in the process of tuning a cluster that currently consists of 3
> >> >> dense nodes with more to be added.  The storage nodes have spec:
> >> >> - Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
> >> >> - 384 GB RAM
> >> >> - 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via
> >> >> 2 x LSI 9207-8e SAS 6Gbps
> >> >
> >> > I'm not sure if 20 cores is enough for 60 OSDs on Jewel. With
> >> > Firefly I think your performance would be limited by the CPUs but
> >> > Jewel is faster AFAIK.
> >> > That said, you could set up the 60 disks as RAID arrays to limit the
> >> > number of OSDs. This can be tricky but some people have reported
> >> > doing so successfully (IIRC using RAID5 in order to limit both the
> >> > number of OSDs and the rebalancing events when a disk fails).
> >> >
> >> >> - XFS filesystem on OSD data devs
> >> >> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30
> >> >> per raid-1 device)
> >> >
> >> > Your disks are rated at a maximum of ~200MB/s so even with a
> >> > conservative 100-150MB/s estimate, for 30 disks you'd need a write
> >> > bandwidth of 3GB/s to 4.5GB/s on each NVMe. Your NVMe will die
> >> > twice as fast as they will take twice the amount of writes in
> >> > RAID1. The alternative - using NVMe directly for journals - will
> >> > get better performance and have fewer failures. The only drawback is
> >> > that an NVMe failing entirely (I'm not familiar with NVMe but with
> >> > SSD you often get write errors affecting a single OSD before a
> >> > whole device failure) will bring down 15 OSDs at once. Note that
> >> > replacing NVMe usually means stopping the whole node when not using
> >> > hotplug PCIe, so not losing the journals when one fails may not
> >> > gain you as much as anticipated if the cluster must rebalance
> >> > anyway during the maintenance operation where you replace the
> >> > faulty NVMe (and might perform other upgrades/swaps that were
> >> > waiting).
> >> >
> >> >> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb
> >> >
> >> > Seems adequate although more bandwidth could be of some benefit.
> >> >
> >> > This is a total of ~12GB/s full duplex. If Ceph is able to use the
> >> > whole disk bandwidth you will saturate this: if you get a hotspot
> >> > on one node with a client capable of writing at 12GB/s on it and
> >> > have a replication size of 3, you will get only half of this (as
> >> > twice this amount will be sent on replicas). So ideally you would
> >> > have room for twice the client bandwidth on the cluster network. In
> >> > my experience this isn't a problem (hot spots like this almost
> >> > never happen as client write traffic is mostly distributed evenly
> >> > on nodes) but having the headroom avoids the risk of atypical
> >> > access patterns becoming a problem so it seems like a good thing if
> >> > it doesn't cost too much. Note that if your total NVMe write
> >> > bandwidth is more than the total disk bandwidth they act as buffers
> >> > capable of handling short write bursts (only if there's no read on
> >> > recent writes which should almost never happen for RBD but might
> >> > for other uses) so you could limit your ability to handle these.
> >> >
> >> > Best regards,
> >> >
> >> > Lionel
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/