Re: dense storage nodes

Hi Christian,

Thanks for your insights.  To answer your question, the NVMe devices
appear to be some variety of Samsung:

Model: Dell Express Flash NVMe 400GB
Manufacturer: SAMSUNG
Product ID: a820

regards,
Ben

On Wed, May 18, 2016 at 10:01 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Wed, 18 May 2016 12:32:25 -0400 Benjeman Meekhof wrote:
>
>> Hi Lionel,
>>
>> These are all very good points we should consider, thanks for the
>> analysis.  Just a couple clarifications:
>>
>> - NVMe in this system are actually slotted in hot-plug front bays so a
>> failure can be swapped online.  However I do see your point about this
>> otherwise being a non-optimal config.
>>
> What NVMes are these exactly? DC P3700?
> With Intel you can pretty much rely on them not to die before their time
> is up, so monitor wearout levels religiously and automatically (nagios
> etc).
> At a low node count like yours it is understandable not to want to lose
> 15 OSDs because an NVMe failed, but as Lionel said, your performance and
> cost are both not ideal.
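>
> For example, a rough sketch of such an automated check (just an
> illustration: it assumes nvme-cli is installed and run as root, and the
> device paths, thresholds and Nagios-style exit codes are placeholders to
> adapt):
>
> #!/usr/bin/env python3
> # Minimal sketch: read "percentage_used" from "nvme smart-log" and exit
> # with Nagios-style codes.  Adjust DEVICES and thresholds to your setup.
> import subprocess
> import sys
>
> DEVICES = ["/dev/nvme0", "/dev/nvme1"]  # placeholder: your journal NVMes
> WARN, CRIT = 70, 90                     # placeholder wearout thresholds (%)
>
> def percentage_used(dev):
>     # "nvme smart-log" prints a line like "percentage_used : 3%"
>     out = subprocess.check_output(["nvme", "smart-log", dev], text=True)
>     for line in out.splitlines():
>         if line.lower().startswith("percentage_used"):
>             return int(line.split(":")[1].strip().rstrip("%"))
>     raise RuntimeError("percentage_used not found for " + dev)
>
> worst = max(percentage_used(d) for d in DEVICES)
> if worst >= CRIT:
>     print("CRITICAL: NVMe wearout at %d%%" % worst)
>     sys.exit(2)
> if worst >= WARN:
>     print("WARNING: NVMe wearout at %d%%" % worst)
>     sys.exit(1)
> print("OK: NVMe wearout at %d%%" % worst)
> sys.exit(0)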
>
> I guess you're happy with what you have, but as I mentioned earlier in
> this thread about RAIDed OSDs, there is a chassis that does basically
> what you have now while saving 1U:
> https://www.supermicro.com.tw/products/system/4U/6048/SSG-6048R-E1CR60N.cfm
>
> This can also optionally take 6 NVMes, hot-swappable.
>
>> - Our 20 physical cores present 40 HT cores to the system, which we are
>> hoping is adequate for 60 OSDs without RAID devices.  My experience in
>> other contexts leads me to believe a hyper-threaded core is pretty much
>> the same as a physical core (perhaps with some exceptions depending on
>> specific cases).
>>
> It all depends; if you had no SSD journals at all, I'd say you could
> scrape by, barely.
> With NVMes for journals, especially if you decide to use them
> individually with 15 OSDs per NVMe, I'd expect the CPUs to become the
> bottleneck when dealing with a high number of small IOPS.
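>
> To put rough numbers on it: an often-quoted rule of thumb is on the
> order of 1 core (or ~1GHz) per OSD for small-IO workloads. 60 OSDs on
> 2 x 10 cores at 2.3GHz works out to roughly 0.77GHz per OSD before
> counting anything else running on the box, so there is little headroom.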
>
> Regards,
>
> Christian
>> regards,
>> Ben
>>
>> On Wed, May 18, 2016 at 12:02 PM, Lionel Bouton
>> <lionel+ceph@xxxxxxxxxxx> wrote:
>> > Hi,
>> >
>> > I'm not yet familiar with Jewel, so take this with a grain of salt.
>> >
>> > On 18/05/2016 16:36, Benjeman Meekhof wrote:
>> >> We're in process of tuning a cluster that currently consists of 3
>> >> dense nodes with more to be added.  The storage nodes have spec:
>> >> - Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
>> >> - 384 GB RAM
>> >> - 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
>> >> LSI 9207-8e SAS 6Gbps
>> >
>> > I'm not sure if 20 cores is enough for 60 OSDs on Jewel. With Firefly I
>> > think your performance would be limited by the CPUs, but Jewel is faster
>> > AFAIK.
>> > That said, you could set up the 60 disks as RAID arrays to limit the
>> > number of OSDs. This can be tricky, but some people have reported doing
>> > so successfully (IIRC using RAID5 in order to limit both the number of
>> > OSDs and the rebalancing events when a disk fails).
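>> >
>> > To illustrate with made-up numbers: 60 disks arranged as 12 x 5-disk
>> > RAID5 arrays would present only 12 OSDs of ~32TB each to Ceph, at the
>> > cost of one disk of parity capacity per array plus the usual RAID5
>> > write penalty, and a failed disk would be rebuilt by mdraid or the
>> > controller instead of triggering a Ceph rebalance.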
>> >
>> >> - XFS filesystem on OSD data devs
>> >> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
>> >> raid-1 device)
>> >
>> > Your disks are rated at a maximum of ~200MB/s, so even with a
>> > conservative estimate of 100-150MB/s, for 30 disks you'd need a write
>> > bandwidth of 3GB/s to 4.5GB/s on each NVMe. Your NVMes will also die
>> > twice as fast, as they take twice the amount of writes in RAID1. The
>> > alternative - using the NVMes directly for journals - will get better
>> > performance and have fewer failures. The only drawback is that an NVMe
>> > failing entirely (I'm not familiar with NVMe, but with SSDs you often
>> > get write errors affecting a single OSD before a whole-device failure)
>> > will bring down 15 OSDs at once. Note that replacing an NVMe usually
>> > means stopping the whole node when not using hotplug PCIe, so not
>> > losing the journals when one fails may not gain you as much as
>> > anticipated if the cluster must rebalance anyway during the maintenance
>> > operation where you replace the faulty NVMe (and might perform other
>> > upgrades/swaps that were waiting).
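>> >
>> > Spelled out: 30 disks x 100-150MB/s is 3-4.5GB/s of journal writes
>> > hitting each RAID-1 device, and mirroring sends that full stream to
>> > both members, so each of the 4 NVMes sees the whole 3-4.5GB/s. Used
>> > individually, each NVMe would only carry the journals of 15 disks,
>> > i.e. roughly 1.5-2.25GB/s.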
>> >
>> >> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb)
>> >
>> > Seems adequate although more bandwidth could be of some benefit.
>> >
>> > This is a total of ~12GB/s full duplex. If Ceph is able to use the
>> > whole disk bandwidth you will saturate this: if you get a hotspot on
>> > one node, with a client capable of writing at 12GB/s to it and a
>> > replication size of 3, you will only get half of this (as twice that
>> > amount has to be sent on to the replicas). So ideally you would have
>> > room for twice the client bandwidth on the cluster network. In my
>> > experience this isn't a problem (hotspots like this almost never
>> > happen, as client write traffic is mostly distributed evenly across
>> > nodes), but having the headroom avoids the risk of atypical access
>> > patterns becoming a problem, so it seems like a good thing if it
>> > doesn't cost too much.
>> > Note that if your total NVMe write bandwidth is higher than the total
>> > disk bandwidth, the journals act as buffers capable of absorbing short
>> > write bursts (only if there are no reads of recent writes, which should
>> > almost never happen for RBD but might for other uses), so an undersized
>> > network could limit your ability to handle these bursts.
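>> >
>> > The arithmetic: 4 x 25Gb/s = 100Gb/s, i.e. roughly 12.5GB/s in each
>> > direction. With a replication size of 3, a node taking client writes
>> > at rate X also has to send about 2X out to the replicas, so the
>> > outbound side caps X at roughly half the link bandwidth, ~6GB/s.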
>> >
>> > Best regards,
>> >
>> > Lionel
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



