Hi, I'm not yet familiar with Jewel, so take this with a grain of salt.

On 18/05/2016 16:36, Benjeman Meekhof wrote:
> We're in the process of tuning a cluster that currently consists of 3
> dense nodes with more to be added. The storage nodes have spec:
> - Dell R730xd 2 x Xeon E5-2650 v3 @ 2.30GHz (20 phys cores)
> - 384 GB RAM
> - 60 x 8TB HGST HUH728080AL5204 in MD3060e enclosure attached via 2 x
>   LSI 9207-8e SAS 6Gbps

I'm not sure 20 cores are enough for 60 OSDs on Jewel. With Firefly I
think your performance would be limited by the CPUs, but Jewel is faster
AFAIK. That said, you could set up the 60 disks as RAID arrays to limit
the number of OSDs. This can be tricky, but some people have reported
doing it successfully (IIRC using RAID5, in order to limit both the
number of OSDs and the rebalancing events when a disk fails).

> - XFS filesystem on OSD data devs
> - 4 x 400GB NVMe arranged into 2 mdraid devices for journals (30 per
>   raid-1 device)

Your disks are rated at a maximum of ~200MB/s, so even with a
conservative estimate of 100-150MB/s per disk, journaling 30 disks
needs a write bandwidth of 3GB/s to 4.5GB/s on each NVMe. Your NVMe
devices will also die twice as fast, as they take twice the amount of
writes in RAID1.

The alternative - using the NVMe devices directly for journals - will
get better performance and fewer failures. The only drawback is that an
NVMe failing entirely (I'm not familiar with NVMe, but with SSDs you
often get write errors affecting a single OSD before a whole-device
failure) will bring down 15 OSDs at once. Note that replacing an NVMe
usually means stopping the whole node when not using hotplug PCIe, so
not losing the journals when one fails may not gain you as much as
anticipated if the cluster must rebalance anyway during the maintenance
operation where you replace the faulty NVMe (and might perform other
upgrades/swaps that were waiting).

> - 2 x 25Gb Mellanox ConnectX-4 Lx dual port (4 x 25Gb)

Seems adequate, although more bandwidth could be of some benefit.
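To put rough numbers on the journal and network sizing above, here is a
quick back-of-envelope sketch (the per-disk write rates, 30 disks per
journal device, 4 x 25Gb links and replication size 3 are the figures
from this thread, not measurements):

```python
# Journal side: each RAID-1 member absorbs the full write stream, so
# RAID-1 doubles total NVMe wear without raising per-device throughput.
disk_mbps_low, disk_mbps_high = 100, 150   # conservative per-disk estimate
disks_per_journal = 30                     # OSD journals per RAID-1 device

low_gbs = disks_per_journal * disk_mbps_low / 1000    # 3.0 GB/s
high_gbs = disks_per_journal * disk_mbps_high / 1000  # 4.5 GB/s
print(f"journal write bandwidth needed: {low_gbs:.1f}-{high_gbs:.1f} GB/s")

# Network side: 4 x 25 Gb/s = 100 Gb/s, i.e. ~12.5 GB/s each direction.
net_gbs = 4 * 25 / 8
# With replication size 3, each GB a client writes to a node triggers
# ~2 GB of outbound replica traffic, so sustained client writes on a
# single hot node top out at roughly half the link bandwidth.
print(f"link bandwidth: {net_gbs:.1f} GB/s, "
      f"max client write on a hot node: ~{net_gbs / 2:.2f} GB/s")
```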
This is a total of ~12GB/s full duplex. If Ceph is able to use the whole
disk bandwidth you will saturate this: if you get a hotspot on one node,
with a client capable of writing at 12GB/s to it and a replication size
of 3, you will only get half of this (as twice that amount will be sent
out to the replicas). So ideally you would have room for twice the
client bandwidth on the cluster network. In my experience this isn't a
problem (hotspots like this almost never happen, as client write traffic
is mostly distributed evenly across nodes), but having the headroom
avoids the risk of atypical access patterns becoming a problem, so it
seems like a good thing if it doesn't cost too much.

Note that if your total NVMe write bandwidth is more than the total disk
bandwidth, the journals act as buffers capable of absorbing short write
bursts (only if there are no reads of recent writes, which should almost
never happen for RBD but might for other uses), so halving the usable
NVMe bandwidth with RAID1 could limit your ability to handle these
bursts.

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com