Re: Choosing HP SATA or SAS SSDs for journals

On Wed, 4 Nov 2015 15:33:16 +0100 Karsten Heymann wrote:

> Hi,
> 
> 2015-11-04 15:16 GMT+01:00 Christian Balzer <chibi@xxxxxxx>:
> > On Wed, 4 Nov 2015 12:03:51 +0100 Karsten Heymann wrote:
> >> I'm currently planning to use dl380 with 26 (24 at the front, two for
> >> system disks at the back) 2,5"-slots, from which roughly 2/3 are
> >> intended for osd drives, the rest for system and journal disks.
> >>
> > That's a pretty dense configuration, how many nodes do you plan to
> > deploy initially?
> 
> Somewhere between 5 and 10 nodes initially.
> 
The more nodes, the better the performance, especially when one node goes
down. Can you afford to lose 20% of your IOPS if one of 5 nodes fails?
Never mind the resulting IO storm from re-balancing the data, which you
can avoid by configuring Ceph correctly
(mon_osd_down_out_subtree_limit = host).
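
For reference, that would look something like the snippet below in
ceph.conf (a minimal sketch, set on the monitors):

    [mon]
    # With this set to host, OSDs are not marked "out" automatically when
    # an entire host goes down, so losing or rebooting a node does not
    # immediately trigger a full re-balance.
    mon_osd_down_out_subtree_limit = host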


> > What network infrastructure?
> 
> At least 2x 10GB/s Ethernet, probably 4x (2x client-facing, 2x
> intra-cluster).
> 

With the HW you have in mind, your nodes will be capable of about 1.8GB/s
of writes and disk reads (roughly in line with 18 spinners at ~100MB/s
each); reads from the pagecache are of course faster.

So yes, 2x10Gb/s makes sense; however, keep redundancy in mind, too.

1. 2 links, LACP
2. 2 links, failover
3. 2 links, vLAG/TRILL
4. 4 links (2 bonded pairs, failover)

Option 1 will give you full speed, but if a switch fails, half of your
nodes will be dead. Not a good situation, even if your CRUSH map reflects
this and all data is still available.

Option 2 gives you less than full speed, but OTOH you don't have to
worry about a single switch failure.
My nodes are capable of about 1GB/s, so I went for an active/standby
option with Infiniband (IPoIB).

Option 3 (Brocade and other switches support this) will give you full
speed when both switches are up and still work at half speed if one fails.
This is the best choice if you can afford those switches and have a
network team comfortable with them.

Option 4 will give you full speed and redundancy, but at the cost of 4
ports per node.
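
If you go with plain Linux bonding for options 1 or 2, the difference is
essentially just the bond mode. A minimal sketch for a Debian-style
/etc/network/interfaces (interface names and the address are made-up
placeholders):

    auto bond0
    iface bond0 inet static
        address 192.168.0.11
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-miimon 100
        # option 1: 802.3ad (LACP, needs matching switch config)
        # option 2: active-backup (failover, works with any switches)
        bond-mode active-backup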

Note that splitting things up into a client (public) and a replication
(private) network makes little sense in many cases, namely when you are
trying to get the most speed to and from the clients.
This of course also depends on your network design and infrastructure:
are your clients plugged into the same switches, and if not, what is the
switch interconnection capacity?
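
If you do decide to split them, it is only two lines in ceph.conf; the
subnets below are placeholders for whatever you actually use:

    [global]
    # client-facing traffic
    public network = 192.168.1.0/24
    # replication/recovery traffic
    cluster network = 192.168.2.0/24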


Christian

> > Check the archives for previous threads, I would allocate about 2 GHz
> > of CPU per OSD...
> 
> That fits with the cpus I chose.
> 
> >> So 18 spinning drives, 6 SSD (model #1) and two system disks seem to
> >> be at least a reasonable choice for a setup to start with?
> >>
> > Yes.
> > Note that in my example below the system disks are a RAID10 of the 4
> > SSDs, with raw partitions for the journals.
> 
> Interesting setup.
> 
> >> 200GB are the smallest enterprise drives HP sells for current server
> >> generations.
> >>
> > Yeah, but when you look at the Intel DC S37xx drives for example, the
> > older (more parallel) SSDs are actually faster at smaller sizes than the
> > new ones.
> 
> I think I have to stick to what HP offers.
> 
> Thanks a lot,
> Karsten
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


