advice on hardware configuration

On Wed, 07 May 2014 11:01:33 +0200 Xabier Elkano wrote:

> > On 06/05/14 18:40, Christian Balzer wrote:
> > Hello,
> >
> > On Tue, 06 May 2014 17:07:33 +0200 Xabier Elkano wrote:
> >
> >> Hi,
> >>
> >> I'm designing a new ceph pool with new hardware and I would like to
> >> receive some suggestion.
> >> I want to use a replica count of 3 in the pool and the idea is to buy
> >> 3 new servers, each with a 10-drive 2.5" chassis and 2x 10Gbps NICs. I
> >> have in mind two configurations:
> >>
> > As Wido said, more nodes are usually better, unless you're quite aware
> > of what you're doing and why.
> Yes, I know that, but what is the minimum number of nodes to start with?
> Isn't starting with three nodes a feasible option?

I've started a cluster with 2 nodes and feel that in my case it is a very
feasible option, as the OSDs are really RAIDs and thus will basically never
fail, and the IO load will still be manageable by one surviving storage
node.

You need to fully understand what happens when you lose one node (and
when it comes back) and whether the consequences are acceptable to you.
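For what it is worth, a minimal sketch of the knobs involved (the option
names are real, the values are just examples): you can keep the cluster
from re-replicating everything onto the surviving node(s) when a whole
host goes down, either temporarily per maintenance window or permanently
via ceph.conf:

    # before planned work on a node, stop OSDs from being marked out:
    ceph osd set noout
    # ... do the maintenance, bring the node back, let recovery finish, then:
    ceph osd unset noout

    # or permanently, on the monitors:
    [mon]
    mon osd down out subtree limit = host

With that in place, losing a node leaves the cluster degraded but does not
trigger a full backfill of all its data onto whatever is left.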

That same cluster I've built with 2 high-density nodes would have been 7
lower-density nodes if done the "Ceph" way.
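As for starting with three nodes and a replica count of 3, the relevant
defaults look roughly like this (a sketch, the pool name is a placeholder):

    [global]
    osd pool default size = 3        # three copies of every object
    osd pool default min size = 2    # keep serving I/O with one node down
    osd crush chooseleaf type = 1    # 1 = host, never two copies on one node

    # or adjusted on an existing pool:
    ceph osd pool set <pool> size 3
    ceph osd pool set <pool> min_size 2

With exactly three hosts and size 3, every host holds one copy, so any
single node failure still leaves two copies online.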

> >  
> >> 1- With journal in SSDs
> >>  
> >> OS: 2x SSD Intel SC3500 100G RAID 1
> >> Journal: 2x SSD Intel SC3700 100G, 3 journals per SSD
> > As I wrote just a moment ago, use at least the 200GB ones if
> > performance is such an issue for you.
> > If you can afford it, use 4 3700s and share OS and journal, the OS IOPS
> > will not be that significant, especially if you're using a writeback
> > cache controller. 
> The journal can be shared with the OS, but I like RAID 1 for the OS.
> I think the only drawback with it is that I am using two dedicated
> disk slots for the OS.

Use software RAID 1 (or 10) on part of the 4 SSDs and put 1-2 journals on
each SSD. Or use 3 SSDs with 7 HDDs and put 2-3 journals on each SSD.
Either way you will have better performance and less impact if an SSD
should fail than with your original design.
A case with 12 drive bays would result in a perfectly equal load
distribution (4 SSDs, 8 HDDs).
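As a concrete illustration of that shared layout (purely hypothetical
device names and sizes; 4 SSDs as /dev/sd[a-d], the HDDs behind them):

    # carve each SSD into one small OS partition and two journal partitions
    for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        sgdisk -n 1:0:+20G -t 1:fd00 "$d"   # md member for the OS
        sgdisk -n 2:0:+10G "$d"             # journal for one OSD
        sgdisk -n 3:0:+10G "$d"             # journal for another OSD
    done

    # software RAID10 across the four OS partitions
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

    # each HDD then gets one of the journal partitions, e.g. with ceph-deploy:
    # ceph-deploy osd prepare node1:/dev/sde:/dev/sda2

That way a single failed SSD takes out at most two OSD journals and one
member of the OS array, not the whole box.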

As an aside, 3500s are pretty much overkill for OS only; 530s should do
fine.

> >
> >> OSD: 6 SAS10K 900G (SAS2 6Gbps), each running an OSD process. Total
> >> size for OSDs: 5.4TB
> >>
> >> 2- With journal in a partition in the spinners.
> >>
> >> OS: 2x SSD Intel SC3500 100G RAID 1
> >> OSD+journal: 8 SAS15K 600G (SAS3 12Gbps), each running an OSD process
> >> and its journal. Total size for OSDs: 3.6TB
> >>
> > I have no idea why anybody would spend money on 12Gb/s HDDs when even
> > most SSDs have trouble saturating a 6Gb/s link.
> > Given the double write penalty in IOPS, I think you're going to find
> > this more expensive (per byte) and slower than a well rounded option 1.
> But I chose these disks because they are 2.5" 15K, not only for the link.
> Other SAS 2.5" (SAS2) disks I found are only 10K. The 15K disks should be
> better for random IOPS.

Interesting, I would have thought 15K drives would be available with SAS2
as well, but all my spinners are basically consumer stuff. ^o^
Either way, you are wasting the link speed and controller price on a 1/3
increase in IOPS, while the double write impact will make the resulting
IOPS per HDD lower than in your option 1.
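Back-of-the-envelope, assuming roughly 150 random IOPS for a 10K drive and
200 for a 15K drive (typical vendor-ish numbers, adjust for your disks):

    option 1: 6 x 10K, journals on SSD   -> ~6 x 150     = ~900 write IOPS
    option 2: 8 x 15K, journals on disk  -> ~8 x 200 / 2 = ~800 write IOPS

so the on-disk journal eats the whole 15K advantage before you even count
the extra seeking between the journal and data areas.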

> >
> >> The budget in both configurations is similar, but the total capacity
> >> is not. What would be the best configuration from the point of view of
> >> performance? In the second configuration I know the controller write-
> >> back cache could be very critical; the servers have an LSI 3108
> >> controller with 2GB cache. I have to plan this storage as a KVM image
> >> backend and the goal is performance over capacity.
> >>
> > Writeback cache can be very helpful; however, it is not a miracle cure.
> > Not knowing your actual load and I/O patterns, it might very well be
> > enough, though.
> The IO patterns are a bit unknown; I should assume 40% read and 60%
> write, but the IO size is unknown because the storage is for KVM images
> and the VMs are for many customers with different purposes.

Ah, general purpose KVM. So you might get lucky or totally insane
customers.
Definitely optimize for speed (as in IOPS) and monitor things constantly.
Be ready to upgrade your cluster at a moment's notice, because once you
reach a threshold it is all downhill from there.
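One way to get a baseline and then watch for that threshold (the pool
name and the numbers here are just examples):

    # small-block write benchmark against a scratch pool
    rados bench -p bench 60 write -b 4096 -t 32

    # and the usual views while it runs / in production
    ceph -s
    ceph osd pool stats
    rados df

Repeat the bench after any change and keep the results, so you can see the
curve flattening before your customers do.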


Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

