CRUSH map advice

Hello,

On Tue, 12 Aug 2014 10:53:21 -0700 Craig Lewis wrote:

> On Mon, Aug 11, 2014 at 11:26 PM, John Morris <john at zultron.com> wrote:
> 
> > On 08/11/2014 08:26 PM, Craig Lewis wrote:
> >
> >> Your MON nodes are separate hardware from the OSD nodes, right?
> >>
> >
> > Two nodes are OSD + MON, plus a separate MON node.
> >
> >
> >  If so,
> >> with replication=2, you should be able to shut down one of the two OSD
> >> nodes, and everything will continue working.
> >>
> >
> > IIUC, the third MON node is sufficient for a quorum if one of the OSD +
> > MON nodes shuts down, is that right?
> >
> 
> So yeah, if you lose any one node, you'll be fine.
> 
> 
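For what it's worth, with three MONs any two form a majority, so losing
one of the OSD+MON nodes still leaves a working quorum. A quick way to
check that from a surviving node (a minimal sketch, assuming the default
cluster name and an admin keyring in place):

  # list the monitors currently in quorum
  ceph quorum_status --format json-pretty
  # or just the short monitor summary
  ceph mon stat
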
> >
> > Replication=2 is a little worrisome, since we've already seen two disks
> > simultaneously fail just in the year the cluster has been running.
> > That statistically unlikely situation is the first and probably last
> > time I'll see that, but they say lightning can strike twice....
> 
> 
> That's a low probability, given the number of disks you have.  I would've
> taken that bet (with backups).  As the number of OSDs goes up, the
> probability of multiple simultaneous failures goes up, and slowly
> becomes a bad bet.
> 

I must be very unlucky then. ^o^ 
As in, I've had dual disk failures in a set of 8 disks 3 times now
(within the last 6 years). 
And twice that led to data loss, once with RAID5 (no surprise there) and
once with RAID10 (an unlucky failure of neighboring disks).
Granted, that was with consumer HDDs, and the last time with rather
well-aged ones, too. But there you go.

As for backups, those are for when somebody does something stupid and
deletes stuff they shouldn't have. 
A storage system should be a) up all the time and b) not lose data.

> 
> 
> >
> >
> >  Since it's for
> >> experimentation, I wouldn't deal with the extra hassle of
> >> replication=4 and custom CRUSH rules to make it work.  If you have
> >> your heart set on that, it should be possible.  I'm no CRUSH expert
> >> though, so I can't say for certain until I've actually done it.
> >>
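As an aside, the CRUSH rule for "4 copies, 2 per host, across 2 hosts"
is not that exotic. A sketch in decompiled crushmap syntax (the rule
name and ruleset number are made up, and I haven't tested this exact
rule):

  rule replicated_2x2 {
          ruleset 1
          type replicated
          min_size 4
          max_size 4
          step take default
          step choose firstn 2 type host
          step chooseleaf firstn 2 type osd
          step emit
  }

Roughly: pull the map, add the rule, compile and inject it, then point
the pool at it and raise the size:

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # edit crushmap.txt and add the rule above
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new
  ceph osd pool set <pool> crush_ruleset 1
  ceph osd pool set <pool> size 4
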
> >> I'm a bit confused why your performance is horrible though.  I'm
> >> assuming your HDDs are 7200 RPM.  With the SSD journals and
> >> replication=3, you won't have a ton of IO, but you shouldn't have any
> >> problem doing > 100 MB/s with 4 MB blocks.  Unless your SSDs are very
> >> low quality, the HDDs should be your bottleneck.
> >>
> >
> > The below setup is tomorrow's plan; today's reality is 3 OSDs on one
> > node and 2 OSDs on another, crappy SSDs, 1Gb networks, pgs stuck
> > unclean and no monitoring to pinpoint bottlenecks.  My work is cut out
> > for me.  :)
> >
> > Thanks for the helpful reply.  I wish we could just add a third OSD
> > node and have these issues just go away, but it's not in the budget
> > ATM.
> >
That's really unfortunate, because a third OSD node would solve a lot of
your problems and remove much of the potential for data loss.
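
As for the stuck PGs and the lack of monitoring, the built-in tools get
you a fair way before you need anything fancier. A sketch of the usual
first steps:

  # which PGs are stuck, and why
  ceph health detail
  ceph pg dump_stuck unclean
  # confirm the OSDs are where CRUSH thinks they are
  ceph osd tree

With only two OSD hosts and the default CRUSH rule separating replicas
by host, stuck unclean PGs often just mean CRUSH cannot find enough
hosts to place all the copies the pool size asks for.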

If you can add HDDs (budget- and space-wise), consider running the OSDs
on RAID1 pairs for the time being and sleep easier with a replication of
2 until you can add more nodes.
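
A rough sketch of that idea, not a tested recipe; the device names and
the pool name below are placeholders:

  # mirror two spare HDDs, then build the OSD on top of /dev/md0 as usual
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd

  # keep 2 copies, and keep serving I/O on a single copy during failures
  ceph osd pool set rbd size 2
  ceph osd pool set rbd min_size 1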

Christian

> >
> Ah, yeah, that explains the performance problems.  Although, crappy SSD
> journals are still better than no SSD journals.  When I added SSD
> journals to my existing cluster, I saw my write bandwidth go from 10
> MBps/disk to 50 MBps/disk.  Average latency dropped a bit, and the
> variance in latency dropped a lot.
> 
> Just adding more disks to your existing nodes would help performance,
> assuming you have room to add them.
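
To put numbers on that kind of change (and to find the bottlenecks
mentioned above), rados bench is the quickest tool; the pool name and
runtime below are just examples:

  # 60 second write test with 4 MB objects, keeping them for the read test
  rados bench -p rbd 60 write -b 4194304 --no-cleanup
  # sequential reads of the objects written above
  rados bench -p rbd 60 seq

Watching iostat -x on the OSD nodes while it runs shows whether the HDDs
or the SSD journals are the ones saturating.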


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

