CRUSH map advice

On Thu, 14 Aug 2014 12:07:54 -0700 Craig Lewis wrote:

> On Thu, Aug 14, 2014 at 12:47 AM, Christian Balzer <chibi at gol.com> wrote:
> >
> > Hello,
> >
> > On Tue, 12 Aug 2014 10:53:21 -0700 Craig Lewis wrote:
> >
> >> That's a low probability, given the number of disks you have.  I
> >> would've taken that bet (with backups).  As the number of OSDs goes
> >> up, the probability of multiple simultaneous failures goes up, and
> >> slowly becomes a bad bet.
> >>
> >
> > I must be very unlucky then. ^o^
> > As in, I've had dual disk failures in a set of 8 disks 3 times now
> > (within the last 6 years).
> > And twice that led to data loss, once with RAID5 (no surprise there)
> > and once with RAID10 (unlucky failure of neighboring disks).
> > Granted, that was with consumer HDDs and the last one with rather well
> > aged ones, too. But there you go.
> 
> Yeah, I'd say you're unlucky, unless you're running a pretty large
> cluster. I usually run my 8 disk arrays in RAID-Z2 / RAID6 though; 5
> disks is my limit for RAID-Z1 / RAID5.
> 
> I've been lucky so far.  No double failures in my RAID-Z1 / RAID5 arrays,
> and no triple failures in my RAID-Z2 / RAID6 arrays.  After 15 years and
> hundreds of arrays, I should've had at least one.  I have had several
> double failures in RAID1, but none of those were important.
> 
> 
> If this isn't a big cluster, I would suspect that you have a vibration or
> power issue.  Both are known to cause premature death in HDDs.  Of
> course, rebuilding a degraded RAID is also a well known cause of
> premature HDD death.
>
Precisely, that's what happened one time.
And the last time the disks were all getting long in the tooth (test
machine, not production); however, the disks that failed in rapid
succession weren't the most likely candidates based on SMART.
These weren't part of any (Ceph) cluster at all, just cases (Tyan) with
8 disks in them. Could be vibration, could be power (as in the PSU; the
actual line power is fine, DC quality).
However, the disks that died were all of a certain generation of Seagates,
and that generation died (just not as badly) in other cases (Supermicro),
too.

So I'm all for RAID6 or equivalent. 
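(In Ceph terms the rough equivalent, sketched here purely as an
illustration and not something from this thread, is two disks' worth of
redundancy, i.e. 3 replicas on a replicated pool, or an erasure-coded
profile with m=2. Using the stock "rbd" pool name as a placeholder:

# Keep three copies so two simultaneous OSD/disk failures are survivable,
# roughly analogous to RAID6's two-disk redundancy.
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
)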

Speaking of SMART, a juicy tidbit for all those people with mysterious
hot-spot disks.
I'm currently deploying and burning in a new cluster and noticed during
initialization (watching atop and iostat) that one disk had much higher
svctm and await values than the rest.
Testing with fio gave me close to 400 write IOPS on the good drives, with
an avio of 3ms during the test, and just 42 write IOPS with a 31ms avio
on the lame one.
Nothing in SMART, including the performance counters, suggested anything
out of the ordinary, and these are new drives.
Consumer-level drive or not, that one goes back to Toshiba.
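
For reference, roughly how to reproduce that kind of per-drive check.
This is only a sketch: the exact fio options I ran aren't in this mail,
/dev/sdX is a placeholder, and a raw-device write test like this destroys
whatever is on the disk:

# Random 4k writes straight to the device, 60 seconds, queue depth 32.
# Compare IOPS and completion latency across drives to spot the lame one.
fio --name=burnin --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based \
    --group_reporting

# Watch per-device await/svctm while the cluster initializes:
iostat -x 5

# And the SMART attributes that, in this case, showed nothing unusual:
smartctl -a /dev/sdX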
 
> 
> 
> > As for backups, those are for when somebody does something stupid and
> > deletes stuff they shouldn't have.
> > A storage system should be a) up all the time and b) not lose data.
> 
> 
> I completely agree, but never trust it.
> 
> Over the years, I've used backups to recover when:
> 
>    - I do something stupid
>    - My developers do something stupid
>    - Hardware does something stupid
>    - Manufacturer firmware does something stupid
>    - Manufacturer Tech support tells me to do something stupid
>    - My datacenter does something stupid
>    - My power companies do something stupid
> 
> I've lost data from a software RAID0, all the way up to a
> quadruply-redundant multi-million dollar hardware storage array.
>  Regardless of the promises printed on the box, it's the contingency
> plans that keep the paychecks coming.
>
Indeed. 
But personally I'd blame myself if a production cluster went down because
of something that I know is likely enough to happen. ^^

-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

