> -----Original Message-----
> From: abperiasamy@xxxxxxxxx [mailto:abperiasamy@xxxxxxxxx] On Behalf Of
> Anand Babu Periasamy
> Sent: Wednesday, April 11, 2012 10:13 AM
> To: 7220022
> Cc: gluster-devel@xxxxxxxxxx
> Subject: Re: GlusterFS Spare Bricks?
>
> On Tue, Apr 10, 2012 at 1:39 AM, 7220022 <7220022@xxxxxxxxx> wrote:
> >
> > Are there plans to add provisioning of spare bricks in a replicated
> > (or distributed-replicated) configuration? E.g., when a brick in a
> > mirror set dies, the system rebuilds it automatically on a spare,
> > similar to how it's done by RAID controllers.
> >
> > Not only would that improve practical reliability, especially of
> > large clusters, it would also make it possible to build
> > better-performing clusters out of less expensive components. For
> > example, instead of having slow RAID5 bricks on expensive RAID
> > controllers, one uses cheap HBAs and stripes a few disks per brick
> > in RAID0 - that's faster for writes than RAID 5/6 by an order of
> > magnitude (and, by the way, should improve the rebuild times in
> > Gluster that many are complaining about). A failure of one such
> > striped brick is not catastrophic in a mirrored Gluster - but it's
> > better to have spare bricks standing by, spread across cluster heads.
> >
> > A more advanced setup at the hardware level involves creating
> > "hybrid disks", whereby HDD vdisks are cached by enterprise-class
> > SSDs. It works beautifully and makes HDDs amazingly fast for random
> > transactions. The technology has become widely available on many
> > $500 COTS controllers. However, it is not widely known that the
> > results with HDDs in RAID0 under an SSD cache are 10 to 20 (!!)
> > times better than with RAID 5 or 6.
> >
> > There is no way to use RAID0 in commercial storage, the main reason
> > being the absence of hot spares. If, on the other hand, the spares
> > are handled by Gluster in the form of pre-fabricated (cached
> > hardware-RAID0) bricks, both very good performance and reasonably
> > sufficient redundancy should be easily achieved.
>
> Why not use the "gluster volume replace-brick ..." command? You can
> use external monitoring/management tools (e.g. freeipmi) to detect
> node failures and trigger replace-brick through a script. GlusterFS
> has the mechanism for hot spares, but the policy should be external.
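>
> Something along these lines could serve as a starting point. This is
> an untested sketch only: the volume and brick names are placeholders,
> the liveness probe is just an example (plug in freeipmi or whatever
> monitoring you already run), and the exact replace-brick sub-commands
> depend on the GlusterFS release.
>
> #!/usr/bin/env python
> # Illustrative only: swap a failed brick for a pre-provisioned spare.
> import socket
> import subprocess
> import sys
>
> VOLUME = "myvol"                  # hypothetical volume name
> FAILED = "node1:/bricks/b1"       # brick reported dead by monitoring
> SPARE = "node5:/bricks/spare1"    # pre-fabricated spare brick
>
> def brick_host_down(host, port=24007, timeout=5):
>     """Crude liveness probe: can we still reach glusterd on the host?"""
>     try:
>         socket.create_connection((host, port), timeout).close()
>         return False
>     except socket.error:
>         return True
>
> if __name__ == "__main__":
>     if not brick_host_down(FAILED.split(":", 1)[0]):
>         sys.exit(0)  # node is reachable; leave the decision to a human
>     # Sub-commands differ between releases ("start"/"commit" in older
>     # ones, "commit force" in newer ones); see the gluster man page.
>     subprocess.check_call(["gluster", "volume", "replace-brick", VOLUME,
>                            FAILED, SPARE, "commit", "force"])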
>
> [AS] That should work, but it'd still be prone to human error. In our
> experience, if we hadn't had hot spares (block storage) we'd surely
> have experienced catastrophic failures. First off, COTS disks (and
> controllers, if we talk GlusterFS nodes) have a break-in period during
> which the bad ones fail under load within a few months. Secondly, a
> lot of our equipment is in remote telco facilities where power,
> cleanliness or air conditioning can be far from ideal - leading to
> increasing failure rates about 2 years after deployment. As a rule, we
> keep at least 4 hot spares per two 24-bay enclosures, while our sister
> company, with a similar use profile, keeps 4-6 spares per enclosure,
> as they run older and less uniform equipment.
>
> A node may come back online in 5 mins; GlusterFS should not
> automatically make decisions.
>
> [AS] Good point, e.g. down for maintenance.
>
> I am thinking whether it makes sense to add hot spare as a standard
> feature, because GlusterFS detects failures.
>
> [AS] Given the reasons above, it'd be best if the feature could be
> turned on and off. Before attempting maintenance - turn it off. Once
> maintenance is complete and the node is up, the "turn hot spare on"
> command is issued, but it's queued until the reconstruction of the
> node begins - and is taken into consideration then (it won't attempt
> to sync to spare bricks if reconstruction to other good bricks has
> already begun).
>
> In half the cases, failed disks and controllers fail randomly and
> temporarily (due to dust, bad power, etc.). Most of the time the root
> cause is unknown or is impractical to debug in a live system. Block
> storage SANs have more or less standard configuration tools that take
> that into account. Here's a brief description in their terminology,
> which may help in creating the logic in GlusterFS:
>
> 1. Drives can have the statuses Online, Unconfigured Good,
>    Unconfigured Bad, Spare (LSP, a spare local to the drive group),
>    Global Spare (GSP, available across the system) and Foreign.
> 2. vDisks can be Optimal, Degraded, or Degraded (Rebuilding).
> 3. In the presence of spares, if a drive in a redundant vDisk fails,
>    the system marks the drive as Unconfigured Bad, and the vDisk picks
>    up the spare and enters the Rebuilding mode.
> 4. The system won't let you make an Unconfigured Bad drive Online, but
>    you can try a "make unconfigured good" command on it. If that
>    succeeds, the drive passes initialization, and SMART shows no
>    trouble, include it in a new vDisk, make it a spare, etc. If it's
>    bad - replace it.

Very useful points. Took notes. -ab
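
P.S. Purely as a thought experiment (none of this is an existing
GlusterFS interface, and the names are made up for illustration), a
brick-side analogue of the controller states you list might look
roughly like this:

# Hypothetical sketch only: restates the SAN-style drive/vDisk states
# from the list above in terms of bricks and replica sets.
from dataclasses import dataclass
from enum import Enum
from typing import List

class BrickState(Enum):
    ONLINE = "online"                 # serving data in a replica set
    UNCONFIGURED_GOOD = "uncfg_good"  # healthy, not yet in any volume
    UNCONFIGURED_BAD = "uncfg_bad"    # failed; needs admin attention
    SPARE = "spare"                   # spare local to one replica set (LSP)
    GLOBAL_SPARE = "global_spare"     # spare usable anywhere (GSP)
    FOREIGN = "foreign"               # carries data from another volume

class ReplicaSetState(Enum):
    OPTIMAL = "optimal"
    DEGRADED = "degraded"
    REBUILDING = "rebuilding"

@dataclass
class Brick:
    name: str
    state: BrickState

@dataclass
class ReplicaSet:
    bricks: List[Brick]
    state: ReplicaSetState = ReplicaSetState.OPTIMAL

def on_brick_failure(rset: ReplicaSet, failed: Brick,
                     spares: List[Brick], hot_spare_enabled: bool) -> None:
    """Analogue of item 3 above: mark the failed brick bad and, if the
    hot-spare feature is switched on and a spare exists, pull it in and
    rebuild; otherwise run degraded until an admin intervenes."""
    failed.state = BrickState.UNCONFIGURED_BAD
    rset.bricks.remove(failed)
    if hot_spare_enabled and spares:
        spare = spares.pop(0)
        spare.state = BrickState.ONLINE
        rset.bricks.append(spare)
        rset.state = ReplicaSetState.REBUILDING
    else:
        rset.state = ReplicaSetState.DEGRADED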