-----Original Message-----
From: abperiasamy@xxxxxxxxx [mailto:abperiasamy@xxxxxxxxx] On Behalf Of Anand Babu Periasamy
Sent: Wednesday, April 11, 2012 10:13 AM
To: 7220022
Cc: gluster-devel@xxxxxxxxxx
Subject: Re: GlusterFS Spare Bricks?

On Tue, Apr 10, 2012 at 1:39 AM, 7220022 <7220022@xxxxxxxxx> wrote:
>
> Are there plans to add provisioning of spare bricks in a replicated (or distributed-replicated) configuration? E.g., when a brick in a mirror set dies, the system rebuilds it automatically on a spare, similar to how it's done by RAID controllers.
>
> Not only would this improve practical reliability, especially of large clusters, but it would also make it possible to build better-performing clusters from less expensive components. For example, instead of having slow RAID5 bricks on expensive RAID controllers, one uses cheap HBAs and stripes a few disks per brick in RAID0 - that is faster for writes than RAID 5/6 by an order of magnitude (and, by the way, should improve the rebuild times in Gluster that many are complaining about). A failure of one such striped brick is not catastrophic in a mirrored Gluster - but it is better to have spare bricks standing by, strewn across cluster heads.
>
> A more advanced setup at the hardware level involves creating "hybrid disks" whereby HDD vdisks are cached by enterprise-class SSDs. It works beautifully and makes HDDs amazingly fast for random transactions. The technology has become widely available on many $500 COTS controllers. However, it is not widely known that the results with HDDs in RAID0 under SSD cache are 10 to 20 (!!) times better than with RAID 5 or 6.
>
> There is no way to use RAID0 in commercial storage, the main reason being the absence of hot spares. If, on the other hand, the spares are handled by Gluster in the form of pre-fabricated (cached hardware-RAID0) bricks, both very good performance and reasonably sufficient redundancy should be easily achieved.

Why not use the "gluster volume replace-brick ..." command? You can use external monitoring/management tools (e.g. freeipmi) to detect node failures and trigger replace-brick through a script. GlusterFS has the mechanism for hot spares, but the policy should be external.

[AS] That should work (something along the lines of the sketch further down), but it would still be prone to human error. In our experience, if we had not had hot spares (block storage) we would surely have experienced catastrophic failures. First off, COTS disks (and controllers, if we talk GlusterFS nodes) have a break-in period when the bad ones fail under load within a few months. Secondly, a lot of our equipment is in remote telco facilities where power, cleanliness or air conditioning can be far from ideal - leading to increasing failure rates about 2 years after deployment. As a rule, we keep at least 4 hot spares per two 24-bay enclosures, while our sister company, with a similar usage profile, keeps 4-6 spares per enclosure, as they run older and less uniform equipment.

A node may come back online in 5 minutes; GlusterFS should not automatically make decisions.

[AS] Good point - e.g. a node taken down for maintenance.
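[AS] For reference, a rough sketch of the kind of external script the replace-brick approach implies. This is only an illustration: ipmiping from freeipmi is assumed for the health check, and the volume name, hosts and brick paths are placeholders.

#!/usr/bin/env python
# Rough sketch only: swap a spare brick in when a node stops answering IPMI.
# Volume name, hosts and brick paths are placeholders; the exact
# replace-brick sub-command ("commit force") may vary between GlusterFS versions.
import subprocess

VOLUME = "vol0"
FAILED_HOST = "node2"
FAILED_BRICK = "node2:/bricks/b1"
SPARE_BRICK = "node5:/bricks/spare1"   # pre-fabricated RAID0 brick kept empty

def node_is_dead(host):
    # ipmiping (freeipmi) exits non-zero when the BMC does not answer;
    # any other health check (ping, gluster peer status) would do as well.
    return subprocess.call(["ipmiping", "-c", "2", host + "-ipmi"]) != 0

def replace_brick():
    # Point the volume at the spare brick and let self-heal rebuild the data.
    subprocess.check_call(["gluster", "volume", "replace-brick", VOLUME,
                           FAILED_BRICK, SPARE_BRICK, "commit", "force"])

if __name__ == "__main__":
    if node_is_dead(FAILED_HOST):
        replace_brick()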
I am thinking whether it makes sense to add hot-spare as a standard feature, because GlusterFS detects failures.

[AS] Given the reasons above, it would be best if the feature could be turned on and off. Before attempting maintenance - turn it off. Once maintenance is complete and the node is back up, the "turn hotspare on" command is issued, but it is queued until reconstruction of the node begins - and it takes that into consideration (it won't attempt to sync to spare bricks if reconstruction to other good bricks has already begun). In half the cases, failed disks and controllers fail randomly and temporarily (due to dust, bad power, etc.). Most of the time the root cause is unknown or is impractical to debug in a live system.

Block-storage SANs have more or less standard configuration tools that also take that into account. Here's a brief description in their terminology, which may help in creating the logic in GlusterFS (see the P.S. at the end of this message for the same state model sketched in code):

1. Drives can have the statuses Online, Unconfigured Good, Unconfigured Bad, Spare (LSP, a spare local to the drive group), Global Spare (GSP, across the system) and Foreign.
2. vDisks can be Optimal, Degraded, or Degraded (Rebuilding).
3. In the presence of spares, if a drive in a redundant vDisk fails, the system marks the drive as Unconfigured Bad; the vDisk picks up a spare and enters Rebuilding mode.
4. The system won't let you make an Unconfigured Bad drive Online, but you can try a "make unconfigured good" command on it. If that succeeds, and the drive passes initialization and shows no trouble in SMART, include it in a new vDisk, make it a spare, etc. If it is bad, replace it.

--
Anand Babu Periasamy
Blog [ http://www.unlocksmith.org ]
Twitter [ http://twitter.com/abperiasamy ]

Imagination is more important than knowledge --Albert Einstein
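[AS] P.S. To make the mapping concrete, here is a rough sketch of the controller state model above restated as a brick/hot-spare policy. This is not existing GlusterFS code; all names are illustrative only.

#!/usr/bin/env python
# Rough sketch only - not GlusterFS code. It restates the controller
# terminology above (items 1-4) as a brick/hot-spare policy.
from enum import Enum

class BrickState(Enum):
    ONLINE = "online"                    # healthy member of a replica set
    UNCONFIGURED_GOOD = "unconf-good"    # usable but not assigned anywhere
    UNCONFIGURED_BAD = "unconf-bad"      # failed; re-qualify or replace
    SPARE = "spare"                      # LSP: spare local to one replica set
    GLOBAL_SPARE = "global-spare"        # GSP: spare usable by any replica set

class VolumeState(Enum):
    OPTIMAL = "optimal"
    DEGRADED = "degraded"
    REBUILDING = "degraded-rebuilding"

def on_brick_failure(spares):
    # Item 3: the failed brick becomes Unconfigured Bad; if a spare exists,
    # the volume starts rebuilding onto it, otherwise it stays degraded.
    if spares:
        return VolumeState.REBUILDING, spares.pop(0)
    return VolumeState.DEGRADED, None

def requalify(state, init_ok, smart_ok):
    # Item 4: an Unconfigured Bad brick may only come back as Unconfigured
    # Good after it passes initialization and SMART checks - never straight
    # back to Online.
    if state is BrickState.UNCONFIGURED_BAD and init_ok and smart_ok:
        return BrickState.UNCONFIGURED_GOOD
    return state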