On 10/04/2012 09:39, 7220022 wrote:
Are there plans to add provisioning of spare bricks in a replicated
(or
distributed-replicated) configuration? E.g., when a brick in a mirror
set dies, the system rebuilds it automatically on a spare, similar to
how it's done by RAID controllers.
Not only would it improve practical reliability, especially of large
clusters, but it would also make it possible to build better-performing
clusters from less expensive components. For example, instead of having
slow RAID5 bricks on expensive RAID controllers, one uses cheap HBA-s
and stripes a few disks per brick in RAID0 - that's faster for writes
than RAID 5/6 by an order of magnitude (and, by the way, should improve
the rebuild times in Gluster that many are complaining about). A failure
of one such striped brick is not catastrophic in a mirrored Gluster -
but it's better to have spare bricks standing by, strewn across cluster
heads.
A more advanced setup at the hardware level involves creating "hybrid
disks", wherein HDD vdisks are cached by enterprise-class SSD-s. It
works beautifully and makes HDD-s amazingly fast for random
transactions. The technology has become widely available on many $500
COTS controllers. However, it is not widely known that the results with
HDD-s in RAID0 under SSD cache are 10 to 20 (!!) times better than
with RAID 5 or 6.
On reads the difference should be negligible unless the array is degraded.
If it's not, your RAID controller is unfit for purpose.
[AS] I refer to random IOPS in the 70K to 200K range on vdisks in RAID 0
vs. 5 behind a large SSD cache.
But are they read or read-write IOPS? RAID5/6 is going to hammer you on
random writes because of the RMW overheads, unless your SSD is being
used for write-behind caching all the writes (which could be deemed
dangerous).
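The RMW penalty is easy to quantify: a sub-stripe random write to RAID5
costs four disk operations (read old data, read old parity, write new
data, write new parity), so random-write IOPS collapse relative to
RAID0. A back-of-the-envelope model (my own illustration, with made-up
per-disk numbers, not figures from this thread):

```python
# Rough model of small-random-write IOPS: RAID0 vs RAID5.
# Assumes writes are smaller than a stripe, so RAID5 takes the
# read-modify-write path: 4 disk ops per logical write.

def raid0_write_iops(disks: int, iops_per_disk: int) -> float:
    # RAID0: each logical write is a single disk op, spread over all disks.
    return float(disks * iops_per_disk)

def raid5_write_iops(disks: int, iops_per_disk: int) -> float:
    # RAID5 RMW: read old data + read old parity + write data + write parity.
    return disks * iops_per_disk / 4

if __name__ == "__main__":
    disks, per_disk = 4, 150          # e.g. four 7.2k-rpm SATA drives
    print(raid0_write_iops(disks, per_disk))   # 600.0
    print(raid5_write_iops(disks, per_disk))   # 150.0
```

A write-behind SSD cache hides this penalty only until the cache has to
destage to the parity array, at which point the 4x amplification
reappears on the backend.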
Behavior of such "hybrid vdisks" is different from that of pure SSD- or
HDD-based ones. Unlike a DDR RAM cache, an SSD's total R+W bandwidth in
MB/s is capped at roughly its maximum read-only throughput. Hence the
front-end read performance is degraded by the amount of (sequential)
write load onto the cache upstream from the HDD-s. And vice versa: the
write performance of the hybrid is degraded by the slow write speed of a
RAID 5/6 array behind the cache - especially at larger queue depths.
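In other words, the cache SSD's read and write streams share one
bandwidth budget, so destage traffic directly eats into front-end
reads. A toy model of that contention (my own sketch; the throughput
numbers are hypothetical):

```python
# Toy model: a caching SSD whose combined read+write throughput is
# capped near its read-only maximum. Bandwidth spent flushing dirty
# data down to the HDD array is unavailable to front-end reads.

def frontend_read_mbps(ssd_max_mbps: float, flush_mbps: float) -> float:
    # Flush traffic (cache -> HDD array) competes with reads on the SSD.
    return max(ssd_max_mbps - flush_mbps, 0.0)

if __name__ == "__main__":
    ssd_max = 500.0      # hypothetical SSD throughput ceiling, MB/s
    light_flush = 80.0   # light destage load
    heavy_flush = 300.0  # heavy destage load, e.g. under sustained writes
    print(frontend_read_mbps(ssd_max, light_flush))   # 420.0
    print(frontend_read_mbps(ssd_max, heavy_flush))   # 200.0
```

The same budget works in the other direction: a slow RAID 5/6 backend
drains the cache slowly, so under sustained writes the cache fills and
front-end write speed sinks toward the backend's rate.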
I'm not sure I grok what you are saying (if you are saying what I think
you are saying). Surely any sane performance oriented setup would be
write-behind caching on the SSD (and then flushing it out to the RAID
array when there is some idle time).
Have you looked at flashcache? It's not as advanced as ZFS' L2ARC, but
if for whatever reason ZFS isn't an option for you, it's still an
improvement.
These limitations, when superimposed on most "real-world" test
patterns, leave the array only marginally better for both writes and
reads than an HDD-based RAID10 array with the same number of drives.
Not quite sure why, but it's removing the write speed limit of the
HDD-s by changing the RAID level from 5 to 0 that clears the
bottleneck.
If that is the case, then clearly your SSD isn't being used for write
caching which removes most of the benefit you are going to get from it.
See under RAID controller being unfit for purpose. :)
The relative difference, for both reads and writes, is much larger than
the write performance gap between pure-HDD RAID 0 and RAID 5 vdisks.
I can only guess that this is an artifact of something anti-clever that
the RAID controller is doing. I gave up on hardware RAID controllers
over a decade ago for similar reasons.
Having said that, a lot of RAID controllers are pretty useless.
[AS] the newer LSI 2208-based ones seem okay and recent firmware/drivers
finally stable.
I'm not convinced. I have some LSI cards in several of my boxes and they
very consistently drop disks that are running SMART tests in the
background. I have yet to find a firmware or driver version that
corrects this. There is a RHBZ ticket open for this somewhere but I
can't seem to find it at the moment.
But I agree: we always leave out RAID features apart from
stripe or mirror and do everything by software. Advanced features
(FastPath, CacheCade) though are fantastic if you use SSD-s, either
standalone or as HDD cache. In fact we use controllers instead of simple
HBA-s only to take advantage of these features.
Your experience of the performance that you mention above shows that
they aren't that great in a lot of cases. I've found that software RAID
has been faster than hardware RAID since before the turn of the century,
and ZFS cuts off a few more corners.
There is no way to use RAID0 in commercial storage, the main reason
being the absence of hot-spares. If, on the other hand, the spares are
handled by Gluster in the form of pre-fabricated (cached hardware-RAID0)
bricks, both very good performance and reasonably sufficient redundancy
should be easily achievable.
So why not use ZFS instead? The write performance is significantly better
than traditional RAID equivalents and you get vastly more flexibility than
with any hardware RAID solution. And it supports caching data onto SSDs.
[AS] Good point. We have no experience with it, but we should try. Do
you know if it can be made distributed/"parallel" like Gluster, and
whether it supports RDMA transport for storage traffic between heads?
In a word - no. I was referring to using ZFS as the backing FS for GLFS.
The main reason we've been
looking into Gluster is cheap bandwidth: all our servers and nodes are
connected via 40Gbit IB fabric, 2 ports per server, 4 on some larger ones,
non-blocking edge switches, directors at floor level etc - 80 to 90% idle.
Can you make global spares in ZFS?
No, ZFS is a single-node FS. It can replace your RAID + local FS stack,
but you would still need to use GLFS on top to get the multi-node
distributed features.
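For concreteness, the stack being suggested here would look roughly like
the following on each node - ZFS replacing the RAID controller plus
local filesystem, GlusterFS providing the cross-node replication. This
is an untested configuration sketch; all device, pool, host and volume
names are hypothetical:

```shell
# ZFS pool of mirrored HDD pairs, with an SSD as read cache (L2ARC).
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
zpool add tank cache /dev/nvme0n1

# A dataset to serve as the local Gluster brick.
zfs create -o mountpoint=/bricks/brick1 tank/brick1

# GlusterFS replicates bricks across nodes (run once, from any node).
gluster volume create gv0 replica 2 \
    node1:/bricks/brick1 node2:/bricks/brick1
gluster volume start gv0
```

ZFS handles local redundancy, checksumming and SSD caching within one
box; the replica count in Gluster then covers whole-node failure.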
Gordan