On Wed, 09 Nov 2011 17:50:00 +0100, Magnus Näslund
<magnus@xxxxxxxxxxxxxxx> wrote:
[...]
We want the data replicated at least 3 times physically (box-wise),
so we've ordered 3 test servers with 24x3TB "enterprise" SATA disks
each with an areca card + bbu. We'll probably be running the tests
feeding raid volumes to glusterfs, and from what I've seen this seems
to be a standard.
With that amount of space I hope you are going to be using something
like ZFS rather than normal RAID. Otherwise you are likely to find the
error rate will slowly and silently eat your data.
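To put a rough number on that, here is a back-of-envelope sketch. The error rate of 1 unrecoverable read error per 10^14 bits is an assumption of the sort quoted on desktop SATA spec sheets, not a figure from this thread, and it only covers errors the disk actually reports. Silent corruption comes on top of that and is not on any datasheet.

```python
# Expected unrecoverable read errors (UREs) when reading one whole box once.
# URE_PER_BIT is an ASSUMED spec-sheet style figure, not a measured value.

DISKS_PER_BOX = 24           # from the thread
DISK_BYTES = 3 * 10**12      # 3TB disks, decimal terabytes
URE_PER_BIT = 1e-14          # assumption: 1 error per 10^14 bits read


def expected_read_errors(total_bytes: int, ure_per_bit: float) -> float:
    """Expected number of reported read errors for one full pass over the data."""
    return total_bytes * 8 * ure_per_bit


box_bytes = DISKS_PER_BOX * DISK_BYTES
print(f"{expected_read_errors(box_bytes, URE_PER_BIT):.2f} expected errors per full read")
```

Under that assumption a single full read of one 72TB box would be expected to hit several bad sectors, which is why something has to be checking the data.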
Possible future:
Since our storage system will be in it for a really long term, we're
looking at the total economics of the solution vs. the data safety
concerns.
We've seen suggestions on letting glusterfs manage the disk directly.
What exactly do you mean by that? GlusterFS requires a normal xattr
capable FS underneath it. Thus I presume you are referring to using GLFS
instead of RAID (i.e. stripe+distribute).
The way I see it, this would give a win in that
1) We would be using all disks, no RAID/spare storage overhead
2) No RAID-rebuilds
3) ...
4) Profit
Also, we know that any long-term system we build should be planned
around replacing disks continuously.
My main concern with such data volumes would be the error rates of
modern disks. If your FS doesn't have automatic checking and block level
checksums, you will suffer data corruption, silent or otherwise. Quality
of modern disks is pretty appalling these days. One of my experiences is
here:
http://www.altechnative.net/?p=120
but it is by no means the only one.
Currently the only FS that meets all of my reliability criteria is ZFS
(and the linux port works quite well now), and it has saved me from data
corruption, silent and otherwise, a number of times by now, in cases
where normal RAID wouldn't have helped.
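For anyone unfamiliar with the idea, block level checksumming can be sketched roughly as below. This is only an illustration of the principle, not ZFS's on-disk format or algorithm; the block size, the use of SHA-256 and the function names are all my own choices for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size for the illustration


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 hex digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data: bytes, expected: list[str],
                        block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indexes of blocks whose current checksum differs from the stored one."""
    current = checksum_blocks(data, block_size)
    return [i for i, (now, then) in enumerate(zip(current, expected)) if now != then]


original = bytes(range(256)) * 64      # 16 KiB, i.e. 4 blocks
stored = checksum_blocks(original)     # checksums recorded at write time

damaged = bytearray(original)
damaged[5000] ^= 0x01                  # flip a single bit inside block 1

print(find_corrupt_blocks(bytes(damaged), stored))  # prints [1]
```

The point is that the disk happily returns the damaged block without any error, and only the stored checksum reveals it. With a redundant copy available, the bad block can then be rewritten from the good one.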
So in my mind we could buy quality boxes with 24-36 disks run by 3-4
SATA controller cards (Marvell?),
My experience with Marvell cards is limited. Do they have 8-port cards?
I use 8-port LSI cards without any serious problems. The only issue I
have seen is that they tend to reset the bus when a disk is slow to
respond (specifically while it is running a SMART self-test). That
means you lose the SMART short/long self-test option for monitoring,
but this is mitigated by weekly ZFS scrubs, which I trust more anyway.
using cheap and large desktop disks
(maybe not the "green" variety).
I would suggest you at the very least use disks that have
Write-Read-Verify capabilities. My recent experience shows that only
Seagates include this feature, even though, as it turns out, Samsung
seems to own the patent on it (and my Samsungs definitely don't have
that feature). If you do this, you may also want to look into the WRV
patch for hdparm that I submitted upstream, although there hasn't been
an hdparm release since then.
Another good idea is to use disks of similar spec from a different
manufacturer in different machines, and make sure that your glfs bricks
are mirrored so that each copy sits on disks of a different make.
We could have a reporting system on
top of glusterfs that reports defective disks that would be replaced
as part of our on-duty maintenance. Since the storage is replicated
over 3+ boxes, the breakage of a single disk would not compromise the
data safety as long as the disks are replaced in a timely manner.
Bear in mind that your network bandwidth is unlikely to be as good as
your internal disk bandwidth, and restoring a 3TB brick by doing a "ls
-laR" is likely to take a very long time. So you may be better off with
RAIDZ2/RAIDZ3 or even just mirrored volumes in each of the machines,
distributed using glfs, in terms of single disk failure recovery time.
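As a rough illustration of the recovery time, here is a lower-bound estimate for refilling one 3TB brick over the network. Both throughput figures are assumptions for the sake of the example (roughly what gigabit Ethernet can carry at best, and a made-up lower effective rate to stand in for per-file overhead), not measurements.

```python
# Best-case time to re-replicate one brick across the network.
# Both throughput values below are ASSUMPTIONS, not measurements.

BRICK_BYTES = 3 * 10**12            # one 3TB brick, from the thread
GIGE_BYTES_PER_SEC = 110 * 10**6    # assumed: ~110 MB/s usable on gigabit Ethernet
SLOW_BYTES_PER_SEC = 30 * 10**6     # assumed: effective rate with per-file overhead


def hours_to_copy(total_bytes: float, bytes_per_sec: float) -> float:
    """Hours needed to move total_bytes at a sustained bytes_per_sec."""
    return total_bytes / bytes_per_sec / 3600


print(f"wire speed: {hours_to_copy(BRICK_BYTES, GIGE_BYTES_PER_SEC):.1f} h")
print(f"with overhead: {hours_to_copy(BRICK_BYTES, SLOW_BYTES_PER_SEC):.1f} h")
```

Even at wire speed that is most of a working day per failed disk, during which the data on that brick has one copy fewer than planned. A local RAIDZ or mirror rebuild does not have to cross the network at all.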
Anyway, to summarize:
1) With large volumes of data, you need something other than the disk's
sector checksums to keep your data correct, i.e. a checksum checking FS.
If you don't, expect to see silent data corruption sooner or later.
2) Don't use the same make of disk in all the servers - I have seen
multiple disks from the same manufacturer fail minutes apart more than
once.
3) Use WRV features if they are available.
4) Make sure your glfs bricks are mirrored between machines in such a
way that the underlying disks are different (e.g. say you have 24 disks
in each box, divided into 3x 8-disk RAIDZ3 volumes. Use each one of
those 8-disk volumes as a brick, and mirror it to another similar
machine so that the 8 disks on the other server are from a different
manufacturer).
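The layout in point 4 can be sketched as a small planning script. The server names and the "make-A"/"make-B" labels are placeholders, and the greedy pairing is just one simple way to do it; it may give up on inventories where a cleverer matching would succeed.

```python
from dataclasses import dataclass

DISK_TB = 3            # from the thread
DISKS_PER_VOLUME = 8   # 8-disk RAIDZ3 volumes, as in point 4
PARITY_DISKS = 3       # RAIDZ3 uses triple parity


@dataclass(frozen=True)
class Brick:
    server: str      # placeholder host name
    volume: int      # index of the 8-disk volume within the box
    disk_make: str   # placeholder manufacturer label


def pair_bricks(a: list[Brick], b: list[Brick]) -> list[tuple[Brick, Brick]]:
    """Greedily pair each brick in a with an unused brick in b of a different make."""
    remaining = list(b)
    pairs = []
    for brick in a:
        partner = next((c for c in remaining if c.disk_make != brick.disk_make), None)
        if partner is None:
            raise ValueError(f"no partner of a different make for {brick}")
        remaining.remove(partner)
        pairs.append((brick, partner))
    return pairs


box1 = [Brick("server1", v, "make-A") for v in range(3)]
box2 = [Brick("server2", v, "make-B") for v in range(3)]

# Raw data capacity per brick, before any filesystem overhead.
brick_tb = (DISKS_PER_VOLUME - PARITY_DISKS) * DISK_TB

for left, right in pair_bricks(box1, box2):
    print(f"{left.server}:vol{left.volume} ({left.disk_make}) <-> "
          f"{right.server}:vol{right.volume} ({right.disk_make}), ~{brick_tb}TB")
```

With 24 disks per box this gives three bricks of roughly 15TB each, so about 45TB of data space out of 72TB raw per machine. That is the price of the local redundancy.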
The glfs part on top is relatively straightforward and will "just work"
provided you use a reasonably sane configuration. It is the layers
underneath that you will need to get right to keep your data healthy.
Gordan