On 11/09/2011 06:51 PM, Gordan Bobic wrote:
My main concern with such data volumes would be the error rates of
modern disks. If your FS doesn't have automatic checking and block level
checksums, you will suffer data corruption, silent or otherwise. Quality
of modern disks is pretty appalling these days. One of my experiences is
here:
http://www.altechnative.net/?p=120
but it is by no means the only one.
Interesting read, and I agree that RAID data corruption and hard disk
untrustworthiness are a huge problem. To combat this we're
thinking of using a crude health checking utility that would use
checksum files, on top of whatever we end up using (glusterfs or
otherwise). These scripts would be specific to our application, and file
based.
In glusterfs I believe that it would be possible to do the checksum
checking locally on the nodes, since the underlying filesystem is
accessible?
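
A rough sketch of what such a file-based checksum check could look like, run locally against a directory on a node. Everything here is an assumption for illustration: the function names, the manifest format (one "digest  relative/path" line per file) and the choice of SHA-256 are hypothetical, not something glusterfs provides. The manifest should live outside the tree being checked, and file names containing newlines are not handled.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(root, manifest_path):
    """Record one 'digest  relative/path' line for every file under root."""
    with open(manifest_path, "w", encoding="utf-8") as out:
        for dirpath, _dirs, files in os.walk(root):
            for name in sorted(files):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                out.write(f"{file_sha256(full)}  {rel}\n")


def verify_manifest(root, manifest_path):
    """Re-hash every file listed in the manifest.

    Returns a list of (relative_path, problem) tuples; an empty list
    means every listed file is present and matches its recorded digest.
    """
    problems = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            digest, rel = line.split("  ", 1)
            full = os.path.join(root, rel)
            if not os.path.isfile(full):
                problems.append((rel, "missing"))
            elif file_sha256(full) != digest:
                problems.append((rel, "checksum mismatch"))
    return problems
```

Note that a check like this only detects corruption; unlike a checksumming FS with redundancy it cannot repair anything, so a mismatch still has to be fixed from a good copy elsewhere.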
Currently the only FS that meets all of my reliability criteria is ZFS
(and the linux port works quite well now), and it has saved me from data
corruption, silent and otherwise, a number of times by now, in cases
where normal RAID wouldn't have helped.
We're using OpenSolaris+ZFS today in production, if glusterfs works well
on OpenSolaris that might very well be what we end up with.
We're a Linux shop, but we settled on OpenSolaris for ZFS alone.
Are you running glusterfs on Solaris and/or Linux in production?
So in my mind we could buy quality boxes with 24-36 disks run by 3-4
SATA controller cards (Marvell?),
My experience with Marvell cards is limited. Do they have 8-port cards?
I use 8-port LSI cards without any serious problems. The only issue I
have seen is that they tend to reset the bus when the disk is slow to
respond (specifically due to running a SMART self-test), which means
that on one hand you lose the SMART short/long self-test option for
monitoring, but this is mitigated by weekly ZFS scrubs which I trust
more anyway.
We're using LSI cards now as well for the Solaris servers, IIRC.
We'd use the cards with the best reputation.
[snip]
Anyway, to summarize:
1) With large volumes of data, you need something other than the disk's
sector checksums to keep your data correct, i.e. a checksum checking FS.
If you don't, expect to see silent data corruption sooner or later.
The silent corruption case can be mitigated in an application-specific
way for us, as described above. Having that done automatically by ZFS is
definitely interesting in several ways. Does glusterfs have (or plan to
have) scrubbing-like functionality that checks the data?
2) Don't use the same make of disk in all the servers - I have seen
multiple disks from the same manufacturer fail minutes apart more than
once.
3) Use WRV features if they are available.
4) Make sure your glfs bricks are mirrored between machines in such a
way that the underlying disks are different (e.g. say you have 24 disks
in each box, divided into 3x 8-disk RAIDZ3 volumes. Use each one of
those 8-disk volumes as a brick, and mirror it to another similar
machine so that the 8 disks on the other server are from a different
manufacturer).
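
Point 4 can be sanity-checked before a volume is created. The following is only a sketch under assumptions: the dictionary layout, the brick paths and the vendor names are made up for illustration, and nothing here talks to glusterfs itself. It simply flags mirrored brick pairs whose underlying disks share a make.

```python
def same_make_pairs(replica_pairs):
    """Return the replica pairs whose two bricks sit on disks of the same make.

    replica_pairs is a list of (brick_a, brick_b) tuples, where each brick is
    a dict with hypothetical keys "path" and "make".
    """
    return [
        (a, b)
        for a, b in replica_pairs
        if a["make"].strip().lower() == b["make"].strip().lower()
    ]


# Hypothetical layout: 3 bricks per server, each brick an 8-disk RAIDZ3 volume.
planned = [
    ({"path": "server1:/tank0/brick", "make": "vendor-a"},
     {"path": "server2:/tank0/brick", "make": "vendor-b"}),
    ({"path": "server1:/tank1/brick", "make": "vendor-b"},
     {"path": "server2:/tank1/brick", "make": "vendor-c"}),
    ({"path": "server1:/tank2/brick", "make": "vendor-c"},
     {"path": "server2:/tank2/brick", "make": "vendor-c"}),  # same make on both sides
]

for a, b in same_make_pairs(planned):
    print(f"same disk make on both sides: {a['path']} <-> {b['path']} ({a['make']})")
```

With the layout above, only the third pair is reported, since both of its bricks are backed by the same make of disk.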
The glfs part on top is relatively straightforward and will "just work"
provided you use a reasonably sane configuration. It is the layers
underneath that you will need to get right to keep your data healthy.
Gordan
These are all excellent points.
Thank you for the input!
Regards,
Magnus