Re: Correct RAID options

On Wed, Aug 20, 2014 at 2:22 AM, David Brown <david.brown@xxxxxxxxxxxx> wrote:

> In general, a 15 disk raid5 array is asking for trouble.  At least make it
> raid6.

At this stage the IO load on the archiver with the 15 disk RAID5 is
-very- minimal.  It's not even writing 8MB/s currently, as the front
end RAID10 servers are severely hampered whilst handling the
concurrent read/write requests.  Now that we're in our peak period,
load averages on those servers shoot up to over 80 from time to time
due to IO wait, so this is kinda critical for me right now :-(

Just a bit more background as was asked in the other replies...
Front end servers are Dell PowerEdge R720 DX150s (8 x 4TB SATA-III,
64GB RAM, and dual Xeon E5-2620 hex-core @ 2.00GHz).
The archiver is custom built (no brand name) and consists of 15 x
4TB SATA-II drives, 32GB RAM, and a single Xeon E3-1245 quad-core @
3.3GHz.

Now the archiver we added is new - so I can't really comment at this
stage on how it is performing as it is not getting any real work from
the front ends.  During our standard benching (hdparm / dd / bonnie)
with no load on the archiver in terms of IO, performance was more than
adequate.
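
For reference, the benching was nothing fancy - roughly along these
lines (device and mount point names here are placeholders, and the
bonnie++ size is just twice the archiver's RAM to defeat caching):

  # raw sequential read, cached and uncached
  hdparm -tT /dev/sda

  # large sequential write, bypassing the page cache
  dd if=/dev/zero of=/mnt/archive/ddtest bs=1M count=10000 oflag=direct

  # bonnie++ run sized at roughly twice RAM
  bonnie++ -d /mnt/archive -s 64g -u nobody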

In terms of the front-ends with our "normal" load distribution of a
70/30 split between writes and reads, there are no serious performance
problems.  With over 500 concurrent application threads per server
accessing the files on the disks, load averages are generally in the
3 to 5 range, with very little IO wait.  Munin reports "disk
utilization" between 20% and 30%, "disk latency" below 100ms, and
"disk throughput" at about 30MB/s if I average all of this out.

Since we've now started to move data from the front ends to the
archiver, we have obviously thrown the 70/30 split out of the window,
and all stats are basically now off the charts. "disk utilization" is
averaging between 90% and 100%.  Reading the data off the front end
servers is clearly the bottleneck: as soon as we stop the archiving
process that reads the data on the front ends and writes it to the
archiver, the load on the servers returns to normal.
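
For what it's worth, the same picture is visible directly with iostat
while the archiving job runs (extended stats at 5 second intervals;
%util pinned near 100 and await climbing on the data disks is what
I'm taking as confirmation):

  iostat -x 5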

In terms of adding more front end servers - it is definitely an
option, yes.  Being brand name servers they do come at a premium,
however, so I would ideally like to keep that as a last resort.  The
premium cost, together with the limited storage capacity, is basically
what made us opt to offload some of the storage requirements to
cheaper alternatives instead (more than double the capacity - even at
RAID10 - for less than half the price; realistically we will be more
than happy with half the performance as well, so I'm not expecting
miracles either).

RAID rebuilds are already problematic on the front end servers (RAID
10 over 8 x 4TB): a single drive failure whilst the server is under
load takes approximately 8 hours to rebuild, if memory serves me
correctly.  We've had a few failures in the past (even a double drive
failure at the same time), but nothing recent that I can recall
accurately.

I was never aware that bigger block sizes would increase read
performance though - this is interesting and something I can
definitely explore.  I am talking under correction, but I believe the
MegaRAIDs we're using can go even bigger than 1MB blocks.  I'll have
to check on this.  Bigger blocks do mean wasting more space though if
the files written are smaller and can't fill up an entire block,
right?  I suppose when you start talking about 12TB and 50TB arrays,
the amount of wasted space really becomes insignificant, or am I
mistaken?
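
If it helps the discussion, I'll pull the configured stripe size off
the controllers with something along these lines (assuming MegaCli is
installed in its usual location - the binary name varies between
versions):

  # logical drive info includes the configured strip/stripe size
  /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | grep -i strip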

A SAN is unfortunately out of the question, as this is hosted
infrastructure at a provider that does not offer SANs as part of
their product offerings.


> But the general idea is to have a set of raid1 mirrors (or possible Linux md
> raid10,far2 pairs if the traffic is read-heavy), and then tie them all
> together using a linear concatenation rather than raid0 stripes.  When you

Can I perhaps ask that you elaborate a bit on what you mean by
linear concatenation?  I am presuming you are not referring to RAID 10
per se here, given your comment to use this rather than RAID 0
stripes.  XFS by itself is also a good option - I honestly do not
know why this wasn't given consideration when we initially set the
machines up.  By the sound of it, all of them are now going to be
facing a rebuild.
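
Just so I'm sure I follow you, my understanding of the layout you're
describing on an 8 drive front end would be roughly this (purely a
sketch with made-up device names):

  # four raid1 mirror pairs
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdh /dev/sdi

  # tie the pairs together as a linear concatenation rather than a raid0 stripe
  mdadm --create /dev/md10 --level=linear --raid-devices=4 \
      /dev/md1 /dev/md2 /dev/md3 /dev/md4

  # XFS on top, so its allocation groups spread files across the pairs
  mkfs.xfs /dev/md10

Is that more or less what you are describing?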

> I am assuming your files are fairly small - if your reads or writes are
> often smaller than a full stripe of raid10 or raid5, performance will suffer
> greatly compared to XFS on a linear concat.

The files are VERY evenly distributed using md5 hashes.  We have 16
top level directories, 255 second level directories, and 4094 third
level directories.  Each third level directory currently holds between
4K and 4.5K files (the archiver servers should hold roughly three or
four times that amount once the disks are full).  Files are generally
between 250KB and 750KB, a small percentage are a bit larger, up to
the 1.5MB range, and I can almost guarantee that not a single file
will exceed 5MB.  I'm not sure what the stripe size is at this stage,
but it is more than likely whatever the default is for the controller
(64KB?).
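
To illustrate the general idea (this is not our exact mapping, just a
hypothetical example of how the hash prefixes fan out into the
directory tree, with a made-up /data root):

  # derive the nesting from the hex digest prefixes
  hash=$(echo -n "some-file-key" | md5sum | cut -d' ' -f1)
  echo "/data/${hash:0:1}/${hash:0:2}/${hash:0:3}/${hash}"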

I think exploring XFS needs to be my first port of call here: take one
of the front ends out of production tomorrow when load has quieted
down, trash it, and rebuild it.  Then we'll more than likely need 2 or
3 weeks for the disks to fill up again with files before we can really
see how it compares.
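
If the rebuild goes ahead on the hardware RAID10, my understanding is
that the main thing is to tell XFS about the array geometry at mkfs
time, something along these lines (assuming a 64KB chunk and 4
data-bearing mirror pairs in the 8 disk RAID10; device and mount point
are placeholders, and the values obviously need to match whatever the
controller is actually set to):

  # su = controller chunk size, sw = number of data-bearing members
  mkfs.xfs -d su=64k,sw=4 /dev/sdb

  # inode64 helps with large numbers of inodes on big filesystems
  mount -o inode64 /dev/sdb /data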

If I can perhaps just get some clarity on the physical disk layouts /
configurations that you would recommend, I would appreciate it
greatly.  You're obviously not talking about a simple RAID 10 array
here, even though I think just moving to XFS over EXT4 would already
do us wonders.

Many thanks for all the responses!

--
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



