RE: Correct RAID options

Chris,

I assume your application will handle it OK if the archive is offline? I understand you have throughput issues now, but if you lose the RAID5 setup it is going to take a long time to recover - you really need to rebuild that as RAID6 ASAP, particularly while it is only under light load.
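If the archive array is Linux md rather than hardware RAID (an assumption on my part - adjust for your controller if not), the conversion can be done in place by adding a 16th drive and reshaping, roughly like this (device names are placeholders, and expect the reshape to take days on 15 x 4TB):

    # Add a spare so usable capacity stays the same, then reshape RAID5 -> RAID6
    mdadm --add /dev/md0 /dev/sdq
    mdadm --grow /dev/md0 --level=6 --raid-devices=16 \
          --backup-file=/root/md0-reshape-backup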

Now that you have clarified what your servers are and how they are performing, another option would be an external SATA storage system - for the cost you can't go past the Promise equipment. You could whack additional SATA drives into one of these on your front end servers (they have models with different connection options - eSATA, SAS, FC, etc.), which would give you more space on the front end as well as more spindles to handle the writes. It would also give you the ability to experiment with different file systems.

These Promise systems have inbuilt RAID controllers, so you can configure them as appropriate and present them to the host as whatever layout you wish (i.e. multiple logical disks, multiple LUNs, etc.).

Craig

-----Original Message-----
From: linux-raid-owner@xxxxxxxxxxxxxxx [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Chris Knipe
Sent: Wednesday, 20 August 2014 11:25 AM
To: David Brown
Cc: linux-raid@xxxxxxxxxxxxxxx
Subject: Re: Correct RAID options

On Wed, Aug 20, 2014 at 2:22 AM, David Brown <david.brown@xxxxxxxxxxxx> wrote:

> In general, a 15 disk raid5 array is asking for trouble.  At least
> make it raid6.

At this stage the IO load on the archiver with the 15 disk RAID5 is -very- minimal. It's not even writing 8MB/s currently, as the front end RAID10 servers are obviously severely hampered whilst doing the concurrent read/write requests. Now that we are in our peak times, load averages shoot up to over 80 due to IO wait from time to time, so this is kinda critical for me right now :-(

Just a bit more background as was asked in the other replies...
Front end servers are Dell PowerEdge R720 DX150s (8 x 4TB SATA-III, 64GB RAM, and dual Xeon E5-2620 hex-core @ 2.00GHz). The archiver is custom built (no brand name) and consists of 15 x 4TB SATA-II drives, 32GB RAM, and a single Xeon E3-1245 quad-core @ 3.3GHz.

The archiver is newly added, so I can't really comment at this stage on how it is performing, as it is not getting any real work from the front ends yet. During our standard benchmarking (hdparm / dd / bonnie) with no IO load on the archiver, performance was more than adequate.
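For reference, the baseline tests were along these lines (illustrative only - exact devices, sizes and mount points are from memory):

    hdparm -tT /dev/md0                      # raw/cached sequential reads
    dd if=/dev/zero of=/archive/testfile bs=1M count=16384 oflag=direct
    bonnie++ -d /archive -u root             # mixed create/read/seek workload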

In terms of the front ends, with our "normal" load distribution of a 70/30 split between writes and reads, there are no serious performance problems. With over 500 concurrent application threads per server accessing the files on the disks, load averages are generally in the 3 to 5 range, with very minimal IO wait. Averaging it all out, Munin reports "disk utilization" between 20% and 30%, "disk latency" below 100ms, and "disk throughput" at about 30MB/s.

Since we've now started to move data from the front ends to the archiver, we have obviously thrown the 70/30 split out of the window, and all stats are basically off the charts: "disk utilization" is averaging between 90% and 100%. The reading of the data from the front end servers is clearly the bottleneck, and I can confirm this because as soon as we stop the archiving process that reads the data on the front ends and writes it to the archiver, the load on the servers returns to normal.

In terms of adding more front end servers - it is definitely an option, yes. Being brand name servers they do come at a premium, however, so I would ideally like to keep this as a last resort. The premium cost, together with the limited storage capacity, basically made us opt to offload some of the storage requirements to cheaper alternatives (more than double the capacity, even at RAID10, for less than half the price - realistically, we will be more than happy with half the performance as well, so I'm not expecting miracles either).

RAID rebuilds are already problematic on the front end servers (RAID 10 over 8 x 4TB): a single drive failure whilst the server is under load takes approximately 8 hours to rebuild, if memory serves me correctly. We've had a few failures in the past (even a double drive failure at the same time), but nothing recent that I can recall accurately.

I wasn't aware that bigger block sizes would increase read performance though - this is interesting and something I can definitely explore. I stand to be corrected, but I believe the MegaRAIDs we're using can go even bigger than 1MB blocks; I'll have to check on this. Bigger blocks do mean wasting more space though if the files written are smaller and can't fill up an entire block, right? I suppose when you start talking about 12TB and 50TB arrays, the amount of wasted space really becomes insignificant, or am I mistaken?
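I'll pull the current settings with something like the following (assuming the usual LSI MegaCli utility is available - the binary name varies between MegaCli, MegaCli64 and storcli):

    MegaCli64 -LDInfo -Lall -aALL    # reports "Strip Size" and cache policy per logical drive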

A SAN is unfortunately out of the question, as this is hosted infrastructure at a provider that does not offer SANs as part of their product offerings.


> But the general idea is to have a set of raid1 mirrors (or possibly
> Linux md raid10,far2 pairs if the traffic is read-heavy), and then tie
> them all together using a linear concatenation rather than raid0
> stripes.  When you

Can I perhaps ask you to elaborate a bit on what you mean by linear concatenation? I am presuming you are not referring to RAID 10 per se here, given your comment to use this rather than RAID 0 stripes. XFS by itself is also a good option - I honestly do not know why this wasn't given consideration when we initially set the machines up. By the sound of it, all of the machines are now going to be facing a rebuild.
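If I follow correctly, the layout would be something along these lines - a rough sketch with placeholder device names, several mirror pairs glued together with md linear and XFS on top so the allocation groups get spread across the pairs:

    # Mirror pairs (raid1 or raid10,f2) - device names are placeholders
    mdadm --create /dev/md1 --level=10 --layout=f2 --raid-devices=2 /dev/sda /dev/sdb
    mdadm --create /dev/md2 --level=10 --layout=f2 --raid-devices=2 /dev/sdc /dev/sdd
    mdadm --create /dev/md3 --level=10 --layout=f2 --raid-devices=2 /dev/sde /dev/sdf
    mdadm --create /dev/md4 --level=10 --layout=f2 --raid-devices=2 /dev/sdg /dev/sdh

    # Tie the pairs together with a linear concatenation instead of raid0
    mdadm --create /dev/md10 --level=linear --raid-devices=4 \
          /dev/md1 /dev/md2 /dev/md3 /dev/md4

    # XFS with an allocation group count that is a multiple of the number
    # of pairs, so directories (and their files) get spread across them
    mkfs.xfs -d agcount=4 /dev/md10

Is that roughly it, or have I misread the suggestion?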

> I am assuming your files are fairly small - if your reads or writes
> are often smaller than a full stripe of raid10 or raid5, performance
> will suffer greatly compared to XFS on a linear concat.

The files are VERY evenly distributed using md5 hashes. We have 16 top level directories, 255 second level directories, and 4094 third level directories. Each third level directory currently holds between 4K and 4.5K files (the archiver servers should have roughly three or four times that amount once the disks are full). Files are generally between 250KB and 750KB, a small percentage are a bit larger, in the 1.5MB range, and I can almost guarantee that not one single file will exceed 5MB. I'm not sure what the stripe size is at this stage, but it is more than likely whatever the default is for the controller (64KB?).
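For completeness, the placement logic is roughly the following (a simplified sketch - the path and the exact digit split are illustrative only):

    # Hypothetical sketch: directory levels taken from the leading hex
    # digits of the file's md5
    h=$(md5sum "$file" | awk '{print $1}')
    dir="/data/${h:0:1}/${h:1:2}/${h:3:3}"
    mkdir -p "$dir" && mv "$file" "$dir/"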

I think exploring XFS needs to be my first port of call here: take one of the front ends out of production tomorrow when load has quietened down, trash it, and rebuild it. Then we'll more than likely need 2 or 3 weeks for the disks to fill up with files again before we really see how it compares.
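Assuming the controller keeps presenting a single RAID10 volume (8 drives, i.e. 4 data-bearing members) with 64KB chunks, the XFS side of the rebuild would presumably be something like this (su/sw need to match whatever the controller actually reports, and the device and mount point are placeholders):

    # Align XFS to the hardware RAID10 stripe: su = controller chunk size,
    # sw = number of data-bearing members
    mkfs.xfs -d su=64k,sw=4 /dev/sdb
    mount -o noatime /dev/sdb /data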

If I can perhaps just get some clarity in terms of the physical disk layouts / configurations that you would recommend, I would appreciate it greatly. You're obviously not talking about a simple RAID 10 array here, even though I think just moving from ext4 to XFS would already do us wonders.

Many thanks for all the responses!

--
Chris.
