Re: Slow disks.

Greg Freemyer <greg.freemyer@xxxxxxxxx> · Sun, 26 Dec 2010 18:05:05 -0500

On Fri, Dec 24, 2010 at 6:40 AM, Rogier Wolff <R.E.Wolff@xxxxxxxxxxxx> wrote:
> On Thu, Dec 23, 2010 at 05:09:43PM -0500, Greg Freemyer wrote:
>> On Thu, Dec 23, 2010 at 2:10 PM, Jaap Crezee <jaap@xxxxxx> wrote:
>> > On 12/23/10 19:51, Greg Freemyer wrote:
>> >> On Thu, Dec 23, 2010 at 12:47 PM, Jeff Moyer<jmoyer@xxxxxxxxxx>  wrote:
>> >> I suspect a mailserver on a raid 5 with large chunksize could be a lot
>> >> worse than 2x slower.  But most of the blame is just raid 5.
>> >
>> > Hmmm, well if this really is so.. I use raid 5 to not "spoil" the storage
>> > space of one disk. I am using some other servers with raid 5 md's which
>> > seems to be running just fine; even under higher load than the machine we
>> > are talking about.
>> >
>> > Looking at the vmstat block io the typical load (both write and read) seems
>> > to be less than 20 blocks per second. Will this drop the performance of the
>> > array (measured by dd if=/dev/md<x> of=/dev/null bs=1M) below 3MB/secs?
>> >
>>
>> You clearly have problems more significant than your raid choice, but
>> hopefully you will find the below informative anyway.
>>
>> ====
>>
>> The above is a meaningless performance tuning test for an email server,
>> but assuming it was a useful test for you:
>>
>> With bs=1MB you should have optimum performance with a 3-disk raid5
>> and 512KB chunks.
>>
>> The reason is that a full raid stripe for that holds 1MB of data (512K data +
>> 512K data = 1024K data, plus 512K parity)
>>
>> So the raid software should see that as a full stripe update and not
>> have to read in any of the old data.
>>
>> Thus at the kernel level it is just:
>>
>> write data1 chunk
>> write data2 chunk
>> write parity chunk
>>
>> All those should happen in parallel, so a raid 5 setup for 1MB writes
>> is actually just about optimal!
>
> You are assuming that the kernel is blind and doesn't do any
> readaheads. I've done some tests and even when I run dd with a
> blocksize of 32k, the average request sizes that are hitting the disk
> are about 1000k (or 1000 sectors; I don't know what units that column
> is in when I run with the -k option).

dd is not a benchmark tool.

You are building an email server that does 4KB random writes.
Performance testing / tuning with dd is of very limited use.

For your load, read ahead is pretty much useless!
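
To make the cost of a small write concrete, here is a rough Python sketch of
the textbook raid 5 parity update for a write smaller than a stripe. The
function name and the 4KB block size are my own illustration, not mdraid code,
and a real implementation may pick a different strategy (for example
recomputing parity from the other data chunks).

```python
import os

def rmw_small_write(old_data, old_parity, new_data):
    """Textbook RAID 5 read-modify-write for one small block.

    Returns (new_parity, disk_ios). Hypothetical helper, not mdraid code.
    """
    # new parity = old parity XOR old data XOR new data
    new_parity = bytes(p ^ o ^ n
                       for p, o, n in zip(old_parity, old_data, new_data))
    # Disk operations: read old data, read old parity,
    # write new data, write new parity.
    return new_parity, 4

# 3-disk example: two 4KB data blocks and their parity block.
d1 = os.urandom(4096)
d2 = os.urandom(4096)
parity = bytes(a ^ b for a, b in zip(d1, d2))

new_d1 = os.urandom(4096)
new_parity, ios = rmw_small_write(d1, parity, new_d1)

# The updated parity matches what a full recompute would give.
assert new_parity == bytes(a ^ b for a, b in zip(new_d1, d2))
```

So in this simple model one 4KB application write turns into four disk
operations, two of which are reads that have to complete first.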

> So your argument that "it fits exactly when your blocksize is 1M, so
> it is obvious that 512k blocksizes are optimal" doesn't hold water.

If you were doing a real i/o benchmark, then 1MB random writes
perfectly aligned to the Raid stripes would be perfect.  Raid really
needs to be designed around the i/o pattern, not just optimizing dd.
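
As a rough illustration of what "aligned to the Raid stripes" means, the
Python sketch below counts how many stripes a write covers completely and how
many it only partly touches. The function names are hypothetical and the
layout is simplified (it ignores parity rotation and any partition offset);
it assumes length > 0.

```python
KIB = 1024

def stripe_data_bytes(chunk_bytes, n_disks):
    # RAID 5 spends one chunk per stripe on parity.
    return chunk_bytes * (n_disks - 1)

def split_write(offset, length, chunk_bytes, n_disks):
    """Return (full_stripes, partial_stripes) touched by a write."""
    stripe = stripe_data_bytes(chunk_bytes, n_disks)
    end = offset + length
    first = offset // stripe
    last = (end - 1) // stripe
    full = partial = 0
    for s in range(first, last + 1):
        s_start = s * stripe
        s_end = s_start + stripe
        if offset <= s_start and end >= s_end:
            full += 1      # whole stripe rewritten, no old data needed
        else:
            partial += 1   # needs old data or parity read back first
    return full, partial

# 3 disks, 512K chunks: 1MB of data per stripe.
print(split_write(0, 1024 * KIB, 512 * KIB, 3))          # aligned 1MB write
print(split_write(512 * KIB, 1024 * KIB, 512 * KIB, 3))  # same size, misaligned
print(split_write(4 * KIB, 4 * KIB, 512 * KIB, 3))       # 4KB mail-style write
```

Only the first case is a pure full-stripe write. The misaligned 1MB write
touches two stripes partially, and the 4KB write is always a partial-stripe
update.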

<snip>

>> Anything smaller than a 1 stripe write is where the issues occur,
>> because then you have the read-modify-write cycles.
>
> Yes. But still they shouldn't be as heavy as we are seeing.  Besides
> doing the "big searches" on my 8T array, I also sometimes write "lots
> of small files". I'll see how many I can manage on that server....

<snip>
>
> You're repeating what WD says about their enterprise drives versus
> desktop drives. I'm pretty sure that they believe what they are saying
> to be true. And they probably have done tests to see support for their
> theory. But for Linux it simply isn't true.

What kernel are you talking about?  mdraid has seen major improvements
in this area in the last 2 or 3 years or so.  Are you using an old kernel
by chance?  Or reading old reviews?

> We see MUCH too often raid arrays that lose a drive evict it from the
> RAID and everything keeps on working, so nobody wakes up. Only after a
> second drive fails, things stop working and the data recovery company
> gets called into action. Often we have a drive with a few bad blocks
> and months-old data, and a totally failed drive which is necessary for
> a full recovery. It's much better to keep the failed/failing drive in
> the array and up-to-date during the time that you're pushing the
> operator to get it replaced.
>
>        Roger.

The linux-raid mailing list is very helpful.  If you're seeing
problems, ask for help there.

What you're describing simply sounds wrong.  (At least for mdraid, which
is what I assume you are using.)
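
On the "nobody wakes up" problem: a degraded mdraid array shows up in
/proc/mdstat as a "_" in the status map (e.g. [_UU]), so it is easy to check
for. Below is a minimal Python sketch, assuming the usual mdstat layout. The
function name and the sample text are made up for illustration, and mdadm's
own monitor mode is the more complete way to get alerts.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status map shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\S+)\s*:", line)
        if m:
            current = m.group(1)   # start of a new array entry
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded

# Hypothetical sample; on a real box read /proc/mdstat instead.
SAMPLE = """\
Personalities : [raid5]
md0 : active raid5 sdc1[2] sdb1[1] sda1[0]
      1953262592 blocks level 5, 512k chunk, algorithm 2 [3/3] [UUU]
md1 : active raid5 sdf1[2] sde1[1]
      1953262592 blocks level 5, 512k chunk, algorithm 2 [3/2] [_UU]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

Run something like this from cron against the contents of /proc/mdstat and
mail the result if the list is not empty.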

Greg