Re: Software RAID checksum performance on 24 disks not even close to kernel reported

On Tue, Jun 5, 2012 at 1:25 PM, Peter Grandi <pg@xxxxxxxxxxxxxxxxxxxx> wrote:
> It does not change much of the conclusions as to the (euphemism)
> audacity of your conclusions), but you have created a 21+2 RAID6
> set, as the 24th block device is a spare:
>
>  seq 24 | parallel -X --tty mdadm --create --force /dev/md0 -c $CHUNK --level=6 --raid-devices=23 -x 1 /dev/loop{}

That is correct. It reflects the setup of the 24 physical drives.
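
(For the record, the layout can be checked with:

  mdadm --detail /dev/md0

which should report 23 raid devices plus 1 spare.)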

>>>> I get 400 MB/s write and 600 MB/s read. It seems to be due
>>>> to checksuming, as I have a single process (md0_raid6)
>>>> taking up 100% of one CPU.
>
> [ ... ]
>
>> The 900 MB/s was based on my old controller. I re-measured
>> using my new controller and get closer to 2000 MB/s in raw
>> (non-RAID) performance, which is close to the theoretical
>> maximum for that controller (2400 MB/s). This indicated that
>> hardware is not a bottleneck.
>
> A 21+2 drive RAID6 set is (euphemism) brave, and perhaps it
> matches the (euphemism) strategic insight that only checksumming
> withing MD could account for 100% CPU time in a single threaded
> way.

It is not a guess that md0_raid6 takes up 100% of one core: it is
reported by 'top'.

But maybe you are right: the 100% CPU that md0_raid6 uses could be due
to something other than checksumming. The tests do, however, clearly
show that the chunk size has a huge impact on the amount of CPU time
md0_raid6 needs.
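
If you want to double check the CPU usage of md0_raid6 on your own
system, something like this should work (assuming the sysstat package
is installed for pidstat):

  # look at the md0_raid6 kernel thread's CPU usage with top in threads mode
  top -H -p $(pgrep -x md0_raid6)
  # or sample it once per second with pidstat
  pidstat -p $(pgrep -x md0_raid6) 1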

> But as a start you could try running your (euphemism) "test"
> with O_DIRECT:
>
>  http://www.sabi.co.uk/blog/0709sep.html#070919
>
> While making sure that the IO is stripe aligned (21 times the
> chunk size).

It is unclear to me how to change the timed part of the test script to
use O_DIRECT and make it stripe aligned:

seq 10 | time parallel mkdir -p /mnt/md0/{}\;tar -x -C /mnt/md0/{} -f linux.tar\; sync
seq 10 | time parallel mkdir -p /mnt/md0/{}\;cp linux.tar /mnt/md0/{} \; sync
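
The closest I can come up with for the big-file case is something
along these lines (only a sketch, untested; it assumes GNU dd, that
$CHUNK is the chunk size in KiB, and that the size of linux.tar is a
multiple of the stripe width):

  # stripe width = 21 data disks * chunk size, in bytes
  STRIPE=$((21 * CHUNK * 1024))
  seq 10 | time parallel mkdir -p /mnt/md0/{}\; dd if=linux.tar of=/mnt/md0/{}/linux.tar bs=$STRIPE oflag=direct

but I do not see how to do the same for the tar extraction.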

Please advise.

> Your (euphemism) tests could also probably benefit from more
> care about (euphemism) details like commit semantics, as the use
> of 'sync' in your scripts seems to me based on (euphemism)
> unconventional insight, for example this:
>
>  «seq 10 | time parallel mkdir -p /mnt/md0/{}\;tar -x -C /mnt/md0/{} -f linux.tar\; sync»

Feel free to substitute with:

seq 10 | time parallel mkdir -p /mnt/md0/{}\;tar -x -C /mnt/md0/{} -f linux.tar
time sync

With this version you will have to add the two reported durations together.

With that modification I get:

Chunk size (KiB)  Time to copy 10 Linux kernel sources  Time to copy 10 Linux kernel sources
                  as files                              as a single tar file
16                29s                                   13s
32                28s                                   11s
64                29s                                   13s
128               34s                                   10s
256               41s                                   11s
4096              1m35s                                 2m15s (!)

Most numbers are comparable to the original results:
http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html

The 2m15s result for the 4096 KiB big-file test was a bit surprising,
so I re-ran that test and got 2m36s.

> But also more divertingly:
>
>  «seq 24 | parallel dd if=/dev/zero of=tmpfs/disk{} bs=500k count=1k
>  seq 24 | parallel losetup /dev/loop{} tmpfs/disk{}
>  sync
>  sleep 1;
>  sync»

Are you aware that this part is for the setup of the test? It is not
the timed section and thus it does not affect the validity of the
test.

> and even:
>
>  «mount /dev/md0 /mnt/md0
>  sync»

Yeah, that part was a bit weird, but I had one run where the script
failed without the 'sync'. And again: are you aware that this part is
for the setup of the test? It is not the timed section and thus does
not change the validity of the test.

> Perhaps you might also want to investigate the behaviour of
> 'tmpfs' and 'loop' devices, as it seems quite (euphemism)
> creative to me to have RAID set member block devices as 'loop's
> over 'tmpfs' files:
>
>  «mount -t tmpfs tmpfs tmpfs
>  seq 24 | parallel dd if=/dev/zero of=tmpfs/disk{} bs=500k count=1k
>  seq 24 | parallel losetup /dev/loop{} tmpfs/disk{}»

How would YOU design the test so that:

* it is reproducible for others?
* it does not depend on controllers and disks?
* it uses 24 devices?
* it uses different chunk sizes?
* it tests both big and small file performance?
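
(The only alternative I can come up with that still avoids real
controllers and disks would be kernel ramdisks via the brd module,
roughly like this (a sketch only; assumes brd is built as a module and
that there is enough RAM for 24 x 512 MiB):

  # 24 ramdisks of 512 MiB each; rd_size is in KiB
  modprobe brd rd_nr=24 rd_size=524288
  seq 0 23 | parallel -X --tty mdadm --create --force /dev/md0 -c $CHUNK --level=6 --raid-devices=23 -x 1 /dev/ram{}

but I do not see how that would be better than loop devices over
tmpfs.)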

> Put another way, most aspects of your (euphemism) tests seem to
> me rather (euphemism) imaginative.

Did you run the test script? What were your numbers? Did md0_raid6
take up 100% of one core during the copy? And if so: can you explain
why md0_raid6 would take up 100% of one core?


/Ole
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

