On Thu, 21 Aug 2008, Peter Grandi wrote:
>> [ ... ] For the 3 x 4disk raid0s the values were ~390MB
>> Writeback and ~14MB Dirty. Aggregate write rate 690MB/sec.
>> For the 1 x 12disk raid0 just ~14MB Writeback and ~190MB
>> Dirty. Write rate 473 MB/sec. [ ... ]
>> There was an old Microsoft research article from 2004/2005:
>> http://research.microsoft.com/barc/Sequential_IO/
>> Linear scaling with more drives. Versus kernel 2.6.22/.24 on our test
>> machine, where the rate already saturates with ~5 drives in raid0. I'm
>> not familiar enough with the md source to know what to fix.
> That coincides with my own experience. But note that it is a very
> special case, where optimum speed is reached only if requests hit
> the member block devices in exactly the "right" way. In most
> workloads latency also is a big part, so the ability to issue long
> sequential streams of back-to-back requests is less important.
Data streaming is quite commonplace; the kernel should be able to handle
it efficiently.
Getting requests to hit the member devices the "right" way should be
deterministic and achievable with the correct settings. I have already
tried a few tunings, to no avail.
Specifically:
QT=64
NCQD=8
for drv in "$@"; do
    # CFQ dispatch quantum (only exists while CFQ is still the active elevator)
    echo $QT > /sys/block/${drv}/queue/iosched/quantum
    # allow requests as large as the hardware supports
    cat /sys/block/${drv}/queue/max_hw_sectors_kb > /sys/block/${drv}/queue/max_sectors_kb
    # 16384 sectors = 8 MiB readahead
    blockdev --setra 16384 /dev/${drv}
    echo "noop" > /sys/block/${drv}/queue/scheduler
    # NCQ depth on the drive itself
    echo "${NCQD}" > /sys/block/${drv}/device/queue_depth
    # nr_requests lives under queue/, not device/
    echo "$(($NCQD * 2))" > /sys/block/${drv}/queue/nr_requests
done
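The loop is run as a script taking the member device names as arguments;
a hypothetical invocation for the 12-disk set used below (the script name
is just for illustration, it needs root for the sysfs writes):

sudo sh tune-drives.sh sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm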
On average this doubles the sequential read performance, but it does
nothing for sequential writes.
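(For reference, the read side was measured with plain sequential dd reads
of the array, roughly along these lines; the block size and count here are
purely illustrative:)

dd if=/dev/md0 of=/dev/null bs=2048k count=20000    # ~40 GB sequential read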
Set "nop" elevator on the md slaves. No difference.
Additionally I tried telling the kernel that sequential I/O is expected
(madvise, posix_fadvise), tried bypassing some of the caching with
O_DIRECT, even tested mmap() and msync(), and tuned /proc/sys/vm/*
settings. All of this was raw I/O straight to /dev/mdX.
But the kernel or md or pdflush do not "get" any of the obvious hints that
they could dispatch requests optimally.
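For the record, the /proc/sys/vm knobs touched were of this kind (the
values here are only illustrative, not a recommendation):

echo 5   > /proc/sys/vm/dirty_background_ratio    # start background writeback earlier
echo 10  > /proc/sys/vm/dirty_ratio               # cap dirty memory sooner
echo 500 > /proc/sys/vm/dirty_expire_centisecs    # age out dirty pages faster
echo 100 > /proc/sys/vm/dirty_writeback_centisecs # wake pdflush more often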
About elevators: MD itself has no "input side" elevator setting, AFAICT.
Perhaps that is the problem?
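A quick way to confirm this is to compare sysfs for a member disk and for
the md device itself (device names as in the tests below); at least on
this kernel the md device exposes no elevator knob:

cat /sys/block/sdb/queue/scheduler      # member disk: available elevators, active one in []
ls /sys/block/md0/queue/ 2>/dev/null \
    || echo "md0 exposes no queue/ attributes"   # and in any case no scheduler / iosched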
> Overall the current Linux block layer instead of being based on
> streaming seems to be based on batching of requests, and this
> interacts poorly with MD, as the batch sizes and their passing-on
> timing might not be those that can best keep several MD member
> block devices busy.
Sounds plausible.
> To see the detailed effects of all these layers of "brilliant"
> policies, IO rates on individual devices should be looked at, and
> I use something like this command line (with a very tall terminal
> window):
>
>   watch -n2 iostat -d -m /dev/mdXXX /dev/sdNNN /dev/sdMMM ... 1 2
Hmm, I tried it. With a single 4-disk raid0, 2048k chunk, all disks behind
the same PMP:
$ grep MemTotal /proc/meminfo
MemTotal: 4044384 kB
screen1$ dd if=/dev/zero of=/dev/md0 bs=2048k
screen2$ watch -n2 iostat -d -m /dev/sd{b,c,d,e} 1 2
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 865.00 0.00 45.80 0 45
sdc 881.00 0.00 45.95 0 45
sdd 881.00 0.00 45.87 0 45
sde 898.00 0.00 45.55 0 45
With the 12-disk raid0, which with linear scaling should do 12 * 46 MB/s,
or 552 MB/s:
screen1$ dd if=/dev/zero of=/dev/md0 bs=2048k
screen2$ watch -n2 iostat -d -m /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m} 1 2
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 473.00 0.00 17.71 0 17
sdc 513.00 0.00 18.88 0 18
sdd 495.00 0.00 19.41 0 19
sde 520.00 0.00 19.12 0 19
sdf 536.00 0.00 19.37 0 19
sdg 479.00 0.00 19.39 0 19
sdh 489.00 0.00 19.10 0 19
sdi 440.00 0.00 19.43 0 19
sdj 468.00 0.00 18.68 0 18
sdk 428.00 0.00 17.40 0 17
sdl 404.00 0.00 17.58 0 17
sdm 464.00 0.00 17.14 0 17
screen1$ # 42960158720 bytes (43 GB) copied, 210.794 s, 204 MB/s
So the rates are quite evenly distributed, within maybe ±1.5 MB/s of the
mean.
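(A quick way to put a number on the spread while the dd is running is to
feed the same kind of iostat sample through awk; the "12" below matches
the number of member disks, so the first, since-boot report is skipped:)

iostat -d -m /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m} 5 2 \
  | awk '/^sd/ { i++; if (i <= 12) next     # skip the first (since-boot) report
                 n++; s += $4               # $4 = MB_wrtn/s
                 if (n == 1 || $4 > mx) mx = $4
                 if (n == 1 || $4 < mn) mn = $4 }
         END   { printf "disks=%d  mean=%.2f  min=%.2f  max=%.2f MB/s\n", n, s/n, mn, mx }'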
> My experience is that often the traffic is not quite evenly
> balanced across the member drives, and when it is, the rate
> sometimes is well below the one at which the member device
> can operate.
Perhaps what I see here is yet another problem.
Now, things get curious (and fast) with O_DIRECT and a block size of up to
half of the available 4 GB of memory. Without O_DIRECT the rates always
stay quite low.
screen1$ dd if=/dev/zero of=/dev/md0 bs=16M oflag=direct
screen2$ watch -n2 iostat -d -m /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m} 1 2
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 56.00 0.00 28.00 0 28
...
sdm 52.00 0.00 26.00 0 26
screen1$ # 14612955136 bytes (15 GB) copied, 44.7189 s, 327 MB/s
screen1$ sudo dd if=/dev/md0 of=/dev/null bs=16M iflag=direct
screen1$ # 14092861440 bytes (14 GB) copied, 31.681 s, 445 MB/s
------------------------------------------------------------------
screen1$ dd if=/dev/zero of=/dev/md0 bs=64M oflag=direct
screen2$ watch -n2 iostat -d -m /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m} 1 2
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 70.00 0.00 35.00 0 35
...
sdm 65.00 0.00 34.50 0 34
screen1$ # 12683575296 bytes (13 GB) copied, 28.4022 s, 447 MB/s
screen1$ sudo dd if=/dev/md0 of=/dev/null bs=64M iflag=direct
screen1$ # 79054241792 bytes (79 GB) copied, 128.795 s, 614 MB/s
------------------------------------------------------------------
screen1$ dd if=/dev/zero of=/dev/md0 bs=512M oflag=direct
screen2$ watch -n2 iostat -d -m /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m} 1 2
sdb 25.00 0.00 42.00 0 42
...
sdm 23.00 0.00 42.00 0 42
screen1$ # 25769803776 bytes (26 GB) copied, 53.3836 s, 483 MB/s
screen1$ sudo dd if=/dev/md0 of=/dev/null bs=512M iflag=direct
screen1$ # 23622320128 bytes (24 GB) copied, 36.4848 s, 647 MB/s
------------------------------------------------------------------
screen1$ sudo dd if=/dev/zero of=/dev/md0 bs=1024M oflag=direct
screen2$ watch -n2 iostat -d -m /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m} 1 2
sdb 17.00 0.00 24.50 0 24
...
sde 14.00 0.00 19.50 0 19
sdf 17.00 0.00 24.50 0 24
...
sdk 14.00 0.00 11.50 0 11
sdl 15.00 0.00 26.50 0 26
sdm 15.00 0.00 26.50 0 26
screen1$ # the rates jump around between 20 MB/s and 60 MB/s
screen1$ # 20401094656 bytes (20 GB) copied, 41.9148 s, 487 MB/s
screen1$ sudo dd if=/dev/md0 of=/dev/null bs=1024M iflag=direct
screen1$ # 23622320128 bytes (24 GB) copied, 36.4848 s, 647 MB/s
------------------------------------------------------------------
screen1$ dd if=/dev/zero of=/dev/md0 bs=2048M oflag=direct
screen2$ watch -n2 iostat -d -m /dev/sd{b,c,d,e,f,g,h,i,j,k,l,m} 1 2
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 72.00 0.00 53.00 0 53
...
sdm 74.00 0.00 52.00 0 52
screen1$ # the above happens in bursts and there are pauses with 0MB/s
screen1$ # 19327315968 bytes (19 GB) copied, 41.8337 s, 462 MB/s
screen1$ sudo dd if=/dev/md0 of=/dev/null bs=2048M iflag=direct
screen1$ # 27917234176 bytes (28 GB) copied, 46.4147 s, 601 MB/s
I guess large writes "force" something in the vm...
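(To make the sweep above repeatable, the individual runs could be wrapped
in a small script like this; it writes raw to the array just like the
tests above, so it is destructive, and the ~8 GiB per run is an arbitrary
choice:)

#!/bin/sh
# sweep O_DIRECT write block sizes against the md device used above
MD=/dev/md0
for bs in 2 16 64 512 1024 2048; do       # block size in MiB
    count=$((8192 / bs))                  # ~8 GiB written per run
    echo "=== bs=${bs}M ==="
    dd if=/dev/zero of=$MD bs=${bs}M count=$count oflag=direct 2>&1 | tail -n 1
done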
- Jan