increasing stripe_cache_size decreases RAID-6 read throughput

I am new to mdadm, and I just set up an mdadm v3.1.2 RAID-6 of five 2
TB Samsung Spinpoint F3EGs. I created the RAID-6 with the default
parameters, including a 512 KB chunk size. It took about 6 hours to
initialize, then I created an XFS filesystem:

# mkfs.xfs -f -d su=512k,sw=3 -l su=256k -l lazy-count=1 -L raidvol /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=45776384 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1464843648, imaxpct=5
         =                       sunit=128    swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Note that 256k is the maximum allowed by mkfs.xfs for the log stripe unit.
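
For reference, those su/sw values are just the array geometry restated
(assuming the usual RAID-6 layout of two parity disks per stripe):

    su     = 512 KB                (the md chunk size)
    sw     = 5 - 2 = 3             (data disks per stripe)
    sunit  = 512 KB / 4 KB = 128   (in 4 KB filesystem blocks, as mkfs reports it)
    swidth = 128 * 3 = 384         (in 4 KB filesystem blocks)

which matches the sunit=128, swidth=384 in the mkfs.xfs output above.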

Then it was time to optimize performance. First I ran a benchmark
with the default settings (from a recent Arch Linux install) for the
following parameters:

# cat /sys/block/md0/md/stripe_cache_size
256

# cat /sys/block/md0/queue/read_ahead_kb
3072
# cat /sys/block/sdb/queue/read_ahead_kb
128

# cat /sys/block/md0/queue/scheduler
none
# cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]

# cat /sys/block/md0/queue/nr_requests
128
# cat /sys/block/sdb/queue/nr_requests
128

# cat /sys/block/md0/device/queue_depth
cat: /sys/block/md0/device/queue_depth: No such file or directory
# cat /sys/block/sdb/device/queue_depth
31

# cat /sys/block/md0/queue/max_sectors_kb
127
# cat /sys/block/sdb/queue/max_sectors_kb
512

Note that sdb is one of the 5 drives for the RAID volume, and the
other 4 have the same settings.
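
(For completeness: these are all plain sysfs files, so when I experiment I
just write new values at runtime, along these lines; the numbers here are
placeholders for illustration, not settings I am recommending:

# echo 4096 > /sys/block/md0/md/stripe_cache_size
# echo 1024 > /sys/block/sdb/queue/read_ahead_kb
# echo deadline > /sys/block/sdb/queue/scheduler
# echo 256 > /sys/block/sdb/queue/nr_requests

and then re-read them with cat to confirm the change took effect.)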

First question: is it normal for the md0 scheduler to be "none"? I
cannot change it by writing, e.g., "deadline", into the file.

Next question: is it normal for md0 to have no queue_depth setting?

Are there any other parameters that are important to performance that
I should be looking at?

I started the kernel with mem=1024M so that the buffer cache wasn't
too large (this machine has 72G of RAM), and ran an iozone benchmark:

    Iozone: Performance Test of File I/O
            Version $Revision: 3.338 $
        Compiled for 64 bit mode.
        Build: linux-AMD64

    Auto Mode
    Using Minimum Record Size 64 KB
    Using Maximum Record Size 16384 KB
    File size set to 4194304 KB
    Include fsync in write timing
    Command line used: iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         4194304      64  133608  114920   191367   191559    7772   14718
         4194304     128  142748  113722   165832   161023   14055   20728
         4194304     256  127493  108110   165142   175396   24156   23300
         4194304     512  136022  112711   171146   165466   36147   25698
         4194304    1024  140618  110196   153134   148925   57498   39864
         4194304    2048  137110  108872   177201   193416   98759   50106
         4194304    4096  138723  113352   130858   129940   78636   64615
         4194304    8192  140100  114089   175240   168807  109858   84656
         4194304   16384  130633  116475   131867   142958  115147  102795
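
(To spell out the command line reported above:

    iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
           -a            auto mode
           -y64K -q16M   minimum / maximum record size
           -s4G          file size
           -e            include fsync in the write timing
           -f iotest     name of the test file
           -i0 -i1 -i2   write/rewrite, read/reread, random read/write

which is where the six result columns come from.)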


I was expecting somewhat faster sequential reads, but 191 MB/s is not
too bad. I'm not sure why it drops to 130-131 MB/s at some of the larger
record sizes.

But the writes were disappointing, so the first thing I tried tuning
was stripe_cache_size:

# echo 16384 > /sys/block/md0/md/stripe_cache_size
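
(As I understand it, the stripe cache holds one page per member device per
cache entry, so this setting should cost roughly

    16384 entries * 4 KB page * 5 devices = 320 MB

of RAM, which is worth keeping in mind alongside the mem=1024M boot option I
used for these runs.)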

I re-ran the iozone benchmark:

                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         4194304      64  219206  264113   104751   108118    7240   12372
         4194304     128  232713  255337   153990   142872   13209   21979
         4194304     256  229446  242155   132753   131009   20858   32286
         4194304     512  236389  245713   144280   149283   32024   44119
         4194304    1024  234205  243135   141243   141604   53539   70459
         4194304    2048  219163  224379   134043   131765   84428   90394
         4194304    4096  226037  225588   143682   146620   60171  125360
         4194304    8192  214487  231506   135311   140918   78868  156935
         4194304   16384  210671  215078   138466   129098   96340  178073

And now the sequential writes are quite satisfactory, but the reads
are low. Next I tried 2560 for stripe_cache_size, since that is


                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write