Re: increasing stripe_cache_size decreases RAID-6 read throughput

On Sat, 24 Apr 2010 16:36:20 -0700
Joe Williams <jwilliams315@xxxxxxxxx> wrote:

> I am new to mdadm, and I just set up an mdadm v3.1.2 RAID-6 of five 2
> TB Samsung Spinpoint F3EGs. I created the RAID-6 with the default
> parameters, including a 512 KB chunk size. It took about 6 hours to
> initialize, then I created an XFS filesystem:
> 
> # mkfs.xfs -f -d su=512k,sw=3 -l su=256k -l lazy-count=1 -L raidvol /dev/md0
> meta-data=/dev/md0               isize=256    agcount=32, agsize=45776384 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=1464843648, imaxpct=5
>          =                       sunit=128    swidth=384 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> Note that 256k is the maximum allowed by mkfs.xfs for the log stripe unit.
> 
> Then it was time to optimize the performance. First I ran a benchmark
> with the default settings (from a recent Arch linux install) for the
> following parameters:
> 
> # cat /sys/block/md0/md/stripe_cache_size
> 256
> 
> # cat /sys/block/md0/queue/read_ahead_kb
> 3072

2 full stripes - that is right.
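
For the record, the arithmetic (assuming the default 512k chunk mentioned above):

    3 data disks * 512k chunk = 1536k per stripe
    2 stripes    * 1536k      = 3072k of read-ahead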

> # cat /sys/block/sdb/queue/read_ahead_kb
> 128

This number is completely irrelevant.  Only the read_ahead_kb of the device
that the filesystem sees is used.
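
If you do want to experiment with read-ahead, it is md0's value you would
change - just a sketch, pick whatever size you want to test:

    # echo 6144 > /sys/block/md0/queue/read_ahead_kb

or equivalently with blockdev, which takes 512-byte sectors:

    # blockdev --setra 12288 /dev/md0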

> 
> # cat /sys/block/md0/queue/scheduler
> none
> # cat /sys/block/sdb/queue/scheduler
> noop deadline [cfq]
> 
> # cat /sys/block/md0/queue/nr_requests
> 128
> # cat /sys/block/sdb/queue/nr_requests
> 128
> 
> # cat /sys/block/md0/device/queue_depth
> cat: /sys/block/md0/device/queue_depth: No such file or directory
> # cat /sys/block/sdb/device/queue_depth
> 31
> 
> # cat /sys/block/md0/queue/max_sectors_kb
> 127
> # cat /sys/block/sdb/queue/max_sectors_kb
> 512
> 
> Note that sdb is one of the 5 drives for the RAID volume, and the
> other 4 have the same settings.
> 
> First question, is it normal for the md0 scheduler to be "none"? I
> cannot change it by writing, eg., "deadline" into the file.
> 

Because software RAID is not a disk drive, it does not use an elevator and so
does not use a scheduler.
The whole 'queue' directory really shouldn't appear for md devices but for
some very boring reasons it does.
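
If you want to play with elevators, do it on the member disks instead - for
example (sdc-sdf here are just placeholders for your other four members):

    # for d in sdb sdc sdd sde sdf; do echo deadline > /sys/block/$d/queue/scheduler; done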


> Next question, is it normal for md0 to have no queue_depth setting?

Yes.  The stripe_cache_size is conceptually a similar thing, but only
at a very abstract level.

> 
> Are there any other parameters that are important to performance that
> I should be looking at?

No.

> 
> I started the kernel with mem=1024M so that the buffer cache wasn't
> too large (this machine has 72G of RAM), and ran an iozone benchmark:
> 
>     Iozone: Performance Test of File I/O
>             Version $Revision: 3.338 $
>         Compiled for 64 bit mode.
>         Build: linux-AMD64
> 
>     Auto Mode
>     Using Minimum Record Size 64 KB
>     Using Maximum Record Size 16384 KB
>     File size set to 4194304 KB
>     Include fsync in write timing
>     Command line used: iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
>     Output is in Kbytes/sec
>     Time Resolution = 0.000001 seconds.
>     Processor cache size set to 1024 Kbytes.
>     Processor cache line size set to 32 bytes.
>     File stride size set to 17 * record size.
>                                                             random  random
>               KB  reclen   write rewrite    read    reread    read   write
>          4194304      64  133608  114920   191367   191559    7772   14718
>          4194304     128  142748  113722   165832   161023   14055   20728
>          4194304     256  127493  108110   165142   175396   24156   23300
>          4194304     512  136022  112711   171146   165466   36147   25698
>          4194304    1024  140618  110196   153134   148925   57498   39864
>          4194304    2048  137110  108872   177201   193416   98759   50106
>          4194304    4096  138723  113352   130858   129940   78636   64615
>          4194304    8192  140100  114089   175240   168807  109858   84656
>          4194304   16384  130633  116475   131867   142958  115147  102795
> 
> 
> I was expecting a little faster sequential reads, but 191 MB/s is not
> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
> record sizes.

I don't know why it would decrease either.  For sequential reads, read-ahead
should be scheduling all the read requests and the actual reads should just
be waiting for the read-ahead to complete.  So there shouldn't be any
variability - clearly there is.  I wonder if it is an XFS thing....
care to try a different filesystem for comparison?  ext3?
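
Something along these lines would do for a quick comparison - note it wipes
md0, and /mnt/test is just a made-up mount point:

    # umount /dev/md0
    # mkfs.ext3 /dev/md0
    # mount /dev/md0 /mnt/test
    # cd /mnt/test && iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2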


> 
> But the writes were disappointing. So the first thing I tried tuning
> was stripe_cache_size
> 
> # echo 16384 > /sys/block/md0/md/stripe_cache_size
> 
> I re-ran the iozone benchmark:
> 
>                                                             random  random
>               KB  reclen   write rewrite    read    reread    read   write
>          4194304      64  219206  264113   104751   108118    7240   12372
>          4194304     128  232713  255337   153990   142872   13209   21979
>          4194304     256  229446  242155   132753   131009   20858   32286
>          4194304     512  236389  245713   144280   149283   32024   44119
>          4194304    1024  234205  243135   141243   141604   53539   70459
>          4194304    2048  219163  224379   134043   131765   84428   90394
>          4194304    4096  226037  225588   143682   146620   60171  125360
>          4194304    8192  214487  231506   135311   140918   78868  156935
>          4194304   16384  210671  215078   138466   129098   96340  178073
> 
> And now the sequential writes are quite satisfactory, but the reads
> are low. Next I tried 2560 for stripe_cache_size, since that is the 512KB x 5
> stripe width.


That is very weird, as reads don't use the stripe cache at all, as long as
the array is not degraded and no overlapping writes are happening.

And the stripe_cache_size is measured in pages per device.  So 2560 means
2560*4k for each device.  There are 3 data devices, so 30720K of data - 60
chunks, or 20 full stripes.

When you set stripe_cache_size to 16384, it would have consumed
 16384*5*4K == 320Meg
or 1/3 of your available RAM.  This might have affected throughput,
I'm not sure.
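
Spelled out, the memory consumed is stripe_cache_size * 4K * number-of-devices,
so with your 5 devices:

      256 * 4K * 5 =   5M   (the default)
     2560 * 4K * 5 =  50M
    16384 * 4K * 5 = 320M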


> So the sequential reads at 200+ MB/s look okay (although I do not
> understand the huge throughput variability with record size), but the
> writes are not as high as with 16MB stripe cache. This may be the
> setting that I decide to stick with, but I would like to understand
> what is going on.

> Why did increasing the stripe cache from 256 KB to 16 MB decrease the
> sequential read speeds?

The only reason I can guess at is that you actually changed it from
5M to 320M, and maybe that affected available buffer memory?
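
If you want to check that, watching the memory counters while iozone runs
should show it - a rough check, nothing more:

    # watch -n1 'grep -E "MemFree|Buffers|Cached" /proc/meminfo'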

NeilBrown

