On Sat, 24 Apr 2010 16:36:20 -0700 Joe Williams <jwilliams315@xxxxxxxxx> wrote:

> I am new to mdadm, and I just set up an mdadm v3.1.2 RAID-6 of five 2
> TB Samsung Spinpoint F3EGs. I created the RAID-6 with the default
> parameters, including a 512 KB chunk size. It took about 6 hours to
> initialize, then I created an XFS filesystem:
>
> # mkfs.xfs -f -d su=512k,sw=3 -l su=256k -l lazy-count=1 -L raidvol /dev/md0
> meta-data=/dev/md0          isize=256    agcount=32, agsize=45776384 blks
>          =                  sectsz=512   attr=2
> data     =                  bsize=4096   blocks=1464843648, imaxpct=5
>          =                  sunit=128    swidth=384 blks
> naming   =version 2         bsize=4096   ascii-ci=0
> log      =internal log      bsize=4096   blocks=521728, version=2
>          =                  sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none              extsz=4096   blocks=0, rtextents=0
>
> Note that 256k is the maximum allowed by mkfs.xfs for the log stripe unit.
>
> Then it was time to optimize the performance. First I ran a benchmark
> with the default settings (from a recent Arch linux install) for the
> following parameters:
>
> # cat /sys/block/md0/md/stripe_cache_size
> 256
>
> # cat /sys/block/md0/queue/read_ahead_kb
> 3072

2 full stripes - that is right.

> # cat /sys/block/sdb/queue/read_ahead_kb
> 128

This number is completely irrelevant.  Only the read_ahead_kb of the device
that the filesystem sees is used.

>
> # cat /sys/block/md0/queue/scheduler
> none
> # cat /sys/block/sdb/queue/scheduler
> noop deadline [cfq]
>
> # cat /sys/block/md0/queue/nr_requests
> 128
> # cat /sys/block/sdb/queue/nr_requests
> 128
>
> # cat /sys/block/md0/device/queue_depth
> cat: /sys/block/md0/device/queue_depth: No such file or directory
> # cat /sys/block/sdb/device/queue_depth
> 31
>
> # cat /sys/block/md0/queue/max_sectors_kb
> 127
> # cat /sys/block/sdb/queue/max_sectors_kb
> 512
>
> Note that sdb is one of the 5 drives for the RAID volume, and the
> other 4 have the same settings.
>
> First question, is it normal for the md0 scheduler to be "none"? I
> cannot change it by writing, e.g., "deadline" into the file.
>

Because software RAID is not a disk drive, it does not use an elevator and
so does not use a scheduler.  The whole 'queue' directory really shouldn't
appear for md devices, but for some very boring reasons it does.

> Next question, is it normal for md0 to have no queue_depth setting?

Yes.  The stripe_cache_size is conceptually a similar thing, but only at a
very abstract level.

>
> Are there any other parameters that are important to performance that
> I should be looking at?

No.

>
> I started the kernel with mem=1024M so that the buffer cache wasn't
> too large (this machine has 72G of RAM), and ran an iozone benchmark:
>
> Iozone: Performance Test of File I/O
>         Version $Revision: 3.338 $
>         Compiled for 64 bit mode.
>         Build: linux-AMD64
>
> Auto Mode
> Using Minimum Record Size 64 KB
> Using Maximum Record Size 16384 KB
> File size set to 4194304 KB
> Include fsync in write timing
> Command line used: iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
> Output is in Kbytes/sec
> Time Resolution = 0.000001 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
>
>                                                            random    random
>              KB  reclen    write  rewrite     read   reread      read     write
>         4194304      64   133608   114920   191367   191559      7772     14718
>         4194304     128   142748   113722   165832   161023     14055     20728
>         4194304     256   127493   108110   165142   175396     24156     23300
>         4194304     512   136022   112711   171146   165466     36147     25698
>         4194304    1024   140618   110196   153134   148925     57498     39864
>         4194304    2048   137110   108872   177201   193416     98759     50106
>         4194304    4096   138723   113352   130858   129940     78636     64615
>         4194304    8192   140100   114089   175240   168807    109858     84656
>         4194304   16384   130633   116475   131867   142958    115147    102795
>
> I was expecting a little faster sequential reads, but 191 MB/s is not
> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
> record sizes.

I don't know why it would decrease either.  For sequential reads, read-ahead
should be scheduling all the read requests and the actual reads should just
be waiting for the read-ahead to complete.  So there shouldn't be any
variability - but clearly there is.  I wonder if it is an XFS thing... care
to try a different filesystem for comparison? ext3?

>
> But the writes were disappointing. So the first thing I tried tuning
> was stripe_cache_size
>
> # echo 16384 > /sys/block/md0/md/stripe_cache_size
>
> I re-ran the iozone benchmark:
>
>                                                            random    random
>              KB  reclen    write  rewrite     read   reread      read     write
>         4194304      64   219206   264113   104751   108118      7240     12372
>         4194304     128   232713   255337   153990   142872     13209     21979
>         4194304     256   229446   242155   132753   131009     20858     32286
>         4194304     512   236389   245713   144280   149283     32024     44119
>         4194304    1024   234205   243135   141243   141604     53539     70459
>         4194304    2048   219163   224379   134043   131765     84428     90394
>         4194304    4096   226037   225588   143682   146620     60171    125360
>         4194304    8192   214487   231506   135311   140918     78868    156935
>         4194304   16384   210671   215078   138466   129098     96340    178073
>
> And now the sequential writes are quite satisfactory, but the reads
> are low. Next I tried 2560 for stripe_cache_size, since that is the 512KB x 5
> stripe width.

That is very weird, as reads don't use the stripe cache at all - when the
array is not degraded and no overlapping writes are happening.

And the stripe_cache is measured in pages-per-device.  So 2560 means 2560*4k
for each device.  There are 3 data devices, so 30720K or 60 stripes.
When you set stripe_cache_size to 16384, it would have consumed
16384*5*4K == 320Meg or 1/3 of your available RAM.  This might have affected
throughput, I'm not sure.
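To put numbers on that, here is a quick sketch of the same arithmetic,
i.e. stripe_cache_size * member-devices * page-size (assuming 4 KB pages
and the 5 member devices of this array):

# echo $(( 256 * 5 * 4 / 1024 ))      # default of 256 -> MB used
5
# echo $(( 2560 * 5 * 4 / 1024 ))     # 2560 -> MB used
50
# echo $(( 16384 * 5 * 4 / 1024 ))    # 16384 -> MB used
320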
> So the sequential reads at 200+ MB/s look okay (although I do not
> understand the huge throughput variability with record size), but the
> writes are not as high as with 16MB stripe cache. This may be the
> setting that I decide to stick with, but I would like to understand
> what is going on.
> Why did increasing the stripe cache from 256 KB to 16 MB decrease the
> sequential read speeds?

The only reason I can guess at is that you actually changed it from 5M to
320M, and maybe that affects the available buffer memory?

NeilBrown
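For anyone who wants to try the ext3 comparison suggested above, one possible
way to run it is sketched below.  This is not from the original thread: it
recreates the filesystem (so only do this on a scratch array), the mount
point is just an example, and the exact spelling of the stride/stripe-width
options may depend on the e2fsprogs version.  The values follow from the same
geometry as the XFS numbers above: 512 KB chunk / 4 KB block = 128 blocks,
and 128 * 3 data disks = 384 blocks.

# mkfs.ext3 -E stride=128,stripe-width=384 -L raidvol /dev/md0
# mount /dev/md0 /mnt/raidtest        # example mount point
# cd /mnt/raidtest
# iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2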