On Nov 9, 2007 11:16 PM, Andreas Dilger <adilger@xxxxxxx> wrote:
> On Nov 09, 2007 19:11 -0700, Chris Worley wrote:
> > How do you measure/gauge/assure proper alignment?
> >
> > The physical disk has a block structure.  What is it, or how do you
> > find it?  I'm guessing it's best not to partition disks, so that
> > whatever its read/write block size is isn't bisected by the
> > partition.
>
> For Lustre we never partition the disks for exactly this reason, and if
> you are using LVM/md on the whole device it doesn't make sense either.
>
> > Then, mdadm has some block structure.  The "-c" ("chunk") is in
> > "kibibytes" (feed the dog kibbles?), with a default of 64.  Not a
> > clue what they're trying to do.
>
> That just means for RAID 0/5/6 that the amount of data or parity in a
> stripe is a multiple of the chunk size, i.e. for a 4+1 RAID5 you get:
>
> disk0 disk1 disk2 disk3 disk4
> [64kB][64kB][64kB][64kB][64kB]
> [64kB][64kB]...
>
> > Finally, mkfs.ext[23] has a "stride", which is defined as a "stripe
> > size" in the man page (and I thought all your stripes added together
> > are a "stride"), as well as a block size.
>
> For ext2/3/4 the stride size (in kB) == the mdadm chunk size.  Note
> that the ext2/3/4 stride size is in units of filesystem blocks, so if
> you have 4kB filesystem blocks (the default for filesystems > 500MB)
> and a 64kB RAID5 chunk size, this is 16:
>
> mke2fs -E stride=16 /dev/md0

So, if:

  B = ext block size
  S = ext stride size
  C = md chunk size

then:

  S = C / B

Is that correct?

Ignorantly/randomly shopping around for values (using 1MB block sizes
and 16GB transfers in dd as the benchmark), I found performance
increased as I increased the MD chunk (testing just the MD device).
Above a chunk size of 1024, though, the raw MD performance kept
increasing while the ext filesystem on top of it got slower.
Strangely, the ext stride performed best set at 2048 (the equation
above says 256 would have been correct):

mdadm --create /dev/md0 --level=0 --chunk=1024 --raid-devices 12 /dev/sd[b-m]
mkfs.ext2 -T largefile4 -b 4096 -E stride=2048 /dev/md0

So perhaps the best that can be said is that the "S" in the equation
above is some factor of the stride value that actually works best.

Note that I am trying to optimize for big blocks and big files, with
little regard for data reliability.

I also found some strange performance differences between different
manufacturers' disks.  I have a bunch of Maxtor 15K and Seagate 10K
SCSI disks.  Streaming a single drive at a time, the Maxtor disks are
faster, but, in parallel, the Seagate drives are faster.  I measure
this with something like:

for i in /dev/sd[e-r]
do
    /usr/bin/time -f "$i: %e" \
        dd bs=1024k count=16000 of=/dev/null if=$i 2>&1 \
        | grep -v records &
done
wait

This test doesn't truly emulate an MD device, as each disk is treated
independently; a given disk is allowed to get ahead of the rest.  Why
the Seagates outperform the Maxtors is unknown.  They are evenly
distributed across the SCSI channels (as many Seagates on a channel as
Maxtors).  I'm guessing the Seagate disks have deeper buffers.

I remember that, a few years ago, increasing the number of outstanding
scatter/gather requests helped the performance of QLogic FC drivers...
is there any such driver or kernel tweak these days?

I'd still like to know what the disks use for a block size.
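A minimal sketch of how both of those (the reported block size and the
per-disk queue depths) could be checked, assuming a 2.6-style kernel
with sysfs mounted and util-linux's blockdev available; the
/dev/sd[e-r] range just mirrors the test loop above:

for d in /dev/sd[e-r]
do
    n=$(basename $d)
    echo "== $d =="
    blockdev --getss $d                      # logical sector size, bytes
    cat /sys/block/$n/queue/hw_sector_size   # same value, via sysfs
    cat /sys/block/$n/device/queue_depth     # per-device SCSI queue depth
    cat /sys/block/$n/queue/nr_requests      # block-layer request queue depth
done

The last two files are writable (echo a new value into them), though
whether a deeper queue actually helps is up to the HBA driver.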
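And, to make the stride arithmetic above concrete, a small sketch using
the numbers from the chunk=1024 test (the chunk is in kB and the block
size in bytes, hence the conversion):

CHUNK_KB=1024      # mdadm --chunk value, in kB
BLOCK=4096         # ext2 block size, in bytes
STRIDE=$(( CHUNK_KB * 1024 / BLOCK ))
echo "stride=$STRIDE"    # prints 256, i.e. S = C / B

# ...which, for the same 12-disk array, would give:
# mkfs.ext2 -T largefile4 -b 4096 -E stride=$STRIDE /dev/md0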
Thanks,

Chris

P.S. Andreas: Hope you're having fun at SC07... I don't get to go :(

> > It's important to make sure these all align properly, but their
> > definitions do.
>
> ... do not?
>
> > Could somebody please clarify... with an example?
>
> Yes, I constantly wish the terminology were consistent between
> different tools, but sadly there isn't any "proper" terminology out
> there as far as I've been able to see.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Software Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

_______________________________________________
Ext3-users mailing list
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users