On Thu, Apr 23, 2009 at 10:41:20AM +0100, Gordan Bobic wrote:
> On Thu, 23 Apr 2009 06:45:49 +0100, Andy Wallace
> <andy@xxxxxxxxxxxxxxxxxxxx> wrote:
> > On Wed, 2009-04-22 at 20:37 -0300, Flavio Junior wrote:
> >> On Wed, Apr 22, 2009 at 8:11 PM, Andy Wallace
> >> <andy@xxxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> > Although it's not as quick as I'd like, I'm getting about 150MB/s on
> >> > average when reading/writing files in the 100MB - 1GB range. However,
> >> > if I try to write a 10GB file, this goes down to about 50MB/s. That's
> >> > just doing dd to the mounted gfs2 on an individual node. If I do a get
> >> > from an ftp client, I'm seeing half that; cp from an NFS mount is more
> >> > like 1/5.
> >>
> >> Have you tried the same thing with another filesystem? Ext3 maybe?
> >> You are using RAID, right? Did you check RAID and LVM/partition
> >> alignment?
> >>
> >> If you try ext3, see the -E stride and -E stripe_width values on the
> >> mkfs.ext3 manpage.
> >> This calc should help: http://busybox.net/~aldot/mkfs_stride.html

Oh.. nice tool. I've always been calculating it manually. Then again, the
math is pretty simple..

> > Yes, I have (by the way, do you know how long ext3 takes to create a
> > 6TB filesystem???).
>
> It depends on what you consider to be "long". The last box I built had
> 8x 1TB 7200rpm disks in software md RAID6 = 6TB usable, DAS, consumer
> grade motherboard, and 2 of the 8 ports were massively bottlenecked by
> a 32-bit 33MHz PCI SATA controller, but all 8 ports had NCQ. This took
> about 10-20 minutes (I haven't timed it exactly, about a cup of coffee
> long ;)) to mkfs ext3 when the parameters were properly set up. With
> default settings, it was taking around 10x longer, maybe even more.

Interesting information.. what settings did you tweak for faster mkfs?
Just the things mentioned below?

> My findings are that the default settings and old wisdom often taken
> as axiomatically true are actually completely off the mark.
>
> Here is the line of reasoning that I have found to lead to the best
> results.
>
> A RAID block size of 64-256KB is way, way too big. It will kill the
> performance of small IOs without yielding a noticeable increase in
> performance for large IOs, and sometimes in fact hurting large IOs,
> too.
>
> To pick the optimum RAID block size, look at the disks. What is the
> multi-sector transfer size they can handle? I have not seen any disks
> to date that have this figure at anything other than 16, and
> 16 sectors * 512 bytes/sector = 8KB.

Hmm.. how can you determine the multi-sector transfer size from a disk?

> So set the RAID block size to 8KB.
>
> Make sure your stride parameter is set so that
> ext3 block size (usually 4KB) * stride = RAID block size;
> in this case ext3 block size = 4KB, stride = 2, RAID block = 8KB.
>
> So far so good, but we're not done yet. The last thing to consider is
> the extent / block group size. The beginning of each block group
> contains a superblock for that group. It is the top of that group's
> inode tree, and needs to be checked to find any file/block in that
> group. That means the beginning block of a block group is a major
> hot-spot for I/O, as it has to be read for every read and written for
> every write to that group. This, in turn, means that for anything like
> reasonable performance you need to have the block group beginnings
> distributed evenly across all the disks in your RAID array, or else
> you'll hammer one disk while the others are sitting idle.
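On the multi-sector question and the stride setting above, before the block
group part continues below: hdparm will report the drive's "R/W multiple"
(multi-sector) setting, and the stride is passed to mkfs.ext3 at creation
time. A rough sketch only; /dev/sda and /dev/md0 are placeholders and the
exact hdparm output format varies by drive and driver:

  # What multi-sector (R/W multiple) transfer size does the drive report?
  hdparm -I /dev/sda | grep -i 'multiple sector'
  #   e.g. "R/W multiple sector transfer: Max = 16  Current = 16"
  #   16 sectors * 512 bytes/sector = 8KB

  # ext3 on an 8KB-chunk md array: 4KB blocks, stride = 8KB / 4KB = 2
  mkfs.ext3 -b 4096 -E stride=2 /dev/md0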
>
> For example, the default for ext3 is 32768 blocks in a block group.
> IIRC, on ext3, the adjustment can only be made in increments of 8
> blocks (32KB assuming 4KB blocks). In GFS, IIRC, the minimum
> adjustment increment is 1MB(!) (not a FS limitation but a mkfs.gfs
> limitation, I'm told).
>
> The optimum number of blocks in a group will depend on the RAID level
> and the number of disks in the array, but you can simplify it into a
> RAID0 equivalent for the purpose of this exercise, e.g. 8 disks in
> RAID6 can be considered to be 6 disks in RAID0. Ideally you want the
> block group size to align to the stripe width +/- 1 stride width so
> that the block group beginnings rotate among the disks (upward for +1
> stride, downward for -1 stride; both will achieve the same purpose).
>
> The stripe width in the case described is 8KB * 6 disks = 48KB. So,
> you want the block group to align to a multiple of 8KB * 7 disks =
> 56KB. But be careful here - you should aim for a number that is a
> multiple of 56KB, but not a multiple of 48KB, because if they line up
> you haven't achieved anything and you're back where you started!
>
> 56KB is 14 4KB blocks. Without getting involved in a major factoring
> exercise, 28,000 blocks sounds good (the default is 32768 for ext3,
> which is in a reasonable ballpark). 28,000 * 4KB is a multiple of 56KB
> but not of 48KB, so it looks like a good choice in this example.
>
> Obviously, you'll have to work out the optimal numbers for your
> specific RAID configuration; the above example is for:
> disk multi-sector = 16
> ext3 block size = 4KB
> RAID block size = 8KB
> ext3 stride = 2
> RAID = 6
> disks = 8
>
> This is one of the key reasons why I think LVM is evil. It abstracts
> things and encourages no forward thinking. Adding a new volume is the
> same as adding a new disk to a software RAID to stretch it. It'll
> upset the block group size calculation and in one fell swoop take away
> the advantage of load balancing across all the disks you have. By
> doing this you can cripple the performance of some operations from
> scaling linearly with the number of disks to being bogged down to the
> performance of just one disk.
>
> This can make a massive difference to the IOPS figures you get out of
> a storage system, but I suspect that enterprise storage vendors are
> much happier being able to sell more (and more expensive) equipment
> when the performance gets crippled through misconfiguration, or even
> just lack of consideration of parameters such as the above. This is
> also why quite frequently a cheap box made of COTS components can
> completely blow away a similar enterprise-grade box with 10-100x the
> price tag.
>
> > I've aligned the RAID and LVM stripes using various different values,
> > and found slight improvements in performance as a result. My main
> > problem is that when the file size hits a certain point, performance
> > degrades alarmingly. For example, on NFS moving a 100M file is about
> > 20% slower than direct access, with a 5GB file it's 80% slower (and
> > the direct access itself is 50% slower).
> >
> > As I said before, I'll be working with 20G-170G files, so I really
> > have to find a way around this!
>
> Have you tried increasing the resource group sizes? IIRC the default
> is 256MB (-r parameter to mkfs.gfs), which may well have a
> considerable impact when you are shifting huge files around. Try
> upping it, potentially by a large factor, and see how that affects
> your large file performance.
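To make the block group arithmetic and the -r suggestion above concrete,
a sketch could look like the following. mke2fs takes blocks-per-group as
-g, and mkfs.gfs2 takes the resource group size in MB as -r (gfs_mkfs
likewise); the device paths, the cluster:fsname, the journal count and
the 2048MB value are only placeholders, not recommendations:

  # 28,000 blocks * 4KB = 112,000KB
  #   112,000 / 56 = 2000     -> a multiple of 56KB (group starts rotate)
  #   112,000 / 48 = 2333.33  -> not a multiple of 48KB (no line-up)
  mkfs.ext3 -b 4096 -E stride=2 -g 28000 /dev/md0

  # GFS2: resource groups play the same role; default 256MB, -r sets MB
  mkfs.gfs2 -p lock_dlm -t mycluster:gfs01 -j 3 -r 2048 /dev/shared_lun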
>
> Note - "resource group" in GFS is the same as the "block group"
> described above for ext3 in the performance optimization example.
> However, while ext3 is adjustable in multiples of 8 blocks (32KB), gfs
> is only adjustable in increments of 1MB.

Very good information you have here.. Thanks for posting it.

-- Pasi

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster