On Thu, 23 Apr 2009 06:45:49 +0100, Andy Wallace <andy@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Wed, 2009-04-22 at 20:37 -0300, Flavio Junior wrote:
>> On Wed, Apr 22, 2009 at 8:11 PM, Andy Wallace <andy@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> > Although it's not as quick as I'd like, I'm getting about 150MB/s on
>> > average when reading/writing files in the 100MB - 1GB range. However,
>> > if I try to write a 10GB file, this goes down to about 50MB/s. That's
>> > just doing dd to the mounted gfs2 on an individual node. If I do a get
>> > from an ftp client, I'm seeing half that; cp from an NFS mount is more
>> > like 1/5.
>>
>> Have you tried the same thing with another filesystem? Ext3 maybe?
>> You are using RAID, right? Did you check the RAID and LVM/partition
>> alignment?
>>
>> If you will try ext3, see the -E stride and -E stripe_width values
>> in the mkfs.ext3 manpage.
>> This calc should help: http://busybox.net/~aldot/mkfs_stride.html
>
> Yes, I have (by the way, do you know how long ext3 takes to create a 6TB
> filesystem???).

It depends on what you consider to be "long". The last box I built had 8x 1TB 7200rpm disks in software md RAID6 = 6TB usable, DAS, a consumer-grade motherboard, and 2 of the 8 ports massively bottlenecked by a 32-bit 33MHz PCI SATA controller, but all 8 ports had NCQ. mkfs.ext3 on that took about 10-20 minutes (I haven't timed it exactly; about a cup of coffee long ;)) once the parameters were set up properly. With the default settings it took around 10x longer, maybe even more.

My finding is that the default settings, and the old wisdom often taken as axiomatically true, are frequently completely off the mark. Here is the line of reasoning that I have found to lead to the best results.

A RAID block (chunk) size of 64-256KB is way, way too big. It will kill the performance of small IOs without yielding any noticeable gain for large IOs, and sometimes it in fact hurts large IOs too. To pick the optimum RAID chunk size, look at the disks: what multi-sector transfer size can they handle? I have not seen any disk to date that reports anything other than 16, and 16 sectors * 512 bytes/sector = 8KB. So set the RAID chunk size to 8KB. Then make sure the stride parameter is set so that ext3 block size (usually 4KB) * stride = chunk size; in this case ext3 block size = 4KB, stride = 2, RAID chunk = 8KB.

So far so good, but we're not done yet. The last thing to consider is the extent / block group size. The beginning of each block group contains a superblock for that group. It is the top of that group's inode tree and has to be consulted to find any file/block in the group, which makes the first block of a block group a major I/O hot-spot: it gets read on every read and written on every write to that group. This, in turn, means that for anything like reasonable performance you need the block group beginnings distributed evenly across all the disks in your RAID array; otherwise you will hammer one disk while the others sit idle.

For example, the default for ext3 is 32768 blocks per block group. IIRC, on ext3 the group size can only be adjusted in increments of 8 blocks (32KB assuming 4KB blocks). In GFS, IIRC, the minimum adjustment increment is 1MB(!) (not a filesystem limitation but a mkfs.gfs limitation, I'm told).
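To put the chunk/stride part of that into concrete commands, here is roughly what I mean. The device names are made up, and you should double-check the exact option spellings against your local hdparm(8), mdadm(8) and mke2fs(8) man pages, but the shape of it is:

  # check the multi-sector transfer size the disks report
  # (look for a line like "R/W multiple sector transfer: Max = 16")
  hdparm -I /dev/sda | grep -i "multiple sector"

  # create the array with an 8KB chunk (mdadm takes --chunk in KB)
  mdadm --create /dev/md0 --level=6 --raid-devices=8 --chunk=8 /dev/sd[a-h]1

  # 4KB ext3 blocks; stride = chunk / block size = 8KB / 4KB = 2;
  # stripe_width = stride * data disks = 2 * 6 = 12
  mkfs.ext3 -b 4096 -E stride=2,stripe_width=12 /dev/md0

If your e2fsprogs is old enough not to know about stripe_width, the stride value is the one that matters for the reasoning above.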
The optimum number of blocks in a group will depend on the RAID level and the number of disks in the array, but for the purpose of this exercise you can simplify it to the RAID0 equivalent: 8 disks in RAID6 can be treated as 6 disks in RAID0. Ideally you want the block group size to align to the stripe width +/- one stride (i.e. one chunk), so that the block group beginnings rotate among the disks (upward for +1, downward for -1; both achieve the same purpose). The stripe width in the case described is 8KB * 6 disks = 48KB, so you want the block group to align to a multiple of 8KB * 7 = 56KB. But be careful here: aim for a number that is a multiple of 56KB but NOT a multiple of 48KB, because if the two line up you haven't achieved anything and you're back where you started. 56KB is 14 4KB blocks. Without getting involved in a major factoring exercise, 28,000 blocks sounds good (the ext3 default of 32768 is in a reasonable ball park): 28,000 * 4KB is a multiple of 56KB but not of 48KB, and 28,000 is also a multiple of 8, so ext3 will accept it. That makes it a good choice in this example.

Obviously, you'll have to work out the optimal numbers for your specific RAID configuration. The example above is for:

  disk multi-sector transfer = 16 sectors
  ext3 block size            = 4KB
  RAID block (chunk) size    = 8KB
  ext3 stride                = 2
  RAID level                 = 6
  number of disks            = 8

This is one of the key reasons why I think LVM is evil. It abstracts things away and encourages no forward thinking. Adding a new volume is the same as adding a new disk to a software RAID to stretch it: it upsets the block group size calculation and in one fell swoop takes away the load balancing across all the disks you have. That can take the performance of some operations from scaling linearly with the number of disks down to being bogged down by a single disk. This makes a massive difference to the IOPS figures you get out of a storage system, but I suspect the enterprise storage vendors are much happier being able to sell more (and more expensive) equipment when performance gets crippled through misconfiguration, or even just through lack of consideration of parameters such as the above. It is also why a cheap box made of COTS components can quite frequently completely blow away a similar enterprise-grade box with a 10-100x price tag.

> I've aligned the RAID and LVM stripes using various different values,
> and found slight improvements in performance as a result. My main
> problem is that when the file size hits a certain point, performance
> degrades alarmingly. For example, on NFS moving a 100M file is about 20%
> slower than direct access, with a 5GB file it's 80% slower (and the
> direct access itself is 50% slower).
>
> As I said before, I'll be working with 20G-170G files, so I really have
> to find a way around this!

Have you tried increasing the resource group size? IIRC the default is 256MB (the -r parameter to mkfs.gfs), which may well have a considerable impact when you are shifting huge files around. Try upping it, potentially by a large factor, and see how that affects your large-file performance.

Note: a "resource group" in GFS is the same thing as the "block group" described above for ext3 in the performance optimization example. However, while ext3 is adjustable in multiples of 8 blocks (32KB), GFS is only adjustable in increments of 1MB.

HTH

Gordan
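P.S. For completeness, here is roughly how I would sanity-check the block group numbers above and feed them to mkfs. The device names, cluster/fs name, journal count and the exact resource group figure are only illustrative, so check them against the mke2fs and mkfs.gfs2 man pages rather than taking my word for it:

  # 28000 blocks * 4KB should be a multiple of 56KB (stripe + 1 chunk)
  # but NOT of 48KB (the stripe width itself)
  echo $(( 28000 * 4 % 56 ))    # prints 0  -> aligned to 56KB, group starts rotate
  echo $(( 28000 * 4 % 48 ))    # prints 16 -> not aligned to 48KB, which is the point

  # same mkfs.ext3 line as before, with the blocks-per-group set explicitly
  mkfs.ext3 -b 4096 -E stride=2,stripe_width=12 -g 28000 /dev/md0

  # GFS2: larger resource groups (-r is in MB, default 256) for the huge files
  mkfs.gfs2 -p lock_dlm -t mycluster:bigfs -j 4 -r 1024 /dev/vg_data/lv_gfs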