First, sorry for the length. I tend to get windy talking shop. :)

Andrew Klaassen put forth on 2/18/2011 2:31 PM:

> It's IBM and LSI gear, so I'm crossing my fingers that a Linux install
> will be relatively painless.

Ahh, good. At least, so far it seems so. ;)

> I thought that the filesystem block size was still limited to the kernel
> page size, which is 4K on x86 systems.
>
> http://oss.sgi.com/projects/xfs/
>
> "The maximum filesystem block size is the page size of the kernel, which
> is 4K on x86 architecture."
>
> Is this no longer true? It would be awesome news if it wasn't.

My mistake. It would appear you are limited to the page size, which, as I mentioned, is still 4 KiB for most distro kernels on x86. If you roll your own kernel you can obviously tweak this, but to what end? The kernel team's trend is toward smaller page sizes for greater memory usage efficiency.

> My quick calculations were based on worst-case random read, which is
> what we were seeing with the Exastore. They had a 64K blocksize * 48
> disks * 70 seeks per second = 215 MB/s, which is exactly what we were
> seeing under load. Under heavy random load, I'm worried that XFS has to
> either thrash the disks with 4K reads and writes ~or~ introduce
> unnecessary latency by doing read-combining and write-combining and/or
> predictive elevator hanky-panky.

I think you're giving too much weight to the filesystem block size WRT random read IO throughput. Once you seek to the start of the file's location on disk, there is no more effort involved in reading the next 128 disk sectors whether the XFS block size is 8 sectors or 128 sectors.

And for files smaller than 64 KiB you're actually _decreasing_ your seek performance with 64 KiB blocks. For instance, with a 16 KiB file and a 4 KiB block size, the head seeks to the start of the file, reads 4 blocks (32 sectors), and is then free to seek to the next request. With a 64 KiB block size, the head seeks to the start of that same 16 KiB file and then reads 128 sectors, the last 96 of which are empty or belong to another file, so you've wasted time reading 96 sectors instead of letting the head seek to the next request.

So, a smaller block size doesn't cost you performance on large files, and it gives you better performance on small files.

Also, 215 MB/s of random IO seems absolutely horrible for 48 drives. Are these 15k FC/SAS drives or 7.2k SATA drives? A single 15k drive should sustain ~250-300 seeks/sec, a 7.2k drive about 100-150. 70 seeks/sec is below 5.4k laptop drive territory.

Additionally, tweaking things like

/sys/block/[dev]/queue/max_sectors_kb
/sys/block/[dev]/queue/nr_requests

and the elevator

/sys/block/[dev]/queue/scheduler

will affect this performance more than the FS block size.
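For example, something along these lines--the device name is hypothetical and the values are only illustrative starting points to benchmark against, not recommendations (max_sectors_kb is the writable knob, capped by the read-only max_hw_sectors_kb):

    # hypothetical device name; repeat for each array's block device
    DEV=sdb
    # allow larger individual requests to be sent to the controller
    echo 512 > /sys/block/$DEV/queue/max_sectors_kb
    # deepen the request queue so the elevator has more to merge and sort
    echo 512 > /sys/block/$DEV/queue/nr_requests
    # noop or deadline often behaves better than cfq in front of a HW RAID controller
    echo deadline > /sys/block/$DEV/queue/scheduler

Test each change under your real workload before keeping it.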
> I miscounted; it's 48 drives split into 6 hardware RAID-5 arrays.

Eeuuww. RAID 5 is not known for stellar random read performance (nor stellar anything performance--it's especially horrible for random writes). Quite the opposite.

A suggestion. You'd lose about 43% of your current usable space if my math is correct (42 data spindles down to 24), but reconfiguring each of those arrays as hardware RAID 10 instead of RAID 5, and concatenating them with mdraid or LVM, should give you at _minimum_ a 2:1 boost in sustained random read IOPS and bandwidth, probably far more. Random writes would improve even more dramatically. If you can get by with that much less space, I'd go with six 8-disk HW RAID 10s in the new setup, assuming you have 6 LSI HBAs.

Whatever the number of HBAs, create one RAID 10 on each, with an equal number of drives on every HBA. It doesn't make sense to have more than one RAID pack on a single HBA--it just slows the HBA down considerably. If they did that with these RAID 5s, it could explain some of the performance problem. I'd set the LSI RAID 10 stripe size to between 64 KB and 256 KB, depending on your average file size.

I'd then concatenate the resulting 6 devices (or however many there are) with mdadm or LVM (mdadm is probably a little faster, LVM more flexible). Then, when creating your XFS filesystem, specify agcount=48 (i.e. agcount = #HBAs * 8 in this case), which gives you 8 allocation groups per HW array--in essence 2 AGs per striped spindle, since each 8-disk array is 4 mirror pairs striped across 4 spindles. This should get you the parallelism you need for high performance multiuser random IO workloads.
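A minimal sketch of that layout, assuming the six hardware RAID 10 arrays show up as /dev/sdb through /dev/sdg (device names are hypothetical--verify yours in /proc/partitions first):

    # concatenate (not stripe) the six HW RAID 10 arrays
    mdadm --create /dev/md0 --level=linear --raid-devices=6 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

    # 48 allocation groups = 8 AGs per hardware array
    mkfs.xfs -d agcount=48 /dev/md0

The LVM route would be pvcreate on each array device, one vgcreate across them, and a single (default, linear) lvcreate, with the same mkfs.xfs invocation on top.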
This all assumes a highly loaded server with access to many different files. If your access pattern is one heavy-hitter app against only a few big files, getting parallelism via lots of allocation groups on concatenated storage may not be the way to go. In that case we'd need multi-layer striping: HW RAID 10 with software RAID 0 across the arrays. I didn't recommend that because getting average-size files broken into chunks that fit neatly across a layered stripe is almost impossible, and you end up with weird allocation patterns on disk, wasted space, etc. I think it's better to use smallish HW stripes and no SW stripes in a case like this, and let XFS drive the parallelism via allocation groups. That yields better file layout on disk and better space utilization.

In addition, with concatenation--as we recently learned from an OP who went through it (unfortunately)--you can lose an entire hardware array and the FS can keep chugging along after a repair. You simply lose the files that lived on the dead array.

> Is there a way to monitor log operations to find out how much is being
> used at a given time?

Point in time? Probably not. I'm sure there's a counter somewhere but I'm not familiar with it. What you should be concerned with isn't how much of the journal log is in use at any point in time, but how fast data is moving through the log. This is why the speed of the log device is critical and its size is not. Recall that the max log size is 2 GB.

> All the metadata eventually has to be written to the main array, so
> doesn't that ultimately become the limiting factor on metadata
> throughput under sustained load?

The answer is: it depends on the workload. Add another "depends" when using delaylog. For example, a given directory inode may be modified many times within a very short period. 'rm -rf' on a huge directory is a good example: an enormous number of modifications to the directory are performed, but with delaylog they are coalesced into one or a few actual writes to the journal and filesystem instead of many thousands. These kinds of operations are historically where the metadata bottleneck lurked.

If you simply have 1000 users hitting a fileserver and each writes a file every minute or so, you'll never see a metadata bottleneck. If you have that _and_ one user decides to delete a directory with 100k files in it, then you have a metadata bottleneck--at least if you're not using delaylog. With delaylog you won't see much of a bottleneck, although you will see some pretty high CPU usage for a 100k-file delete. The load on the on-disk journal log, however, will be relatively light.

Please keep us posted. I'm really interested to see what you end up doing with this and how it performs afterward.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs