On 10/27/2012 9:55 PM, Erez Zarum wrote:
> I have already tried using linear mode, performance drops significantly.
> To show you how big the impact is, with seqrd 64 threads (sysbench) i
> get about 2.5Gb/s using striped LVM, with linear mode i get around
> 800MB/s.

Apologies if I seemed hostile, or short tempered, Erez.  We get a lot of long winded people on here who "dream out loud" and end up wasting a lot of time with "what ifs".  I mistook you for such a person.

So, with an XFS over concat, if your benchmark only writes/reads to one or a few directories you're only going to hit the first few allocation groups, which means you're likely only hitting the first RAID set or two.  You can fix this either by using far more directories and hitting all AGs in the concat, i.e. hitting all RAID sets, or by manually specifying "-d agcount=" at mkfs time to reduce the number of allocation groups, precisely matching agcount to your arrays, which achieves the same result with fewer directories.

For instance, you're using five 4+1 RAID5 arrays.  The max XFS agsize is 1TB.  If these are 250GB SSDs or smaller, you would do

$ mkfs.xfs -d agcount=5 /dev/md0

and end up with exactly 1 AG per RAID set, 5 AGs total.  In actuality 5 AGs is likely too few, as it will limit allocation parallelism to a degree, though not as much as with rust.  Thus, if your real workload creates lots (thousands) of files in parallel, you probably want 10 AGs total, 2 per array.

Run your benchmark against 10 directories and you should see something close to that 2.5GB/s figure you achieve with LVM striping, possibly more depending on the files and access patterns, unless you have a single SFF-8088 cable between the controller and the expander, in which case 2.5GB/s is pretty much the speed limit.

Second, benchmarks are synthetic tests for comparing hardware and operating systems in apples-to-apples fashion.  They are not application workloads and rarely come anywhere close to mimicking a real workload.  There are few, if any, real world workloads that require 800MB/s of streaming throughput, let alone 2.5GB/s, to InnoDB files.  Maybe you have one, but I doubt it.

I forgot to mention in my previous reply that you'll want to add "inode64" to your mount options.  This changes the allocation behavior of XFS in a very positive way, increasing directory metadata and allocation performance quite a bit for parallel workloads.  It has more positive effect on rust but is still helpful with SSD.

> I don't want to be rude, but please, before saying what i have below
> is a damn mess, it's after i have spent hours of running benchmarks
> and getting the correct numbers.

Well, the way you described it made it look so. ;)  To me, and others, LVM is simply a mess.  If one absolutely needs to take snapshots, one must have it.  If not, I say don't use it.  Using XFS with md linear yields a cleaner, high performance solution, with easy infinite expandability.
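For the record, a bare-bones sketch of that setup might look something like the following.  The device names are only placeholders for illustration: /dev/sdb through /dev/sdf standing in for the five RAID5 virtual drives exported by the controller, /dev/md0 for the concat, and /data for the mount point.

# device names are placeholders - substitute your actual RAID5 VDs
# concatenate the five RAID5 arrays into one md linear device
$ mdadm --create /dev/md0 --level=linear --raid-devices=5 \
      /dev/sd[bcdef]

# 2 allocation groups per constituent array, 10 total
$ mkfs.xfs -d agcount=10 /dev/md0

# inode64 spreads new inodes and file data across all AGs
$ mount -o inode64 /dev/md0 /data

# spread the benchmark over 10 directories so it engages all 10 AGs
$ mkdir /data/bench{0..9}

Expanding later is then just a matter of growing the concat and the filesystem, e.g. "mdadm --grow /dev/md0 --add /dev/sdg" followed by "xfs_growfs /data", with /dev/sdg being a hypothetical sixth array.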
> I assume you told me to do this because it's SSD and i will saturate
> the PCI BUS before i will be able to saturate the disks, this
> assumption is usually wrong, especially when this LSI Controller is
> Gen3.

My recommendation had nothing to do with hardware limitations.  But since you mentioned this, I'll point out that RAID ASICs nearly always bottleneck before the PCIe bus, especially when using parity arrays.  Running parity RAID on the dual core 2208 controllers will top out at about 3GB/s with everything optimally configured, assuming SSDs or so many rust disks that they outrun the ASIC.

You're close to the max ASIC throughput with your 2.5GB/s LVM stripe setup.  If you truly want maximum performance from your SSDs, which can likely stream 400MB/s+ each, you should be using something like 4x 9207-8i HBAs in a 32+ slot chassis, with 4 more SSDs, 32 total.  You'd configure 4x 6+1 mdadm RAID5 arrays (one array per HBA), with 4 spares.  Add the RAID5s to a linear array, of course with XFS atop and the proper number of AGs for your workload and SSD size.

With 8 sufficiently stout CPU cores (4 to handle the 4 md/RAID5 write threads and 4 for interrupts, the application, etc.) on a good high bandwidth system board design, proper Linux tuning, MSI-X, irqbalance, etc., you should be able to get 300MB/s per SSD, or ~7GB/s aggregate, more than double that of the, I'm guessing here, 9286-8e you're using.  Bear in mind that hitting write IO of 7GB/s, with any hardware, will require substantial tuning effort.  Joe Landman probably has more experience than anyone here with such SSD setups and might be able to give you some pointers, should 2.5GB/s be insufficient for your needs. ;)

> I also don't need you to explain this to me as I understand exactly
> why it's slow, I came here asking a simple question which we can
> summarize: "If i have PV which is based on mdadm array, i then expand
> the mdadm array, do i lose the data alignment in LVM?"

I "explained" the XFS + md option because it eliminates this alignment confusion entirely, and it's simply a less complex, better solution.

> I always opt to mailing list as the last option and this was also a
> suggestion i got from the IRC channel.

XFS + md linear was suggested on IRC?  Or something to do with LVM was suggested?

>>> The problem i have in this setup is that i couldn't make it work, I
>>> know i need to align the XFS allocation group with the LV boundaries,

Again, using the setup I mentioned, XFS writeout will be in 4KB blocks, eliminating possible filesystem stripe misalignment issues.  You can still align XFS if you want, but for a workload comprised of 16KB writes it won't gain you much, if anything.  The BBWC will take care of coalescing adjacent sectors during writeback scheduling, so FS stripe alignment isn't a big issue, especially with delaylog coalescing log writes.  Speaking of which, are you using kernel 2.6.39 or later?

>>> but i couldn't find a way to do it correctly, during my benchmarks i
>>> utilized only 1 disk and didn't get that much parallel I/O (regardless
>>> of threads).

If you were allocating to a single directory, or just a few, that would tend to explain the lack of parallelism.  XFS allocations to a single directory, therefore a single AG, are mostly serialized.  To get allocation parallelism, you must allocate to multiple AGs, i.e. to multiple directories.  This AG-based allocation parallelism is one of XFS' greatest strengths, if one's workload does a lot of parallel allocation.

Speaking of which, you seem to be benchmarking file creation.  Most database workloads are append heavy, not allocation heavy.  Can you briefly describe your workload's file access patterns?  How many db files will you have, and which ones will be written to often?  Having 20+ SSDs may not help your application much if most of the write IO is to only a handful of files.

--
Stan