Well, I don't dream out loud :), and I don't want to waste anyone's time, especially not that of the people on the mailing list. To be more accurate, I don't have a lot of small files or directories; I have about 3 big tables (each one is a file) of uneven size: one is small at about 400GB, another is around 2.5TB and another around 3TB.

I can't use a small agcount like 5 because that would push the agsize above 1TB. Each SSD is 480GB, which gives me about 440GB usable per disk, so the five 4+1 RAID5 sets come to roughly 8.8TB of usable space; with agsize capped at 1TB, the minimum agcount I can use is 9. I have already tried that and it didn't help.

The benchmark I am doing usually predicts quite good results when moving from staging to production. This was just a small test case for the benchmark; it runs a lot of other tests, and all of them only used 1 RAID set at a time, for obvious reasons (as I understand it), since I don't use a lot of small files. On my other tests I have used inode64; now I have tested inode64 with linear and it didn't help either.

I do want to use LVM, especially because of the snapshot options. I came here because I wanted to know what happens to the alignment if I grow an mdadm device that is a PV, which is exactly the situation I am in.

As for your last note, I have one server with an identical configuration, 3 LSI HBAs with 24 SSDs, and it performs well, but I don't intend to scale that one up. I chose the MegaRAID because I wanted the option to expand RAID50 without any problems; if I had known it wouldn't be possible I would have gotten the HBAs instead. The server has 2 x Intel E5-2670, which yields 32 logical cores with Hyper-Threading enabled. I also set up the interrupts so they are distributed across the first CPU (which the controller is attached to) and disabled irqbalance; I can see with mpstat that I utilize all cores evenly.

XFS + md linear was not suggested on IRC; it was suggested to ask here what happens if I grow a RAID device (mdadm in this case) which is a PV: do I lose the alignment?

I will probably disable the writeback cache, as I saw I get better performance without it (LSI also recommends disabling it when using FastPath). I am using CentOS 6.3 (2.6.32-279.11.1.el6.x86_64), and I know delaylog is enabled by default there.

I use a few files as this reflects what I have (3 big tables/3 big files), and I don't just benchmark file allocation; I also benchmark different types of read patterns (random read, random read/write, sequential read). My database is mostly a read workload (80% read, 20% write). The biggest concern is scaling up.

Thanks! (One clarifying question inline below as well.)
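
To be concrete, the agcount=9 run I mentioned was roughly along these lines; treat the 64KB strip size, the md device name and the mount point as placeholders rather than my exact values:

$ mkfs.xfs -d agcount=9,su=64k,sw=4 /dev/md0   # su/sw match one 4+1 RAID5 set (64KB strip x 4 data disks)
$ mount -o inode64 /dev/md0 /data
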
On Sun, Oct 28, 2012 at 8:34 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
> On 10/27/2012 9:55 PM, Erez Zarum wrote:
>> I have already tried using linear mode, performance drops significantly. To show you how big the impact is, with seqrd 64 threads (sysbench) I get about 2.5GB/s using striped LVM, with linear mode I get around 800MB/s.
>
> Apologies if I seemed hostile, or short tempered, Erez. We get a lot of long winded people on here who "dream out loud" and end up wasting a lot of time with "what ifs". I mistook you for such a person.
>
> So, with an XFS over concat, if your benchmark only writes/reads to 1 or a few directories you're only going to hit the first few allocation groups, which means you're likely only hitting the first RAID set or two. You can fix this by either using far more directories and hitting all AGs in the concat, i.e. hitting all RAID sets, or you can manually specify "-d agcount=" at mkfs time to reduce the number of allocation groups, precisely matching agcount to your arrays, to achieve the same result with fewer directories.
>
> For instance, you're using five 4+1 RAID5 arrays. The max XFS agsize is 1TB. If these are 250GB SSDs or smaller, you would do
>
> $ mkfs.xfs -d agcount=5 /dev/md0
>
> and end up with exactly 1 AG per RAID set, 5 AGs total. In actuality, 5 AGs is likely too few, as it will limit allocation parallelism to a degree, though not as much as with rust. Thus, if your real workload creates lots (thousands) of files in parallel then you probably want to use 10 AGs total, 2 per array. Run your benchmark against 10 directories and you should see something close to that 2.5GB/s figure you achieve with LVM striping, possibly more depending on the files and access patterns, unless you have a single SFF8088 cable between the controller and the expander, in which case 2.5GB/s is pretty much the speed limit.
>
> Second, benchmarks are synthetic tests for comparing hardware and operating systems in apples to apples tests. They are not application workloads and rarely come anywhere close to mimicking a real workload. There are few, if any, real world workloads that require 800MB/s streaming throughput, let alone 2.5GB/s, to innodb files. Maybe you have one, but I doubt it.
>
> I forgot to mention in my previous reply that you'll want to add "inode64" to your mount options. This changes the allocation behavior of XFS in a very positive way, increasing directory metadata and allocation performance quite a bit for parallel workloads. It has more positive effect on rust but is still helpful with SSD.
>
>> I don't want to be rude, but please, before saying what I have below is a damn mess, consider that it's after I have spent hours running benchmarks and getting the correct numbers.
>
> Well, the way you described it made it look so. ;) To me, and others, LVM is simply a mess. If one absolutely needs to take snapshots one must have it. If not, I say don't use it. Using XFS with md linear yields a cleaner, high performance solution, with easy infinite expandability.
>
>> I assume you told me to do this because it's SSD and I will saturate the PCI bus before I will be able to saturate the disks; this assumption is usually wrong, especially when this LSI controller is Gen3.
>
> My recommendation had nothing to do with hardware limitations. But since you mentioned this I'll point out that RAID ASICs nearly always bottleneck before the PCIe bus, especially when using parity arrays. Running parity RAID on the dual core 2208 controllers will top out at about 3GB/s with everything optimally configured, assuming SSDs or so many rust disks they outrun the ASIC. You're close to the max ASIC throughput with your 2.5GB/s LVM stripe setup.
>
> If you truly want maximum performance from your SSDs, which can likely stream 400MB/s+ each, you should be using something like 4x 9207-8i in a 32+ slot chassis, with 4 more SSDs, 32 total. You'd configure 4x 6+1 mdadm RAID5 arrays (one array per HBA), with 4 spares. Add the RAID5s to a linear array, of course with XFS atop and the proper number of AGs for your workload and SSD size.
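
Just to be sure I follow the layout you describe here, is it roughly this (device names are placeholders only)?

$ mdadm --create /dev/md1 --level=5 --raid-devices=7 --spare-devices=1 /dev/sd[b-i]
  (and the same for /dev/md2, /dev/md3, /dev/md4 on the other three HBAs)
$ mdadm --create /dev/md10 --level=linear --raid-devices=4 /dev/md1 /dev/md2 /dev/md3 /dev/md4

That is, one 6+1 RAID5 plus a spare per HBA, then the four RAID5s concatenated with md linear and XFS on top.
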
> With 8 sufficiently stout CPU cores (4 cores to handle the 4 md/RAID5 write threads and 4 for interrupts, the application, etc) on a good high bandwidth system board design, proper Linux tuning, msi-x, irqbalance, etc, you should be able to get 300MB/s per SSD, or ~7GB/s aggregate, more than double that of the, I'm guessing here, 9286-8e you're using. Bear in mind that hitting write IO of 7GB/s, with any hardware, will require substantial tuning effort. Joe Landman probably has more experience than anyone here with such SSD setups and might be able to give you some pointers, should 2.5GB/s be insufficient for your needs. ;)
>
>> I also don't need you to explain this to me as I understand exactly why it's slow. I came here asking a simple question which we can summarize as: "If I have a PV which is based on an mdadm array, and I then expand the mdadm array, do I lose the data alignment in LVM?"
>
> I "explained" the XFS + md option because it eliminates this alignment confusion entirely, and it's simply a less complex, better solution.
>
>> I always opt for the mailing list as the last option, and this was also a suggestion I got from the IRC channel.
>
> XFS + md linear was suggested on IRC? Or something to do with LVM was suggested?
>
>>>> The problem I have in this setup is that I couldn't make it work. I know I need to align the XFS allocation groups with the LV boundaries,
>
> Again, using the setup I mentioned, XFS writeout will be in 4KB blocks, eliminating possible filesystem stripe misalignment issues. You can still align XFS if you want, but for a workload comprised of 16KB writes it won't gain you much, if anything. The BBWC will take care of coalescing adjacent sectors during writeback scheduling, so FS stripe alignment isn't a big issue, especially with delaylog coalescing log writes. Speaking of which, are you using kernel 2.6.39 or later?
>
>>>> but I couldn't find a way to do it correctly; during my benchmarks I utilized only 1 disk and didn't get that much parallel I/O (regardless of threads).
>
> If you were allocating to a single directory or just a few, this would tend to explain the lack of parallelism. XFS allocations to a single directory, therefore a single AG, are mostly serialized. To get allocation parallelism, you must allocate to multiple AGs, i.e. to multiple directories. This AG based allocation parallelism is one of XFS' greatest strengths, if one's workload does a lot of parallel allocation.
>
> Speaking of which, you seem to be benchmarking file creation. Most database workloads are append heavy, not allocation heavy. Can you briefly describe your workload's file access patterns? How many db files will you have, and which ones will be written to often? Having 20+ SSDs may not help your application much if most of the write IO is to only a handful of files.
>
> --
> Stan
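
For completeness, the sysbench runs behind the numbers above look roughly like the following; the file size, block size and exact option set are illustrative, and the real runs cycle through seqrd, rndrd and rndrw (--file-rw-ratio=4 approximates the 80/20 read/write mix):

$ sysbench --test=fileio --file-num=3 --file-total-size=200G prepare
$ sysbench --test=fileio --file-num=3 --file-total-size=200G \
      --file-test-mode=rndrw --file-rw-ratio=4 --file-block-size=16K \
      --num-threads=64 --max-time=300 --max-requests=0 run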