On 3/4/2013 10:26 AM, Adam Goryachev wrote:

>> Whatever value you choose, make it permanent by adding this entry to
>> root's crontab:
>>
>> @reboot /bin/echo 32768 > /sys/block/md0/md/stripe_cache_size
>
> Already added to /etc/rc.local along with the config to set the deadline
> scheduler for each of the RAID drives.

You should be using noop for SSD, not deadline. noop may improve your
FIO throughput, and real workload, even further.

Also, did you verify with a reboot that stripe_cache_size is actually
being set correctly at startup? If it's not working as assumed you'll
be losing several hundred MB/s of write throughput at the next reboot.
Something this critical should always be tested and verified.

> stripe_cache_size = 4096
>> READ: io=131072MB, aggrb=2504MB/s, minb=2564MB/s, maxb=2564MB/s, mint=52348msec, maxt=52348msec
>> WRITE: io=131072MB, aggrb=1590MB/s, minb=1628MB/s, maxb=1628MB/s, mint=82455msec, maxt=82455msec

Wow, we're up to 1.6 GB/s data throughput, 2 GB/s total md device
throughput once parity is included. That's 407MB/s per SSD. This is
much more in line with what one would expect from a RAID5 using 5 large,
fast SandForce SSDs. This is 80% of the single drive streaming write
throughput of this SSD model, as tested by Anandtech, Tom's, and others.

I'm a bit surprised we're achieving 2 GB/s parity write throughput with
the single threaded RAID5 driver on one core. Those 3.3GHz Ivy Bridge
cores are stouter than I thought. Disabling HT probably helped a bit
here. I'm anxious to see the top output file for this run (if you made
one--you should for each and every FIO run). Surely we're close to
peaking the core here.

> stripe_cache_size = 8192
>> READ: io=131072MB, aggrb=2487MB/s, minb=2547MB/s, maxb=2547MB/s, mint=52697msec, maxt=52697msec
>> WRITE: io=131072MB, aggrb=1521MB/s, minb=1557MB/s, maxb=1557MB/s, mint=86188msec, maxt=86188msec

Interesting. 4096/8192 are both higher by ~300MB/s compared to the
previous 1292MB/s you posted for 8192.
Some other workload must have been active during the previous run, or
something else has changed.

> stripe_cache_size = 16384
>> READ: io=131072MB, aggrb=2494MB/s, minb=2554MB/s, maxb=2554MB/s, mint=52556msec, maxt=52556msec
>> WRITE: io=131072MB, aggrb=1368MB/s, minb=1401MB/s, maxb=1401MB/s, mint=95779msec, maxt=95779msec
>
> stripe_cache_size = 32768
>> READ: io=131072MB, aggrb=2489MB/s, minb=2549MB/s, maxb=2549MB/s, mint=52661msec, maxt=52661msec
>> WRITE: io=131072MB, aggrb=1138MB/s, minb=1165MB/s, maxb=1165MB/s, mint=115209msec, maxt=115209msec

This is why you test, and test, and test when tuning for performance.
4096 seems to be your sweet spot.

> (let me know if you want the full fio output....)

No, the summary is fine. What's more valuable is to have the top output
file for each run so I can see what's going on. At 2 GB/s of throughput
your interrupt rate should be pretty high, and I'd like to see the IRQ
spread across the cores, as well as the RAID5 thread load, among other
things. I haven't yet looked at the file you sent, but I'm guessing it
doesn't include this 1.6GB/s run. I'm really interested in seeing that
one, and the ones for 16384 and 32768. WRT the latter two, I'm curious
whether the much larger tables are causing excessive CPU burn, which may
in turn be what lowers throughput.

> This seems to show that DRBD did not slow things down at all... I don't

I noticed.

> remember exactly when I did the previous fio tests with drbd connected,
> but perhaps I've made changes to the drbd config since then and/or
> upgraded from the debian stable drbd to 8.3.15

Maybe it wasn't actively syncing when you made these FIO runs.

> Let's re-run the above tests with DRBD stopped:
...
> stripe_cache_size = 4096
>> READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52465msec, maxt=52465msec
>> WRITE: io=131072MB, aggrb=1596MB/s, minb=1634MB/s, maxb=1634MB/s, mint=82126msec, maxt=82126msec
>
> stripe_cache_size = 8192
>> READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52469msec, maxt=52469msec
>> WRITE: io=131072MB, aggrb=1514MB/s, minb=1550MB/s, maxb=1550MB/s, mint=86565msec, maxt=86565msec
...

Numbers are identical. Either DRBD wasn't actually copying anything
during the previous FIO run, its nice level changed, its
configuration/behavior changed with the new version, or something else
is different. Whatever the reason, it appears to be putting no load on
the array.

> So, it looks like the ideal value is actually smaller (4096) although
> there is not much difference between 8192 and 4096. It seems strange
> that a larger cache size will actually reduce performance... I'll change

It's not strange at all, but expected. As a table gets larger it takes
more CPU cycles to manage it and more memory bandwidth; your cache miss
rate increases, etc. At a certain point this overhead becomes
detrimental instead of beneficial. In your case the benefit of a larger
cache table outweighs its overhead and yields increased performance up
to an 80MB table size. At 160MB and above the size of the table creates
more overhead than performance benefit.

This is what system testing/tuning is all about.

> to 4096 for the time being, unless you think "real world" performance
> might be better with 8192?

These FIO runs are hitting your IO subsystem much harder than your real
workloads ever will. Stick with 4096.
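
The 80MB and 160MB table sizes mentioned above follow from the formula
given in the md documentation: memory = page size * number of member
devices * stripe_cache_size. Below is a minimal sketch of that
arithmetic, assuming 4 KiB pages and the 5-drive array from this thread.
The function name stripe_cache_mib is made up for illustration and is
not part of mdadm or the kernel.

```shell
# Estimate md stripe cache memory use in MiB.
# Formula (md docs): page_size * nr_disks * stripe_cache_size.
# Assumption: 4 KiB pages. stripe_cache_mib is a hypothetical helper.
stripe_cache_mib() {
    # $1 = stripe_cache_size (entries), $2 = number of member devices
    echo $(( $1 * 4096 * $2 / 1048576 ))
}

# Print the footprint for each value tried in this thread (5 devices).
for size in 256 4096 8192 16384 32768; do
    echo "stripe_cache_size=$size -> $(stripe_cache_mib "$size" 5) MiB"
done
```

With 5 devices this gives 80 MiB at 4096 and 160 MiB at 8192, matching
the figures quoted above, and 640 MiB at 32768.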
> Here are the results of re-running fio using the previous config (with
> drbd connected with the stripe_cache_size = 8192):
>> READ: io=4096MB, aggrb=2244MB/s, minb=2298MB/s, maxb=2298MB/s, mint=1825msec, maxt=1825msec
>> WRITE: io=4096MB, aggrb=494903KB/s, minb=506780KB/s, maxb=506780KB/s, mint=8475msec, maxt=8475msec
>
> Perhaps the old fio test just isn't as well suited to the way drbd
> handles things. Though the issue would be what sort of data the real
> users are doing, because if that matches the old fio test or the new fio
> test, it makes a big difference.

The significantly lower throughput of the "old" FIO job has *nothing* to
do with DRBD. It has everything to do with the parameters of the job
file. I thought I explained the differences previously. If not, here
you go:

1. My FIO job has 16 workers submitting IO in parallel
   The "old" job has a single worker submitting serially
   -- both are using AIO

2. My FIO job uses zeroed buffers, allowing the SSDs to compress data
   The old job uses randomized data, thus SSD compression is lower

3. My FIO job does 256KB IOs, each one filling a RAID stripe
   The old job does 64KB IOs, each one filling one chunk

4. My FIO job does random IOs, spreading the writes over the volume
   The old job does serial IOs
   -- the SandForce controllers have 8 channels and can write to all 8
   in parallel. Writing randomly creates more opportunity for the
   controller to write multiple channels concurrently

My FIO job simulates a large multiuser heavy concurrent IO workload. It
creates 16 threads, 4 running on each core. In parallel, they submit a
massive amount of random, stripe width writes, containing uniform data,
asynchronously, to the block device, here the md/RAID5 device. Doing
this ensures the IO pipeline is completely full all the time, with zero
delays between submissions.

The "old" FIO job creates a single thread which submits chunk size
overlapping writes asynchronously via the io_submit() system call
(libaio).
Contrary to apparently popular belief, this does not allow one to send a
continuous stream of overlapping writes from a single thread with no
time slice gaps between the system calls. My FIO job threads use
io_submit() as well, but there are 16 threads submitting in parallel,
leaving no time gaps between IO submissions, with massive truly
overlapping IOs. This parallel job could be run with any number of FIO
engines with the same results. I stuck with AIO for direct comparison
as we're doing here. Because it is sending so many more IOs per unit
time than the single threaded job, the larger md stripe cache is of
great benefit. The single threaded job isn't submitting sufficient IOs
per unit time for the larger stripe cache to make a difference.

The takeaway here is not that my FIO job makes the SSD RAID faster. It
simply pushes a sufficient amount of IO to demonstrate the intrinsic
high throughput the array is capable of. For those fond of car
analogies: the old FIO test is barely pushing on the throttle; my FIO
test is hammering the pedal to the floor. Same car, same speed
potential, just different amounts of load applied to the pedal.

> So, it looks like it is the stripe_cache_size that is affecting
> performance, and that DRBD makes no difference whether it is connected
> or not. Possibly removing it completely would increase performance
> somewhat, but since I actually do need it, and that is somewhat
> destructive, I won't try that :)

I'd do more investigating of this. DRBD can't put zero load on the
array if it's doing work. Given it's a read only workload, it's
possible the increased stripe cache is allowing full throttle writes
while doing 100MB/s of reads, without writes being impacted. You'll
need to look deeper into the md statistics and/or monitor iostat, etc,
during runs with DRBD active and actually moving data.

> Will stick with 4096 for the moment based on the above results.

That's my recommendation.
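
The IO sizes in point 3 of the job comparison follow from the array
geometry: on RAID5 a full stripe holds one chunk per data disk, and one
member's chunk in each stripe is parity. Below is a minimal sketch of
that arithmetic, assuming a 64KB chunk (inferred from "64KB IOs, each
one filling one chunk") and the 5-SSD array in this thread. The helper
name full_stripe_kib is hypothetical.

```shell
# Full-stripe write size for RAID5: chunk size * data disks,
# where data disks = member devices - 1 (one chunk per stripe is parity).
# full_stripe_kib is a hypothetical helper, not part of mdadm or fio.
full_stripe_kib() {
    # $1 = chunk size in KiB, $2 = number of member devices
    echo $(( $1 * ($2 - 1) ))
}

chunk_kib=64   # assumption: inferred from the 64KB chunk-sized IOs
devices=5      # the 5-SSD RAID5 discussed in this thread

echo "full stripe: $(full_stripe_kib "$chunk_kib" "$devices") KiB"
echo "chunk-sized writes per full stripe: $(( devices - 1 ))"
```

This yields a 256 KiB full stripe, which is why the 256KB IOs each fill
a stripe while the 64KB IOs each cover only one of its four data chunks.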
>> FIO runs on Windows: http://www.bluestop.org/fio/
>
> Will check into that, it will be the ultimate end-to-end test.... Also,

Yes, it will, as long as you're running at least 16-32 threads per TS
client to overcome TCP/iSCSI over GbE latency, and the lack of AIO on
Windows. And you can't simply reuse the same job file. The docs tell
you which engine, and other settings, to use for Windows.

> Hmmm, good point, I realised I could try and upgrade to the x64 windows
> 2003, but I think I'd prefer to just move up to 2008 x64 (or 2012)...
> For now, I'll just keep using my hacky 4GB RAM drive for the pagefile...

Or violate BCP and run two TS instances per Xen, or even four, with the
appropriate number of users per each. KSM will consolidate all the
Windows and user application read only files (DLLs, exes, etc), yielding
much more free real memory than with a single Windows TS instance. AFAIK
Windows has no memory merging so you can't overcommit memory other than
with the page file, which is horribly less efficient than KSM.

> I meant I hadn't crossed off as many items from my list of things to
> do... Not that I hadn't improved performance significantly :)

I know, was just poking you in the ribs. ;)

>> To find out how much of the 732MB/s write throughput increase is due to
>> buffering 512 stripes instead of 16, simply change it back to 256,
>> re-run my FIO job file, and subtract the write result from 1292MB/s.
>
> So, running your FIO job file with the original 256 give a write speed
> of 950MB/s and the previous FIO file gives 509MB/s. So it would seem the
> increase in stripe_cache_size from 256 to 4096 give an increase in your
> FIO job from 950MB/s to 1634MB/s which is a significant speed boost. I

A 72 percent increase with this synthetic workload, by simply increasing
the stripe cache. Not bad eh? This job doesn't present an accurate
picture of real world performance though, as most synthetic tests don't.
Get DRBD a hump'n and your LVM snapshot(s) in place, all the normal
server side load, then fire up the 32 thread FIO test on each TS VM to
simulate users (I could probably knock out this job file if you like).
Then monitor the array throughput with iostat or similar. This would be
about as close to peak real world load as you can get.

> must wonder why we have a default of 256 when this can make such a
> significant performance improvement? A value of 4096 with a 5 drive raid
> array is only 80MB of cache, I suspect very few users with a 5 drive
> RAID array would be concerned about losing 80MB of RAM, and a 2 drive
> RAID array would only use 32MB ...

The stripe cache has nothing to do with device count, but hardware
throughput. Did you happen to notice what occurred when you increased
cache size past your 4096 sweet spot to 32768? Throughput dropped by
~500MB/s, almost 1/3rd. Likewise, for the slow rust array whose sweet
spot is 512, making the default 4096 will decrease his throughput, and
eat 80MB RAM for nothing. Defaults are chosen to work best with the
lowest common denominator hardware, not the Ferrari.

-- 
Stan