On certain types of hardware, issuing a write cache flush takes a considerable amount of time. Typically, these are simple storage systems with write cache enabled and no battery to save that cache after a power failure. When we encounter a system with many I/O threads that write data and then call fsync after more transactions accumulate, ext4_sync_file performs a data-only flush, the performance of which is suboptimal because each of those threads issues its own flush command to the drive instead of trying to coordinate the flush, thereby wasting execution time.

Instead of each fsync call initiating its own flush, there's now a flag to indicate if (0) no flushes are ongoing, (1) we're delaying a short time to collect other fsync threads, or (2) we're actually in-progress on a flush.

So, if someone calls ext4_sync_file and no flushes are in progress, the flag shifts from 0->1 and the thread delays for a short time to see if there are any other threads that are close behind in ext4_sync_file. After that wait, the state transitions to 2 and the flush is issued. Once that's done, the state goes back to 0 and a completion is signalled.

Those close-behind threads see the flag is already 1, and go to sleep until the completion is signalled. Instead of issuing a flush themselves, they simply wait for that first thread to do it for them. If they see that the flag is 2, they wait for the current flush to finish, and start over.

However, there are a couple of exceptions to this rule. First, there exist high-end storage arrays with battery-backed write caches for which flush commands take very little time (< 2ms); on these systems, performing the coordination actually lowers performance. Given the earlier patch to the block layer to report low-level device flush times, we can detect this situation and have all threads issue flushes without coordinating, as we did before.
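
As a rough illustration of the flag handling and the fast-device exception described above, here is a minimal userspace sketch of the decision a thread makes on entering the fsync path. This is not the code in the patch: the enum names, fsync_decide(), and the FAST_FLUSH_NS constant are invented for this example, and the 2 ms cutoff is simply taken from the "< 2ms" figure above, not from the patch's actual tuning.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as described in the text above. */
enum flush_state {
	FLUSH_IDLE     = 0,	/* no flushes are ongoing */
	FLUSH_DELAYING = 1,	/* first thread is waiting to collect others */
	FLUSH_RUNNING  = 2	/* a flush is in progress */
};

/* What a thread entering fsync should do (hypothetical names). */
enum fsync_action {
	ACT_FLUSH_UNCOORDINATED, /* fast device: issue our own flush, as before */
	ACT_LEAD,		 /* 0->1, delay, 1->2, flush, 2->0, signal */
	ACT_WAIT_FOR_LEADER,	 /* saw 1: sleep until completion is signalled */
	ACT_WAIT_THEN_RETRY	 /* saw 2: wait for that flush, then start over */
};

/* Assumed cutoff below which coordination is not worth it (2 ms). */
#define FAST_FLUSH_NS 2000000ULL

enum fsync_action fsync_decide(enum flush_state state, uint64_t avg_flush_ns)
{
	/* Fast flushes (e.g. battery-backed cache): skip coordination. */
	if (avg_flush_ns < FAST_FLUSH_NS)
		return ACT_FLUSH_UNCOORDINATED;

	switch (state) {
	case FLUSH_IDLE:
		return ACT_LEAD;
	case FLUSH_DELAYING:
		return ACT_WAIT_FOR_LEADER;
	case FLUSH_RUNNING:
		return ACT_WAIT_THEN_RETRY;
	}

	assert(0 && "invalid flush state");
	return ACT_LEAD;
}
```

For example, with an average flush time of 10 ms, a thread that finds the flag at 1 gets ACT_WAIT_FOR_LEADER, while the same thread on a device averaging 0.5 ms gets ACT_FLUSH_UNCOORDINATED regardless of the flag.
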
The second case is when there's a single thread issuing flushes, in which case it can skip the coordination.

The author of this patch is aware that jbd2 has a similar flush coordination scheme for journal commits. An earlier version of this patch simply created a new empty journal transaction and committed it, but that approach was shown to increase the amount of write traffic heading towards the disk, which in turn lowered performance considerably, especially in the case where directio was in use. Therefore, this patch adds the coordination code directly to ext4.

To test the performance and safety of this patchset, I crafted an ffsb profile named fsync-happy that performs a bunch of disk I/O with periodic fsync()s to flush the data out to disk. Performance results can be seen here: http://bit.ly/fYAclV

The data presented in blue text represent results obtained on high performance disk arrays that have battery-backed write cache enabled. Red results on the "speed differences" page represent performance regressions, of course. Descriptions of the disk hardware tested are on the rightmost page. In no case were any of the benchmarks CPU-bound.

The speed differences page shows some interesting results. Before Tejun Heo's barrier -> flush conversion in 2.6.37-rc1, we saw that enabling barriers caused a 30-80 percent performance regression on a fairly large variety of test programs; generally, the more fsyncs, the bigger the drop; if one never fsyncs any data, the only flushes that ever happen are during the periodic journal commits. Now we see that the cost of enabling flushes in ext4 on the fsync-happy workload has dropped from about 80 percent to about 25-30 percent. With this fsync coordination patch, that drop becomes about 5-14 percent.

I see some small performance regressions (< 1 percent) for some hardware. This is generally acceptable because I see larger variances from repeatedly running fsync-happy.
The two larger regressions (elm3a4_ipr_nowc and elm3c44_sata_nowc) are a somewhat questionable case because those two disks have no write cache, yet ext4 was not properly detecting this and setting barrier=0. That bug will be addressed separately.

In terms of data safety, I've been performing power failure testing with a bunch of blades that have slow IDE disks with fairly large write caches. So far I haven't seen any more FS errors after reset than I see with 2.6.36.

This patchset consists of four patches. The first adds to the block layer the ability to measure the amount of time it takes for a lower-level block device to issue and complete a flush command. The middle two patches add to md and dm, respectively, the ability to report the component devices' flush times. For 2.6.37-rc3, the md patch also requires my earlier patch to md to enable REQ_FLUSH support. The fourth patch adds the auto-tuning fsync flush coordination to ext4.

To everyone who has reviewed this patch set so far, thank you for your help! As usual, I welcome any questions or comments.

--D