On 2011.03.02 at 09:58 +0100, Tejun Heo wrote:
> On Wed, Mar 02, 2011 at 10:30:57AM +0300, Michael Tokarev wrote:
> > > I believe the way the block layer uses it, basically it only saves
> > > the overhead of one transaction to the drive. It might be
> > > significant on some workloads (especially on high-IOPS drives like
> > > SSDs) but it's likely not a huge deal.
> >
> > One transaction per what? If it means an extra, especially "large",
> > transaction (like a flush with a wait) per each fsync-like call,
> > that can actually be a huge deal, especially on database-like
> > workloads (lots of small synchronous random writes).
>
> The way flushes are used by filesystems is that FUA is usually only
> used right after another FLUSH, i.e. using FUA replaces the
> FLUSH + commit block write + FLUSH sequence with
> FLUSH + FUA commit block write. Due to the preceding FLUSH, the cache
> is already empty, so the only difference between WRITE + FLUSH and
> FUA WRITE becomes the extra command issue overhead, which is usually
> almost unnoticeable compared to the actual IO.
>
> Another thing is that with the recent updates to block FLUSH
> handling, using FUA might even be less efficient. The new
> implementation aggressively merges those commit writes and flushes.
> IOW, depending on timing, multiple consecutive commit writes can be
> merged as
>
>   FLUSH + commit writes + FLUSH
>
> or
>
>   FLUSH + some commit writes + FLUSH + other commit writes + FLUSH
>
> and so on. These merges happen with fsync-heavy workloads, where
> FLUSH performance actually matters, and in those scenarios FUA
> writes are less effective because each FUA write carries an extra
> ordering restriction: with surrounding FLUSHes the drive is free to
> reorder the commit writes to maximize performance, whereas with FUA
> the disk has to jump around all over the place to execute each
> command in the exact issue order.
>
> I personally think FUA is a misfeature. It's a micro-optimization
> with shallow benefits even when used properly, while putting a much
> heavier restriction on the actual IO order, which usually is the
> slow part.

Thanks for the detailed information. Just to confirm your point, here
are some benchmark results (Seagate ST1500DL003 1.5TB 5900rpm, xfs
(delaylog), ffsb (http://sourceforge.net/projects/ffsb/) pure random
write benchmark):

1) 30sec run, 1 thread, 104*35MB files

Total Results
===============
Op Name          Transactions  Trans/sec  % Trans   % Op Weight  Throughput
=======          ============  =========  ========  ===========  ==========
write (FUA)    :       435456    1183.44  100.000%     100.000%   162MB/sec
write (no FUA) :       441600    1243.47  100.000%     100.000%   170MB/sec

System Call Latency statistics in millisecs
                       Min       Avg           Max  Total Calls
                  ========  ========  ============  ===========
[ write] FUA      0.000000  0.070392   5444.638184       435456
[ write] no FUA   0.000000  0.069718   4715.519043       441600

2) 240sec run, 2 threads, 104*35MB files

Total Results
===============
Op Name          Transactions  Trans/sec  % Trans   % Op Weight  Throughput
=======          ============  =========  ========  ===========  ==========
write (FUA)    :       594944     919.45  100.000%     100.000%   126MB/sec
write (no FUA) :       653824    1097.31  100.000%     100.000%   150MB/sec

System Call Latency statistics in millisecs
                       Min       Avg           Max  Total Calls
                  ========  ========  ============  ===========
[ write] FUA      0.000000  0.812704  13467.903320       594944
[ write] no FUA   0.000000  0.727761   9695.806641       653824

The ffsb profile for run 2 follows at the end of this mail (run 1 used
the same profile with time=30 and num_threads=1).

--
Markus
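To make the two command sequences above concrete, here is a minimal
kernel-style sketch of how a journalling filesystem might issue its
commit block with and without FUA. This is illustrative only, not the
actual jbd2/xfs code; the function write_commit_block() is made up for
the example, and it assumes a 2.6.37+ block API where submit_bh(),
WRITE_FLUSH, WRITE_FLUSH_FUA and the three-argument
blkdev_issue_flush() are available:

/*
 * Illustrative sketch only -- not actual jbd2/xfs code.
 * Assumes a 2.6.37+ kernel where WRITE_FLUSH, WRITE_FLUSH_FUA
 * and the three-argument blkdev_issue_flush() exist.
 */
#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/blkdev.h>

static int write_commit_block(struct buffer_head *bh,
                              struct block_device *bdev,
                              bool use_fua)
{
        if (use_fua) {
                /*
                 * FLUSH + FUA write: the preceding flush drains the
                 * drive's write cache, and FUA forces the commit
                 * block itself to stable media.  One command
                 * sequence, no trailing flush.
                 */
                submit_bh(WRITE_FLUSH_FUA, bh);
                wait_on_buffer(bh);
                return buffer_uptodate(bh) ? 0 : -EIO;
        }

        /*
         * FLUSH + write + FLUSH: the plain write may land in the
         * drive cache; the trailing flush is what makes the commit
         * block durable.
         */
        submit_bh(WRITE_FLUSH, bh);
        wait_on_buffer(bh);
        if (!buffer_uptodate(bh))
                return -EIO;
        return blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
}

The FUA path saves one round trip to the drive; Tejun's point above is
that the trailing blkdev_issue_flush() in the non-FUA path can be
merged with neighbouring flushes by the new FLUSH machinery, while a
FUA write pins each commit block to its exact issue order.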
# Large file random writes.
# 104 files, 35MB per file.
time=240
alignio=1

[filesystem0]
        location=/var/tmp/fs_bench
        num_files=104
        min_filesize=36700160   # 35 MB
        max_filesize=36700160
        reuse=1
[end0]

[threadgroup0]
        num_threads=2
        write_random=1
        write_weight=1
        write_size=1048576      # 1 MB
        write_blocksize=4096

        [stats]
                enable_stats=1
                enable_range=1

                msec_range  0.00 0.01
                msec_range  0.01 0.02
                msec_range  0.02 0.05
                msec_range  0.05 0.10
                msec_range  0.10 0.20
                msec_range  0.20 0.50
                msec_range  0.50 1.00
                msec_range  1.00 2.00
                msec_range  2.00 5.00
                msec_range  5.00 10.00
                msec_range 10.00 20.00
                msec_range 20.00 50.00
                msec_range 50.00 100.00
                msec_range 100.00 200.00
                msec_range 200.00 500.00
                msec_range 500.00 1000.00
                msec_range 1000.00 2000.00
                msec_range 2000.00 5000.00
                msec_range 5000.00 10000.00
        [end]
[end0]
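(For completeness: with a stock ffsb build, a profile like this is run
by passing it as the argument, e.g.

  $ mkdir -p /var/tmp/fs_bench
  $ ./ffsb random-write.cfg

where "random-write.cfg" is just a placeholder name. Note that the
profile itself has no FUA knob; on kernels of this era the toggle
would normally be out of band, e.g. libata's "fua" module parameter
(libata.fua=0/1) on the kernel command line.)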