On Fri, May 06, 2011 at 06:06:48PM +0800, Wu Fengguang wrote:
> On Fri, May 06, 2011 at 04:42:38PM +0800, Wu Fengguang wrote:
> > > patched trace-tar-dd-ext4-2.6.39-rc3+
> > > > > flush-8:0-3048 [004] 1929.981734: writeback_queue_io: bdi 8:0: older=4296600898 age=2 enqueue=13227
> > > > > vanilla trace-tar-dd-ext4-2.6.39-rc3
> > > > > flush-8:0-2911 [004] 77.158312: writeback_queue_io: bdi 8:0: older=0 age=-1 enqueue=18938
> > > > > flush-8:0-2911 [000] 82.461064: writeback_queue_io: bdi 8:0: older=0 age=-1 enqueue=6957
> >
> > It looks too much to move 13227 and 18938 inodes at once. So I tried
> > arbitrarily limiting the max move number to 1000 and it helps reduce
> > the lock hold time and contentions a lot.
>
> Oh it seems 1000 is too small at least for this workload, it hurts
> dd+tar+sync total elapsed time.
>
> no limit:
>         avg     167.486
>         stddev    8.996
> limit=1000:
>         avg     171.222
>         stddev    5.588
> limit=3000:
>         avg     165.335
>         stddev    5.503
>
> So use 3000 as the new limit.

I don't think that's even enough. The number is going to be workload
dependent and while a limit might be a good idea, I don't think it
can be chosen just from one simple benchmark. e.g. what does it do
to the performance of workloads creating tens of thousands of small
dirty files a second?

....

> class name            con-bounces  contentions  waittime-min  waittime-max  waittime-total  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
> vanilla 2.6.39-rc3:
> inode_wb_list_lock:          2063         2065          0.12       2648.66         5948.99        27475        943778          0.09       2704.76       498340.24

I wouldn't consider this a contended lock at all on this workload.

FWIW, my profiles on sustained 8-way small file creation workloads
on ext4 over tens of millions of inodes show a 0.1% contention rate
for the inode_wb_list_lock. That compares to a 2% contention rate
for the inode_lru_lock, a 4% contention rate on the
inode_sb_list_lock and a 6% contention rate on the inode_hash_lock.

So really, the inode_wb_list_lock is not the lock we need to spend
effort on optimising to the nth degree right now...

......

> limit=1000:
>
> dd+tar+sync total elapsed time (10 runs):
>         avg     171.222
>         stddev    5.588
>
> &(&wb->list_lock)->rlock:     842          842          0.14        101.10         1013.34        20489        970892          0.09        234.11       509829.79

.....

> limit=3000:
>
> dd+tar+sync total elapsed time (10 runs):
>         avg     165.335
>         stddev    5.503
>
> &(&wb->list_lock)->rlock:    1088         1092          0.11        245.08         3268.75        21124       1718636          0.09        384.53       849827.20

So, from this, acquisitions have nearly doubled, and the total lock
hold time has almost doubled as well. That seems like there's a fair
bit of inefficiency introduced. What does it do to the CPU time
consumed by queue_io() (perf top is your friend)?

FYI, queue_io() is already a _massive_ CPU hog. See commit dcd79a1
("xfs: don't use vfs writeback for pure metadata modifications") for
how XFS tries to avoid putting dirty inodes on the list if at all
possible:

    Under heavy multi-way parallel create workloads, the VFS struggles
    to write back all the inodes that have been changed in age order.
    The bdi flusher thread becomes CPU bound, spending 85% of it's time
    in the VFS code, mostly traversing the superblock dirty inode list
    to separate dirty inodes old enough to flush.

    We already keep an index of all metadata changes in age order - in
    the AIL - and continued log pressure will do age ordered writeback
    without any extra overhead at all. If there is no pressure on the
    log, the xfssyncd will periodically write back metadata in
    ascending disk address offset order so will be very efficient.

.....

We're moving towards only tracking inodes with dirty pages in the
b_dirty list for XFS because this time based expiry is so
inefficient. So anything that reduces the efficiency of
queue_io()....

Cheers,

Dave.
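
[Editorial sketch, not part of the original message.] The lock
comparison above can be checked with a few lines of arithmetic on the
quoted lockstat rows. This is only a rough illustration: the
dictionary keys and helper names are made up for this example, and
"contention rate" is assumed here to mean contentions divided by
acquisitions, which may not be exactly how the 0.1% figure for the
small file creation workload was derived.

```python
# Figures copied from the lockstat rows quoted in the message.
# Key names are illustrative only; they are not lockstat field names.
LOCKSTAT = {
    "vanilla":    {"contentions": 2065, "acquisitions": 943778,
                   "holdtime_total": 498340.24},
    "limit=1000": {"contentions": 842,  "acquisitions": 970892,
                   "holdtime_total": 509829.79},
    "limit=3000": {"contentions": 1092, "acquisitions": 1718636,
                   "holdtime_total": 849827.20},
}


def contention_rate(row):
    """Assumed definition: fraction of acquisitions that contended."""
    return row["contentions"] / row["acquisitions"]


def ratio(field, new, old):
    """How much larger `field` is in run `new` than in run `old`."""
    return LOCKSTAT[new][field] / LOCKSTAT[old][field]


def mean_hold(row):
    """Average hold time per acquisition, in lockstat's time units."""
    return row["holdtime_total"] / row["acquisitions"]


for name, row in LOCKSTAT.items():
    print(f"{name:11s} contention rate {100 * contention_rate(row):.3f}%"
          f"  mean hold {mean_hold(row):.3f}")

print("acquisitions, limit=3000 vs limit=1000: "
      f"{ratio('acquisitions', 'limit=3000', 'limit=1000'):.2f}x")
print("holdtime-total, limit=3000 vs limit=1000: "
      f"{ratio('holdtime_total', 'limit=3000', 'limit=1000'):.2f}x")
```

On these numbers the contention rate stays well under 1% in all three
runs, while both acquisitions and total hold time grow by well over
half going from limit=1000 to limit=3000.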
--
Dave Chinner
david@xxxxxxxxxxxxx
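
[Editorial sketch, not part of the original message.] The trade-off
being argued over, shorter lock holds versus more passes over the
dirty list, can be seen in a toy userspace model. This is not the
kernel's queue_io(): the function signature, the max_move parameter,
the drain() helper and the tuple layout are all assumptions made for
illustration, and each call simply stands in for one lock hold.

```python
from collections import deque


def queue_io(b_dirty, b_io, older_than, max_move=None):
    """Toy model: move expired inodes from b_dirty to b_io.

    b_dirty holds (inode_id, dirtied_when) tuples, oldest first.
    At most max_move entries are moved per call (None = no cap).
    Returns the number of entries moved.
    """
    moved = 0
    while b_dirty:
        if max_move is not None and moved >= max_move:
            break  # cap reached; the rest waits for the next pass
        _, dirtied_when = b_dirty[0]
        if dirtied_when > older_than:
            break  # everything left is too young to expire
        b_io.append(b_dirty.popleft())
        moved += 1
    return moved


def drain(n_expired, max_move):
    """Count the passes needed to move n_expired already-expired inodes."""
    b_dirty = deque((i, 0) for i in range(n_expired))
    b_io = deque()
    passes = 0
    while queue_io(b_dirty, b_io, older_than=0, max_move=max_move):
        passes += 1
    return passes, len(b_io)


# 13227 is the enqueue count from the patched trace quoted above.
for cap in (None, 3000, 1000):
    passes, moved = drain(13227, cap)
    print(f"max_move={cap}: {passes} pass(es), {moved} inodes moved")
```

In this model a smaller cap bounds the work done per pass but
multiplies the number of passes, which is at least consistent with the
higher acquisition counts in the lockstat output. It says nothing
about CPU time in the real queue_io(); that still needs profiling.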