Folks,

FYI, here is my current XFS patch stack that I'll be trying to get
ready in time for the 2.6.38 merge window. Note that the first two
patches are candidates for 2.6.37-rc. They are a perag reference
counting fix and the movement of a trace point.

My tree is currently based on the VFS locking changes I have out for
review, so there's a couple of patches that won't apply sanely to a
mainline or OSS xfs dev tree. See below for a pointer to a git tree
with all the patches in it.

First patch is a per-cpu superblock counter rewrite. This uses the
generic per-cpu counter infrastructure to do the heavy lifting. It
needs to be split into two patches.

Following this are the dynamic speculative allocation patches. These
have been rewritten to be based on the current inode size rather than
a thumb-in-the-air how-many-preallocs-have-we-already-done algorithm.
There are also some fixes in the second patch that fix assumptions
about ip->i_delayed_blks being zero after a flush.

Next up we have the inode cache RCU freeing and lookup patches,
including one that avoids putting the inode in the VFS hash (similar
to Christoph's patch, but using the different VFS code).

Then there are buffer cache reclaim changes. First is a per-buftarg
shrinker interface, followed by a lazily updated per-buftarg buffer
LRU. Building on this is a patch connecting up the prioritised buffer
reclaim hooks that ensure more critical buffers are harder to reclaim.

AIL lock contention fixes are next, with bulk AIL insert and removal
functions being implemented and connected up to the transaction commit
and inode buffer IO completion routines. These significantly reduce
AIL lock contention, and combined with a reduction in the granularity
of xfsaild push wakeups, the AIL lock drops out of the "top 10"
contended locks on 8-way workloads.

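
To illustrate why the bulk AIL insertion described above cuts lock
traffic, here is a minimal userspace sketch. It is not the code from
the patches: the struct and function names are hypothetical, and the
"lock" is just a counter standing in for a spinlock so that the number
of lock round trips is visible. The point is only that a batch of items
costs one acquisition instead of one per item.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a log item tracked in the AIL. */
struct ail_item {
	uint64_t	lsn;
	struct ail_item	*next;
};

/* Hypothetical AIL: a singly linked list kept in ascending LSN order. */
struct ail {
	struct ail_item	*head;
	unsigned long	lock_acquisitions;	/* counts lock round trips */
};

/* Stand-ins for spin_lock()/spin_unlock(); they only count acquisitions. */
static void ail_lock(struct ail *a)   { a->lock_acquisitions++; }
static void ail_unlock(struct ail *a) { (void)a; }

/* Insert one item in LSN order. Caller must hold the lock. */
static void ail_insert_locked(struct ail *a, struct ail_item *item)
{
	struct ail_item **pp = &a->head;

	while (*pp && (*pp)->lsn <= item->lsn)
		pp = &(*pp)->next;
	item->next = *pp;
	*pp = item;
}

/* Per-item insertion: one lock acquisition for every item. */
static void ail_insert_one(struct ail *a, struct ail_item *item)
{
	ail_lock(a);
	ail_insert_locked(a, item);
	ail_unlock(a);
}

/* Bulk insertion: one lock acquisition for the whole batch. */
static void ail_insert_bulk(struct ail *a, struct ail_item **items, size_t n)
{
	ail_lock(a);
	for (size_t i = 0; i < n; i++)
		ail_insert_locked(a, items[i]);
	ail_unlock(a);
}
```

Inserting N items one at a time through ail_insert_one() takes the lock
N times, while a single ail_insert_bulk() call takes it once.
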
There's a fix to avoid error injection from burning CPU on debug
kernels - with a badly fragmented freespace tree, the btree block
validation was taking ~60% of the CPU time, with most of that running
error injection checks.

Finally, there's a patch to split up the log grant lock. This needs
splitting into 4 or 5 smaller patches (as you can see from the commit
log, that is how it originally was). It splits the grant lock into two
list locks (reserve and write queues), and converts all the other
variables that the grant lock protected into atomic variables. Grant
head calculations are made atomic by converting them into 64 bit
"LSNs" and the use of cmpxchg loops on atomic 64 bit variables. All
log tail and sync LSN updates are made atomic via conversion to atomic
variables. With this, the grant lock goes away completely, and the
transaction reserve fast path now only has two cmpxchg loops instead
of a heavily contended spin lock.

The result of all this is raw CPU bound 8-way create performance of
just over 100,000 inodes/s, and unlink performance of over 90,000
inodes/s. 8-way dbench performance is improved from ~1150MB/s to
~1650MB/s by this patchset.

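
For readers unfamiliar with the pattern, the grant head conversion
described above can be sketched in userspace C11 roughly as follows.
This is an illustration under assumptions, not the patch itself: the
packing of a cycle number in the upper 32 bits and a byte offset in the
lower 32 bits, and all of the function names, are hypothetical. It also
assumes a reservation never exceeds the log size.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed packing: upper 32 bits = cycle, lower 32 bits = byte offset. */
static inline uint64_t grant_pack(uint32_t cycle, uint32_t bytes)
{
	return ((uint64_t)cycle << 32) | bytes;
}

static inline uint32_t grant_cycle(uint64_t head)
{
	return (uint32_t)(head >> 32);
}

static inline uint32_t grant_bytes(uint64_t head)
{
	return (uint32_t)head;
}

/*
 * Advance the grant head by len bytes in a circular log of log_size
 * bytes, bumping the cycle when the offset wraps. No lock is taken: the
 * new value is computed from a snapshot and published with a
 * compare-and-exchange, retrying if another thread moved the head first.
 * Assumes len <= log_size. Returns the value that was published.
 */
static uint64_t grant_add_space(_Atomic uint64_t *head, uint32_t log_size,
				uint32_t len)
{
	uint64_t old = atomic_load(head);
	uint64_t new;

	do {
		uint32_t cycle = grant_cycle(old);
		uint64_t bytes = (uint64_t)grant_bytes(old) + len;

		if (bytes >= log_size) {
			bytes -= log_size;
			cycle++;
		}
		new = grant_pack(cycle, (uint32_t)bytes);
		/* On failure, old is reloaded with the current head value. */
	} while (!atomic_compare_exchange_weak(head, &old, new));

	return new;
}
```

Because cycle and offset live in one 64 bit word, a single
compare-and-exchange updates both consistently, which is what lets the
spin lock go away on this path.
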
For 8-way creation and unlink of small files (~50 million), the
lockstat profiles look like:

                                contended         total
Lock                         acquisitions  acquisitions  Description
----------------------------- ------------  ------------  -----------------------
inode_wb_list_lock:             496330785     836287347  VFS
dcache_lock:                    116299583     681450027  VFS
&(&vblk->lock)->rlock:           52829329     131054495  virtio block device
&sb->s_type->i_lock_key#1:       41772196    2375571240  VFS (inode->i_lock)
&(&cil->xc_cil_lock)->rlock:     29549897     410553961  XFS (CIL commit lock)
&irq_desc_lock_class:            27520142      63908701  IRQ edge lock
&(&pag->pag_buf_lock)->rlock:    11756249    1838039685  XFS (buffer cache lock)
&(&dentry->d_lock)->rlock:        5735657    1225028487  VFS
&(&parent->list_lock)->rlock:     4356293     249408696  VM (SLAB list lock)
inode_sb_list_lock:               3616366     203712449  VFS
key#5:                            2075310     139221312  XFS SB percpu counter
inode_hash_lock:                  1529969     102359626  VFS
rcu_node_level_0:                 1363470      13730113  RCU
&(&zone->lock)->rlock:            1247467      16469316  VM (free list lock)
&(&pag->pag_ici_lock)->rlock:      770880     337090972  XFS (inode cache lock)
&rq->lock:                         589111     184220946  Scheduler
inode_lru_lock:                    527163     102791204  VFS
g->l_grant_write_lock)->rlock:     526471      51279626  XFS (grant write lock)
&(&pag->pagb_lock)->rlock:         402878     208861744  XFS (busy extent list)
&(&zone->lru_lock)->rlock:         167692      25383748  VM (page cache LRU)
&on_slab_l3_key:                   166183      58470153  VM (slab cache)
semaphore->lock#2:                 161321    3659173925  ???
&(&ailp->xa_lock)->rlock:          143859     164470123  XFS (AIL lock)
&cil->xc_ctx_lock-W:                32850        173279  XFS (CIL push lock)
&cil->xc_ctx_lock-R:                90868     357572724  XFS (CIL push lock)

I've yet to determine whether I'll have the time to finish the removal
of the page cache from the buffer cache - for pure inode create/unlink
workloads the buftarg mapping tree lock is the second most heavily
contended lock in the system. Hence this definitely needs solving in
some way or another....

Anyway, comments are welcome - just keep in mind that there is still
some polish required for these patches.
;)

If you want the git version, everything is here:

  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working

Dave Chinner (16):
      xfs: fix per-ag reference counting in inode reclaim tree walking
      xfs: move delayed write buffer trace
      [RFC] xfs: use generic per-cpu counter infrastructure
      xfs: dynamic speculative EOF preallocation
      xfs: don't truncate prealloc from frequently accessed inodes
      patch xfs-inode-hash-fake
      xfs: convert inode cache lookups to use RCU locking
      xfs: convert pag_ici_lock to a spin lock
      xfs: convert xfsbud shrinker to a per-buftarg shrinker.
      xfs: add a lru to the XFS buffer cache
      xfs: connect up buffer reclaim priority hooks
      xfs: bulk AIL insertion during transaction commit
      xfs: reduce the number of AIL push wakeups
      xfs: remove all the inodes on a buffer from the AIL in bulk
      xfs: only run xfs_error_test if error injection is active
      xfs: make xlog_space_left() independent of the grant lock

 fs/xfs/linux-2.6/xfs_buf.c     |  239 ++++++++----
 fs/xfs/linux-2.6/xfs_buf.h     |   43 ++-
 fs/xfs/linux-2.6/xfs_iops.c    |   11 +-
 fs/xfs/linux-2.6/xfs_linux.h   |    9 -
 fs/xfs/linux-2.6/xfs_super.c   |   22 +-
 fs/xfs/linux-2.6/xfs_sync.c    |   28 +-
 fs/xfs/linux-2.6/xfs_trace.h   |   36 +-
 fs/xfs/quota/xfs_dquot.c       |    2 +-
 fs/xfs/quota/xfs_qm_syscalls.c |    3 +
 fs/xfs/xfs_ag.h                |    2 +-
 fs/xfs/xfs_alloc.c             |    4 +-
 fs/xfs/xfs_bmap.c              |    9 +-
 fs/xfs/xfs_btree.c             |   11 +-
 fs/xfs/xfs_buf_item.c          |   17 +-
 fs/xfs/xfs_da_btree.c          |    4 +-
 fs/xfs/xfs_dfrag.c             |   13 +
 fs/xfs/xfs_error.c             |    3 +
 fs/xfs/xfs_error.h             |    5 +-
 fs/xfs/xfs_extfree_item.c      |   85 +++--
 fs/xfs/xfs_extfree_item.h      |   12 +-
 fs/xfs/xfs_fsops.c             |    4 +-
 fs/xfs/xfs_ialloc.c            |    2 +-
 fs/xfs/xfs_iget.c              |   55 ++-
 fs/xfs/xfs_inode.c             |   24 +-
 fs/xfs/xfs_inode.h             |    1 +
 fs/xfs/xfs_inode_item.c        |  112 +++++-
 fs/xfs/xfs_iomap.c             |   53 ++-
 fs/xfs/xfs_log.c               |  678 +++++++++++++++++---------------
 fs/xfs/xfs_log_cil.c           |    9 +-
 fs/xfs/xfs_log_priv.h          |   40 ++-
 fs/xfs/xfs_log_recover.c       |   27 +-
 fs/xfs/xfs_mount.c             |  837 +++++++++++----------------------------
 fs/xfs/xfs_mount.h             |   80 +---
 fs/xfs/xfs_trans.c             |   70 ++++-
 fs/xfs/xfs_trans.h             |    2 +-
 fs/xfs/xfs_trans_ail.c         |  189 ++++++++-
 fs/xfs/xfs_trans_extfree.c     |    4 +-
 fs/xfs/xfs_trans_priv.h        |   13 +-
 fs/xfs/xfs_vnodeops.c          |   61 ++-
 include/linux/percpu_counter.h |   16 +
 lib/percpu_counter.c           |   79 ++++
 41 files changed, 1593 insertions(+), 1321 deletions(-)

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs