This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "XFS development tree".

The branch, master has been updated
  a9c7b13 xfs: pack xfs_buf structure more tightly
  c6942de xfs: convert buffer cache hash to rbtree
  6c97772 xfs: serialise inode reclaim within an AG
  e1a48db xfs: batch inode reclaim lookup
  c727163 xfs: implement batched inode lookups for AG walking
  7227905 xfs: split out inode walk inode grabbing
  fa78a91 xfs: split inode AG walking into separate code for reclaim
  7608770 xfs: remove buftarg hash for external devices
  00d42de xfs: use unhashed buffers for size checks
  ec09a3c xfs: kill XBF_FS_MANAGED buffers
  075a968 xfs: store xfs_mount in the buftarg instead of in the xfs_buf
  0c6b79a xfs: introduced uncached buffer read primitve
  e601d2f xfs: rename xfs_buf_get_nodaddr to be more appropriate
  0c9a0e0 xfs: don't use vfs writeback for pure metadata modifications
  ec9cb17 xfs: lockless per-ag lookups
  c07719e xfs: remove debug assert for per-ag reference counting
  1c34652 xfs: reduce the number of CIL lock round trips during commit
  3881f5f xfs: force background CIL push under sustained load
      from  e89318c670af3959db3aa483da509565f5a2536c (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit a9c7b1373fab80a039c11af9683d49a557825f61
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 19:59:15 2010 +1000

    xfs: pack xfs_buf structure more tightly

    pahole reports the struct xfs_buf has quite a few holes in it, so
    packing the structure better will reduce the size of it by 16 bytes.
    Also, move all the fields used in cache lookups into the first
    cacheline.

    Before on x86_64:

            /* size: 320, cachelines: 5 */
            /* sum members: 298, holes: 6, sum holes: 22 */

    After on x86_64:

            /* size: 304, cachelines: 5 */
            /* padding: 6 */
            /* last cacheline: 48 bytes */

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit c6942de96cd4b9cd03f26fd016a6fb7d275992d4
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 19:59:04 2010 +1000

    xfs: convert buffer cache hash to rbtree

    The buffer cache hash is showing typical hash scalability problems.
    In large scale testing the number of cached items grows far larger
    than the hash can efficiently handle. Hence we need to move to a
    self-scaling cache indexing mechanism.

    I have selected rbtrees for indexing because they can have O(log n)
    search scalability, and insert and remove cost is not excessive,
    even on large trees. Hence we should be able to cache large numbers
    of buffers without incurring the excessive cache miss search
    penalties that the hash is imposing on us.

    To ensure we still have parallel access to the cache, we need
    multiple trees. Rather than hashing the buffers by disk address to
    select a tree, it seems more sensible to separate trees by typical
    access patterns. Most operations use buffers from within a single AG
    at a time, so rather than searching lots of different lists,
    separate the buffer indexes out into per-AG rbtrees. This means that
    searches during metadata operation have a much higher chance of
    hitting cache resident nodes, and that updates of the tree are less
    likely to disturb trees being accessed on other CPUs doing
    independent operations.
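
    As a rough illustration of the per-AG indexing idea described
    above, the following user-space sketch picks a tree by allocation
    group and then searches only that tree. It is not the kernel code:
    all names (buf_cache, ag_index, cache_find, cache_insert) are
    hypothetical, AGs are assumed to be of uniform size, and a plain
    unbalanced binary search tree stands in for the kernel rbtree to
    keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the XFS source. */
struct buf_node {
    uint64_t blkno;                 /* lookup key: block address */
    struct buf_node *left, *right;
};

struct ag_index {
    struct buf_node *root;          /* one tree per AG */
};

struct buf_cache {
    uint64_t ag_blocks;             /* blocks per AG (assumed uniform) */
    uint32_t ag_count;
    struct ag_index *ags;
};

/* Select the tree by AG instead of hashing the block address. */
static struct ag_index *cache_ag_for(struct buf_cache *c, uint64_t blkno)
{
    assert(c->ag_blocks > 0);
    uint64_t agno = blkno / c->ag_blocks;
    return agno < c->ag_count ? &c->ags[agno] : NULL;
}

/* Search touches only the tree of the AG that owns blkno. */
static struct buf_node *cache_find(struct buf_cache *c, uint64_t blkno)
{
    struct ag_index *ag = cache_ag_for(c, blkno);
    struct buf_node *n = ag ? ag->root : NULL;

    while (n && n->blkno != blkno)
        n = blkno < n->blkno ? n->left : n->right;
    return n;
}

/* Returns 0 on success, -1 if blkno is out of range or already indexed. */
static int cache_insert(struct buf_cache *c, struct buf_node *node)
{
    struct ag_index *ag = cache_ag_for(c, node->blkno);
    struct buf_node **link;

    if (!ag)
        return -1;
    link = &ag->root;
    while (*link) {
        if ((*link)->blkno == node->blkno)
            return -1;
        link = node->blkno < (*link)->blkno ? &(*link)->left
                                            : &(*link)->right;
    }
    node->left = node->right = NULL;
    *link = node;
    return 0;
}
```

    In a real implementation each per-AG tree would also need its own
    lock, so that operations in different AGs do not contend with each
    other; that part is omitted here.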

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 6c977723efe0db8f028f674f2701a7f8ddb5d258
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Sep 27 11:09:51 2010 +1000

    xfs: serialise inode reclaim within an AG

    Memory reclaim via shrinkers has a terrible habit of having N+M
    concurrent shrinker executions (N = num CPUs, M = num kswapds) all
    trying to shrink the same cache. When the cache they are all working
    on is protected by a single spinlock, massive contention and
    slowdowns occur.

    Wrap the per-ag inode caches with a reclaim mutex to serialise
    reclaim access to the AG. This will block concurrent reclaim in each
    AG but still allow reclaim to scan multiple AGs concurrently. Allow
    shrinkers to move on to the next AG if it can't get the lock, and if
    we can't get any AG, then start blocking on locks.

    To prevent reclaimers from continually scanning the same inodes in
    each AG, add a cursor that tracks where the last reclaim got up to
    and start from that point on the next reclaim. This should avoid
    only ever scanning a small number of inodes at the start of each AG
    and not making progress. If we have a non-shrinker based reclaim
    pass, ignore the cursor and reset it to zero once we are done.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit e1a48dbec9ba6aa24ae61d4b8d412b2b39b2baa9
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 19:51:50 2010 +1000

    xfs: batch inode reclaim lookup

    Batch and optimise the per-ag inode lookup for reclaim to minimise
    scanning overhead. This involves gang lookups on the radix trees to
    get multiple inodes during each tree walk, and tighter validation of
    what inodes can be reclaimed without blocking before we take any
    locks.

    This is based on ideas suggested in a proof-of-concept patch posted
    by Nick Piggin.
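
    The batched walk described above, combined with the resume cursor
    from the reclaim serialisation change, can be sketched roughly as
    follows. This is a user-space illustration under stated
    assumptions, not the XFS implementation: a sorted array stands in
    for the per-AG radix tree, gang_lookup() and walk_reclaimable() are
    hypothetical names, and the batch size of 4 is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH 4   /* hypothetical batch size, kept small for illustration */

struct item {
    uint64_t ino;       /* index key */
    int reclaimable;    /* result of a cheap, lock-free validity check */
};

/*
 * Gang lookup over a sorted array (standing in for a radix tree):
 * collect up to max entries whose key is >= first. Returns the count.
 */
static size_t gang_lookup(const struct item *tbl, size_t n, uint64_t first,
                          const struct item **out, size_t max)
{
    size_t found = 0;

    for (size_t i = 0; i < n && found < max; i++)
        if (tbl[i].ino >= first)
            out[found++] = &tbl[i];
    return found;
}

/*
 * Scan up to nr_to_scan entries starting at *cursor, one batch per
 * lookup. The cursor is left just past the last entry seen so the next
 * call resumes there; it is reset to zero once the index is exhausted.
 * Returns the number of reclaimable entries seen.
 */
static size_t walk_reclaimable(const struct item *tbl, size_t n,
                               uint64_t *cursor, size_t nr_to_scan)
{
    size_t hits = 0;

    while (nr_to_scan > 0) {
        const struct item *batch[BATCH];
        size_t want = nr_to_scan < BATCH ? nr_to_scan : BATCH;
        size_t got = gang_lookup(tbl, n, *cursor, batch, want);

        if (got == 0) {
            *cursor = 0;    /* end of index: restart from the top next time */
            break;
        }
        for (size_t i = 0; i < got; i++)
            if (batch[i]->reclaimable)
                hits++;
        *cursor = batch[got - 1]->ino + 1;
        nr_to_scan -= got;
    }
    return hits;
}
```

    Successive calls with the same cursor make forward progress through
    the index rather than rescanning the first few entries each time.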

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit c7271639bcbc3246e8afbd74746d32f1a507782e
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Sep 28 12:28:19 2010 +1000

    xfs: implement batched inode lookups for AG walking

    With the reclaim code separated from the generic walking code, it is
    simple to implement batched lookups for the generic walk code.
    Separate out the inode validation from the execute operations and
    modify the tree lookups to get a batch of inodes at a time.

    Reclaim operations will be optimised separately.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 722790573bde4611dd1a3439d6f4e42d3c0cc65f
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Sep 28 12:28:06 2010 +1000

    xfs: split out inode walk inode grabbing

    When doing read side inode cache walks, the code to validate and
    grab an inode is common to all callers. Split it out of the execute
    callbacks in preparation for batching lookups. Similarly, split out
    the inode reference dropping from the execute callbacks into the
    main lookup loop to be symmetric with the grab.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit fa78a9124f57e85382b942b183ce2cf0a691d71a
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 18:40:15 2010 +1000

    xfs: split inode AG walking into separate code for reclaim

    The reclaim walk requires different locking and has a slightly
    different walk algorithm, so separate it out so that it can be
    optimised separately.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 7608770b317d97702410477db31c159739171b00
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: remove buftarg hash for external devices

    For RT and external log devices, we never use hashed buffers on them
    now. Remove the buftarg hash tables that are set up for them.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 00d42de4a2117d16c16750718242819e65889262
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: use unhashed buffers for size checks

    When we are checking we can access the last block of each device, we
    do not need to use cached buffers as they will be tossed away
    immediately. Use uncached buffers for size checks so that all IO
    prior to full in-memory structure initialisation does not use the
    buffer cache.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit ec09a3c36986a2bf2431e835870f499ba0074991
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: kill XBF_FS_MANAGED buffers

    Filesystem level managed buffers are buffers that have their
    lifecycle controlled by the filesystem layer, not the buffer cache.
    We currently cache these buffers, which makes cleanup and cache
    walking somewhat troublesome. Convert the fs managed buffers to
    uncached buffers obtained via xfs_buf_get_uncached(), and remove the
    XBF_FS_MANAGED special cases from the buffer cache.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 075a96845b43ff609476cc26d466d2e6c020eac5
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: store xfs_mount in the buftarg instead of in the xfs_buf

    Each buffer contains both a buftarg pointer and a mount pointer. If
    we add a mount pointer into the buftarg, we can avoid needing the
    b_mount field in every buffer and grab it from the buftarg when
    needed instead. This shrinks the xfs_buf by 8 bytes.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 0c6b79a05107490af559c9e5bfa6b906e910e1bf
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 21:58:31 2010 +1000

    xfs: introduced uncached buffer read primitve

    To avoid the need to use cached buffers for single-shot or buffers
    cached at the filesystem level, introduce a new buffer read
    primitive that bypasses the cache and reads directly from disk.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit e601d2feccfb957cc95dbb151f434ca390b43949
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 20:07:47 2010 +1000

    xfs: rename xfs_buf_get_nodaddr to be more appropriate

    xfs_buf_get_nodaddr() is really used to allocate a buffer that is
    uncached. While it is not directly assigned a disk address, the fact
    that they are not cached is a more important distinction. With the
    upcoming uncached buffer read primitive, we should be consistent
    with this distinction.

    While there, make page allocation in xfs_buf_get_nodaddr() safe
    against memory reclaim re-entrancy into the filesystem by allowing
    a flags parameter to be passed.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 0c9a0e0cdba9677ff78a2ec28f5ff8b4db530dd6
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Sep 28 12:27:25 2010 +1000

    xfs: don't use vfs writeback for pure metadata modifications

    Under heavy multi-way parallel create workloads, the VFS struggles
    to write back all the inodes that have been changed in age order.
    The bdi flusher thread becomes CPU bound, spending 85% of its time
    in the VFS code, mostly traversing the superblock dirty inode list
    to separate dirty inodes old enough to flush.

    We already keep an index of all metadata changes in age order - in
    the AIL - and continued log pressure will do age ordered writeback
    without any extra overhead at all. If there is no pressure on the
    log, the xfssyncd will periodically write back metadata in ascending
    disk address offset order so will be very efficient.

    Hence we can stop marking VFS inodes dirty during transaction commit
    or when changing timestamps during transactions. This will keep the
    inodes in the superblock dirty list to those containing data or
    unlogged metadata changes.

    However, the timestamp changes are slightly more complex than this -
    there are a couple of places that do unlogged updates of the
    timestamps, and the VFS needs to be informed of these. Hence add a
    new function xfs_trans_ichgtime() for transactional changes, and
    leave xfs_ichgtime() for the non-transactional changes.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>

commit ec9cb17171ce6179f788a28a3bf4614678305715
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: lockless per-ag lookups

    When we start taking a reference to the per-ag for every cached
    buffer in the system, kernel lockstat profiling on an 8-way create
    workload shows the mp->m_perag_lock has higher acquisition rates
    than the inode lock and has significantly more contention. That is,
    it becomes the highest contended lock in the system.

    The perag lookup is trivial to convert to lock-less RCU lookups
    because perag structures never go away. Hence the only thing we need
    to protect against is tree structure changes during a grow. This can
    be done simply by replacing the locking in xfs_perag_get() with RCU
    read locking. This removes the mp->m_perag_lock completely from this
    path.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit c07719e7fe1ca3bf98b89e8798ded068fe911ea1
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: remove debug assert for per-ag reference counting

    When we start taking references per cached buffer to the perag it is
    cached on, it will blow the current debug maximum reference count
    assert out of the water. The assert has never caught a bug, and we
    have tracing to track changes if there ever is a problem, so just
    remove it.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 1c34652755dd670b6a1db00c7d14f9511eeecc00
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 18:14:13 2010 +1000

    xfs: reduce the number of CIL lock round trips during commit

    When committing a transaction, we do a CIL state lock round trip on
    every single log vector we insert into the CIL.
    This is resulting in the lock being as hot as the inode and dcache
    locks on 8-way create workloads.

    Rework the insertion loops to bring the number of lock round trips
    to one per transaction for log vectors, and one more to do the busy
    extents. Also change the allocation of the log vector buffer not to
    zero it as we copy over the entire allocated buffer anyway.

    This patch also includes a structural cleanup to the CIL item
    insertion provided by Christoph Hellwig.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 3881f5f7fc84d444a0ff45b4bffc3c2d012703ce
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 18:13:44 2010 +1000

    xfs: force background CIL push under sustained load

    I have been seeing occasional pauses in transaction throughput up to
    30s long under heavy parallel workloads. The only notable thing was
    that the xfsaild was trying to be active during the pauses, but
    making no progress. It was running exactly 20 times a second (on the
    50ms no-progress backoff), and the number of pushbuf events was
    constant across this time as well. IOWs, the xfsaild appeared to be
    stuck on buffers that it could not push out.

    Further investigation indicated that it was trying to push out inode
    buffers that were pinned and/or locked. The xfsbufd was also getting
    woken at the same frequency (by the xfsaild, no doubt) to push out
    delayed write buffers. The xfsbufd was not making any progress
    because all the buffers in the delwri queue were pinned. This
    scan-and-make-no-progress dance went on in the trace for some
    seconds, before the xfssyncd came along and issued a log force, and
    then things started going again.

    However, I noticed something strange about the log force - there
    were way too many IOs issued. 516 log buffers were written, to be
    exact. That added up to 129MB of log IO, which got me very
    interested because it's almost exactly 25% of the size of the log.
    The delayed logging code is supposed to aggregate the minimum of 25%
    of the log or 8MB worth of changes before flushing. That's what
    really puzzled me - why did a log force write 129MB instead of only
    8MB?

    Essentially what has happened is that no CIL pushes had occurred
    since the previous tail push which cleared out 25% of the log space.
    That caused all the new transactions to block because there wasn't
    log space for them, but they kick the xfsaild to push the tail.
    However, the xfsaild was not making progress because there were
    buffers it could not lock and flush, and the xfsbufd could not flush
    them because they were pinned. As a result, both the xfsaild and the
    xfsbufd could not move the tail of the log forward without the CIL
    first committing.

    The cause of the problem was that the background CIL push, which
    should happen when 8MB of aggregated changes have been committed, is
    being held off by the concurrent transaction commit load. The
    background push does a down_write_trylock() which will fail if there
    is a concurrent transaction commit holding the push lock in read
    mode. With 8 CPUs all doing transactions as fast as they can, there
    were enough concurrent transaction commits to hold off the
    background push until tail-pushing could no longer free log space,
    and the halt would occur.

    It should be noted that there is no reason why it would halt at 25%
    of log space used by a single CIL checkpoint. This bug could
    definitely violate the "no transaction should be larger than half
    the log" requirement and hence result in corruption if the system
    crashed under heavy load. This sort of bug is exactly the reason why
    delayed logging was tagged as experimental....

    The fix is to start blocking background pushes once the threshold
    has been exceeded. Rework the threshold calculations to keep the
    amount of log space a CIL checkpoint can use to below that of the
    AIL push threshold to avoid the problem completely.
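
    The threshold rule quoted above, and the decision to push in
    blocking mode once it is exceeded, can be sketched as follows. This
    is only an illustration: the function and enum names are
    hypothetical, and because the message does not give the reworked
    threshold values, the sketch uses the "minimum of 25% of the log or
    8MB" figure from the description rather than whatever the patch
    actually computes.

```c
#include <stdint.h>

/* 8MB aggregation cap, taken from the commit message. */
#define CIL_MAX_AGGREGATE (8ULL << 20)

/* Hypothetical names; not the identifiers used in the XFS source. */
enum push_mode {
    PUSH_NONE,      /* below threshold: keep aggregating in memory */
    PUSH_BLOCKING   /* at or over threshold: wait for the push lock */
};

/* Threshold as described: min(25% of the log, 8MB), in bytes. */
static uint64_t cil_push_threshold(uint64_t log_bytes)
{
    uint64_t quarter = log_bytes / 4;

    return quarter < CIL_MAX_AGGREGATE ? quarter : CIL_MAX_AGGREGATE;
}

/*
 * Once the aggregated changes reach the threshold, the push must not be
 * skipped just because committers hold the lock in read mode, so the
 * caller should take the lock in blocking mode instead of using trylock.
 */
static enum push_mode background_push_mode(uint64_t cil_bytes,
                                           uint64_t log_bytes)
{
    return cil_bytes >= cil_push_threshold(log_bytes) ? PUSH_BLOCKING
                                                      : PUSH_NONE;
}
```

    With a trylock-only policy, the 129MB case above would never have
    been pushed while commits kept arriving; with the rule sketched
    here it is pushed as soon as the threshold is crossed.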

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>

-----------------------------------------------------------------------

Summary of changes:
 fs/xfs/linux-2.6/xfs_buf.c     |  200 +++++++++++---------
 fs/xfs/linux-2.6/xfs_buf.h     |   50 +++---
 fs/xfs/linux-2.6/xfs_ioctl.c   |    2 +-
 fs/xfs/linux-2.6/xfs_iops.c    |   35 ----
 fs/xfs/linux-2.6/xfs_super.c   |   15 +-
 fs/xfs/linux-2.6/xfs_sync.c    |  413 +++++++++++++++++++++++----------------
 fs/xfs/linux-2.6/xfs_sync.h    |    4 +-
 fs/xfs/linux-2.6/xfs_trace.h   |    4 +-
 fs/xfs/quota/xfs_qm_syscalls.c |   14 +--
 fs/xfs/xfs_ag.h                |    9 +
 fs/xfs/xfs_attr.c              |   31 +--
 fs/xfs/xfs_buf_item.c          |    3 +-
 fs/xfs/xfs_fsops.c             |   11 +-
 fs/xfs/xfs_inode.h             |    1 -
 fs/xfs/xfs_inode_item.c        |    9 -
 fs/xfs/xfs_log.c               |    3 +-
 fs/xfs/xfs_log_cil.c           |  244 +++++++++++++-----------
 fs/xfs/xfs_log_priv.h          |   37 ++--
 fs/xfs/xfs_log_recover.c       |   19 +-
 fs/xfs/xfs_mount.c             |  152 ++++++++-------
 fs/xfs/xfs_mount.h             |    2 +
 fs/xfs/xfs_rename.c            |   12 +-
 fs/xfs/xfs_rtalloc.c           |   29 ++--
 fs/xfs/xfs_trans.h             |    1 +
 fs/xfs/xfs_trans_inode.c       |   30 +++
 fs/xfs/xfs_utils.c             |    4 +-
 fs/xfs/xfs_vnodeops.c          |   23 ++-
 27 files changed, 732 insertions(+), 625 deletions(-)


hooks/post-receive
--
XFS development tree

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs