This is an automated email from the git hooks/post-receive script. It was generated because a ref change was pushed to the repository containing the project "XFS development tree". The branch, for-linus has been updated d69043c xfs: drop buffer io reference when a bad bio is built 3daed8b xfs: fix broken error handling in xfs_vm_writepage 42e2976 xfs: fix attr tree double split corruption from 6ce377afd1755eae5c93410ca9a1121dfead7b87 (commit) Those revisions listed above that are new to this repository have not appeared on any other notification email; so we list those revisions in full, below. - Log ----------------------------------------------------------------- commit d69043c42d8c6414fa28ad18d99973aa6c1c2e24 Author: Dave Chinner <dchinner@xxxxxxxxxx> Date: Mon Nov 12 22:09:46 2012 +1100 xfs: drop buffer io reference when a bad bio is built Error handling in xfs_buf_ioapply_map() does not handle IO reference counts correctly. We increment the b_io_remaining count before building the bio, but then fail to decrement it in the failure case. This leads to the buffer never running IO completion and releasing the reference that the IO holds, so at unmount we can leak the buffer. This leak is captured by this assert failure during unmount: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273 This is not a new bug - the b_io_remaining accounting has had this problem for a long, long time - it's just very hard to get a zero length bio being built by this code... Further, the buffer IO error can be overwritten on a multi-segment buffer by subsequent bio completions for partial sections of the buffer. Hence we should only set the buffer error status if the buffer is not already carrying an error status. This ensures that a partial IO error on a multi-segment buffer will not be lost. This part of the problem is a regression, however. cc: <stable@xxxxxxxxxxxxxxx> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Reviewed-by: Mark Tinguely <tinguely@xxxxxxx> Signed-off-by: Ben Myers <bpm@xxxxxxx> commit 3daed8bc3e49b9695ae931b9f472b5b90d1965b3 Author: Dave Chinner <dchinner@xxxxxxxxxx> Date: Mon Nov 12 22:09:45 2012 +1100 xfs: fix broken error handling in xfs_vm_writepage When we shut down the filesystem, it might first be detected in writeback when we are allocating a inode size transaction. This happens after we have moved all the pages into the writeback state and unlocked them. Unfortunately, if we fail to set up the transaction we then abort writeback and try to invalidate the current page. This then triggers are BUG() in block_invalidatepage() because we are trying to invalidate an unlocked page. Fixing this is a bit of a chicken and egg problem - we can't allocate the transaction until we've clustered all the pages into the IO and we know the size of it (i.e. whether the last block of the IO is beyond the current EOF or not). However, we don't want to hold pages locked for long periods of time, especially while we lock other pages to cluster them into the write. To fix this, we need to make a clear delineation in writeback where errors can only be handled by IO completion processing. That is, once we have marked a page for writeback and unlocked it, we have to report errors via IO completion because we've already started the IO. We may not have submitted any IO, but we've changed the page state to indicate that it is under IO so we must now use the IO completion path to report errors. To do this, add an error field to xfs_submit_ioend() to pass it the error that occurred during the building on the ioend chain. When this is non-zero, mark each ioend with the error and call xfs_finish_ioend() directly rather than building bios. This will immediately push the ioends through completion processing with the error that has occurred. Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Reviewed-by: Mark Tinguely <tinguely@xxxxxxx> Signed-off-by: Ben Myers <bpm@xxxxxxx> commit 42e2976f131d65555d5c1d6c3d47facc63577814 Author: Dave Chinner <dchinner@xxxxxxxxxx> Date: Mon Nov 12 22:09:44 2012 +1100 xfs: fix attr tree double split corruption In certain circumstances, a double split of an attribute tree is needed to insert or replace an attribute. In rare situations, this can go wrong, leaving the attribute tree corrupted. In this case, the attr being replaced is the last attr in a leaf node, and the replacement is larger so doesn't fit in the same leaf node. When we have the initial condition of a node format attribute btree with two leaves at index 1 and 2. Call them L1 and L2. The leaf L1 is completely full, there is not a single byte of free space in it. L2 is mostly empty. The attribute being replaced - call it X - is the last attribute in L1. The way an attribute replace is executed is that the replacement attribute - call it Y - is first inserted into the tree, but has an INCOMPLETE flag set on it so that list traversals ignore it. Once this transaction is committed, a second transaction it run to atomically mark Y as COMPLETE and X as INCOMPLETE, so that a traversal will now find Y and skip X. Once that transaction is committed, attribute X is then removed. So, the initial condition is: +--------+ +--------+ | L1 | | L2 | | fwd: 2 |---->| fwd: 0 | | bwd: 0 |<----| bwd: 1 | | fsp: 0 | | fsp: N | |--------| |--------| | attr A | | attr 1 | |--------| |--------| | attr B | | attr 2 | |--------| |--------| .......... .......... |--------| |--------| | attr X | | attr n | +--------+ +--------+ So now we go to replace X, and see that L1:fsp = 0 - it is full so we can't insert Y in the same leaf. So we record the the location of attribute X so we can track it for later use, then we split L1 into L1 and L3 and reblance across the two leafs. We end with: +--------+ +--------+ +--------+ | L1 | | L3 | | L2 | | fwd: 3 |---->| fwd: 2 |---->| fwd: 0 | | bwd: 0 |<----| bwd: 1 |<----| bwd: 3 | | fsp: M | | fsp: J | | fsp: N | |--------| |--------| |--------| | attr A | | attr X | | attr 1 | |--------| +--------+ |--------| | attr B | | attr 2 | |--------| |--------| .......... .......... |--------| |--------| | attr W | | attr n | +--------+ +--------+ And we track that the original attribute is now at L3:0. We then try to insert Y into L1 again, and find that there isn't enough room because the new attribute is larger than the old one. Hence we have to split again to make room for Y. We end up with this: +--------+ +--------+ +--------+ +--------+ | L1 | | L4 | | L3 | | L2 | | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 | | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 | | fsp: M | | fsp: J | | fsp: J | | fsp: N | |--------| |--------| |--------| |--------| | attr A | | attr Y | | attr X | | attr 1 | |--------| + INCOMP + +--------+ |--------| | attr B | +--------+ | attr 2 | |--------| |--------| .......... .......... |--------| |--------| | attr W | | attr n | +--------+ +--------+ And now we have the new (incomplete) attribute @ L4:0, and the original attribute at L3:0. At this point, the first transaction is committed, and we move to the flipping of the flags. This is where we are supposed to end up with this: +--------+ +--------+ +--------+ +--------+ | L1 | | L4 | | L3 | | L2 | | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 | | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 | | fsp: M | | fsp: J | | fsp: J | | fsp: N | |--------| |--------| |--------| |--------| | attr A | | attr Y | | attr X | | attr 1 | |--------| +--------+ + INCOMP + |--------| | attr B | +--------+ | attr 2 | |--------| |--------| .......... .......... |--------| |--------| | attr W | | attr n | +--------+ +--------+ But that doesn't happen properly - the attribute tracking indexes are not pointing to the right locations. What we end up with is both the old attribute to be removed pointing at L4:0 and the new attribute at L4:1. On a debug kernel, this assert fails like so: XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725 because the new attribute location does not exist. On a production kernel, this goes unnoticed and the code proceeds ahead merrily and removes L4 because it thinks that is the block that is no longer needed. This leaves the hash index node pointing to entries L1, L4 and L2, but only blocks L1, L3 and L2 to exist. Further, the leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free space, and so everything is busted. This corruption is caused by the removal of the old attribute triggering a join - it joins everything correctly but then frees the wrong block. xfs_repair will report something like: bad sibling back pointer for block 4 in attribute fork for inode 131 problem with attribute contents in inode 131 would clear attr fork bad nblocks 8 for inode 131, would reset to 3 bad anextents 4 for inode 131, would reset to 0 The problem lies in the assignment of the old/new blocks for tracking purposes when the double leaf split occurs. The first split tries to place the new attribute inside the current leaf (i.e. "inleaf == true") and moves the old attribute (X) to the new block. This sets up the old block/index to L1:X, and newly allocated block to L3:0. It then moves attr X to the new block and tries to insert attr Y at the old index. That fails, so it splits again. With the second split, the rebalance ends up placing the new attr in the second new block - L4:0 - and this is where the code goes wrong. What is does is it sets both the new and old block index to the second new block. Hence it inserts attr Y at the right place (L4:0) but overwrites the current location of the attr to replace that is held in the new block index (currently L3:0). It over writes it with L4:1 - the index we later assert fail on. Hopefully this table will show this in a foramt that is a bit easier to understand: Split old attr index new attr index vanilla patched vanilla patched before 1st L1:26 L1:26 N/A N/A after 1st L3:0 L3:0 L1:26 L1:26 after 2nd L4:0 L3:0 L4:1 L4:0 ^^^^ ^^^^ wrong wrong The fix is surprisingly simple, for all this analysis - just stop the rebalance on the out-of leaf case from overwriting the new attr index - it's already correct for the double split case. Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Reviewed-by: Mark Tinguely <tinguely@xxxxxxx> Signed-off-by: Ben Myers <bpm@xxxxxxx> ----------------------------------------------------------------------- Summary of changes: fs/xfs/xfs_aops.c | 54 ++++++++++++++++++++++++++++++++++-------------- fs/xfs/xfs_attr_leaf.c | 20 ++++++++++++++++-- fs/xfs/xfs_buf.c | 14 +++++++++++-- 3 files changed, 69 insertions(+), 19 deletions(-) hooks/post-receive -- XFS development tree _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs