Re: [PATCH v3 14/20] xfsprogs/repair: phase 2 finobt scan

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 23 Apr 2014 20:39:05 -0400

On Thu, Apr 24, 2014 at 09:06:43AM +1000, Dave Chinner wrote:
> On Wed, Apr 23, 2014 at 04:01:52PM -0400, Brian Foster wrote:
> > On Wed, Apr 23, 2014 at 04:19:45PM +1000, Dave Chinner wrote:
> > > On Thu, Apr 10, 2014 at 12:11:04PM -0400, Brian Foster wrote:
> > > > If one exists, scan the free inode btree in phase 2 of xfs_repair.
> > > > We use the same general infrastructure as for the inobt scan, but
> > > > trigger finobt chunk scan logic in in scan_inobt() via the magic
> > > > value.
> > > > 
> > > > The new scan_single_finobt_chunk() function is similar to the inobt
> > > > equivalent with some finobt specific logic. We can expect that
> > > > underlying inode chunk blocks are already marked used due to the
> > > > previous inobt scan. We can also expect to find every record
> > > > tracked by the finobt already accounted for in the in-core tree
> > > > with equivalent (and internally consistent) inobt record data.
> > > > 
> > > > Spit out a warning on any divergences from the above and add the
> > > > inodes referenced by the current finobt record to the appropriate
> > > > in-core tree.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > ....
> > > > +
> > > > +	if (!suspect) {
> > > > +		/*
> > > > +		 * inodes previously inserted into the uncertain tree should be
> > > > +		 * superceded by these when the uncertain tree is processed
> > > > +		 */
> > > > +		nfree = 0;
> > > > +		if (XFS_INOBT_IS_FREE_DISK(rp, 0)) {
> > > > +			nfree++;
> > > > +			ino_rec = set_inode_free_alloc(mp, agno, ino);
> > > > +		} else  {
> > > > +			ino_rec = set_inode_used_alloc(mp, agno, ino);
> > > > +		}
> > > > +		for (j = 1; j < XFS_INODES_PER_CHUNK; j++) {
> > > > +			if (XFS_INOBT_IS_FREE_DISK(rp, j)) {
> > > > +				nfree++;
> > > > +				set_inode_free(ino_rec, j);
> > > > +			} else  {
> > > > +				set_inode_used(ino_rec, j);
> > > > +			}
> > > > +		}
> > > > +	} else {
> > > > +		/*
> > > > +		 * this should handle the case where the inobt scan may have
> > > > +		 * already added uncertain inodes
> > > > +		 */
> > > > +		nfree = 0;
> > > > +		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> > > > +			if (XFS_INOBT_IS_FREE_DISK(rp, j)) {
> > > > +				add_aginode_uncertain(mp, agno, ino + j, 1);
> > > > +				nfree++;
> > > > +			} else {
> > > > +				add_aginode_uncertain(mp, agno, ino + j, 0);
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +
> > > > +check_freecount:
> > > > +
> > > > +	if (nfree != be32_to_cpu(rp->ir_freecount)) {
> > > > +		do_warn(
> > > > +_("finobt ir_freecount/free mismatch, inode chunk %d/%u, freecount %d nfree %d\n"),
> > > > +			agno, ino, be32_to_cpu(rp->ir_freecount), nfree);
> > > > +	}
> > > > +
> > > > +	if (!nfree) {
> > > > +		do_warn(
> > > > +_("finobt record with no free inodes, inode chunk %d/%u\n"), agno, ino);
> > > > +	}
> > > 
> > > Shouldn't both of these increment suspect?
> > > 
> > 
> > So at this point of the scan, we're processing through finobt leaf
> > blocks. Setting suspect has two side effects that I can see:
> > 
> > 1.) Suppressing the additional, more granular checks for the
> > records/inodes that make up the remainder of the block.
> > 
> > 2.) Setting bad_ino_btree, which causes us to skip phases 6 and 7.
> > 
> > I think my thought process was that side effect #2 was generally too
> > heavy handed for a scenario where the freecount doesn't happen to match
> > or we've failed to remove a fully allocated record from the finobt.
> 
> Yes, that's true.
> 
> > Furthermore, we already increment suspect in this function if either the
> > allocation state of the inode chunk blocks or the allocation state of
> > any individual inode is unexpected based on the inobt state. The
> > additional value of the freecount checks is to catch any kind of
> > intra-record corruption or inter-freecount corruption across the trees,
> > which IIUC would be fixed naturally provided nothing else is wrong
> > (e.g., suspect remains 0) when the trees are regenerated.
> > 
> > In other words, if inode allocation state does differ between the trees,
> > we already do set suspect. If these warnings fire without suspect being
> > set, it means we just need to fix up the freecount and there is no
> > broader tree corruption. Thoughts?
> 
> Seems reasonable. Perhaps a comment explaining this would be a good
> idea - we dont have nearly enough comments in repair explaining how
> it is cross-validating different pieces of metadata...
> 

Indeed. I had to rethink through that a bit myself, and I wrote the
code. ;) So it's not really clear in and of itself. The assumptions
being documented also helps clarify the correctness of the previous
checks, as demonstrated by the bug below. I'll add some comments to help
clear this up.

Brian

> > NOTE:
> > 
> > When looking through this again, I think there is a hole in the
> > allocation state logic in that we don't catch a case where the finobt
> > inode could be allocated and the inobt inode marked as free:
> > 
> > 	if (first_rec) {
> > 		...
> > 
> > 		nfree = 0;
> > 		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> > 			if (XFS_INOBT_IS_FREE_DISK(rp, j)) {
> > 				nfree++;
> > 				if (!suspect && !is_inode_free(first_rec, j))
> > 					suspect++;
> > 			}
> > 			/*
> > 			 * XXX: else if the finobt inode is allocated, confirm
> > 			 * here that the inobt inode is allocated as well!
> > 			 */
> > 		}
> > 		goto check_freecount;
> > 	}
> > 
> > I think something like this might work:
> > 
> > 	if (first_rec) {
> > 		...
> > 
> > 		nfree = 0;
> > 		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> > 			int isfree = XFS_INOBT_IS_FREE_DISK(rp, j);
> > 
> > 			if (isfree) 
> > 				nfree++;
> > 
> > 			if (!suspect &&
> > 			    isfree != is_inode_free(first_rec, j))
> > 				suspect++;
> > 		}
> > 		goto check_freecount;
> > 	}
> 
> Yes, that makes sense - we need to validate in both directions.
> Again, a comment would be a useful reminder here. ;)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs