On Thu, Nov 10, 2011 at 10:11:33AM +1100, Dave Chinner wrote: > You're making the assumption that log recovery has done the correct > thing any only replayed entire unlink transactions and hence the > filesystem is otherwise consistent (i.e that there are no other > references). I think that's a bad assumption - there's no guarantee > that the unlinked list only contains unreferenced inodes if there's > been corruption and/or log replay was not able to be run. We add inodes to the uncertain list if any of the following applies a) are found in an inode btree record reachable from the root in phase2, but they are suspect based on certain factors - else we add them to the inode tree directly. b) are found on the unlinked inodes list in phase3 c) a directory found in an reachable inode btree record points to them in phase3 so any inodes that either has a link pointing to it, or an inode allocation btree record pointing to it will still be added to the uncertain inode list if they aren't on the actual inode btree yet. Then later in phase3 we move all uncertain inodes that appear fine back into the main inode record tree. > I also think there's more to it than that. The walk of the inode list > also marks all the blocks in the block map as containing inodes, and > all the blocks still used by those inodes as data/bmap/attr types. > This change removes that, so we're going to potentially lose that > state if all the inodes in a block are on the unlinked list. We still do that walk if we have any genuine reference to the inode. If we don't have any reference but the unlinked list they can be considered free - we'd free every ressources assoicated with them on log recovery anyway. > Hence we'll end up with blocks containing inodes that are still > marked as used in the AGINO btree, but are marked as free space in > the block map. They aren't. We completely rebuild both the inode allocation and space allocation bitmaps from the information we gather in the earlier repair phases, and they will be in sync. > the AG or filesystem (e.g. agi->agi_count - agi->agi_free). Yes, it will > still spin for some time on this sort of corruption, but it won't > get stuck, and it won't add new holes into our block/inode usage > tracking... This would basically take forever with thinkgs like Arek's filesystem with almost 11 million inodes in each AG. _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs