On Sat, Apr 19, 2008 at 10:56 PM, Theodore Tso <tytso@xxxxxxx> wrote:
> On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote:
> > If it is a block containing a metadata object fsck has already read,
> > then we already know what kind of object it is (there must be a way
> > to quickly find all cached objects derived from a given block), and
> > can update the cached version. And if fsck has not yet read the
> > block, it can just be ignored, no matter what kind of data it
> > contains. If it contains metadata and fsck is interested in it, it
> > will read it sooner or later anyway. If it contains file data, why
> > should fsck even care?
>
> The problem is that e2fsck makes calculations on the filesystem data
> read out from the disk and stores that in a highly compressed format.
> So it doesn't remember that block #12345 was an indirect block for
> inode #123, and that it contained data block numbers 17, 42, and 45.
> Instead it just marks blocks #12345, #17, #42, and #45 as in use, and
> then moves on.
>
> If you are going to store all of the cached objects then you will need
> to effectively store *all* of the filesystem metadata in memory at
> the same time. For a large filesystem, you won't have enough *room*
> in memory to store all of the cached objects. That's one of the reasons
> why e2fsck has a lot of very clever design so that summary information
> can be stored in a very compressed form in memory so that things can
> be fast (by avoiding re-reading objects from disk) as well as not
> requiring vast amounts of memory.
>

Yes, I agree on this problem. Do you have any estimates on how much RAM
the current e2fsck uses in some test cases? I hope my approach will not
add much to this. The only big thing I see is the data needed to
associate each inode/dir entry with the parent block.
Probably one radix tree to enumerate the blocks and a pointer added to
the ext2_inode and ext2_dir_entry structures to form a linked list of
objects belonging to the same block. I still have no idea how much RAM
the whole thing would consume.

> Even if you *do* store all of the cached objects, it still takes time
> to examine all of the objects and in the meantime, more changes will
> have come rolling in, and you will either need to add a huge amount of
> dependency to figure out what internal data structures need to be
> updated based on the changes in some of the cached objects --- or you
> will end up restarting the e2fsck checking process from scratch.
>

Not really. In my application I propose some changes to the fsck pass
order to avoid the need to rerun it. And I don't get what dependency you
are talking about. The only one I see is between the directory entries
and the directory inode. That should not be hard to solve. (Or am I
missing something? Could you give more examples, maybe?)

> In either case, there is still the issue of knowing exactly whether a
> particular read happened before or after some change in the
> filesystem. This race condition is a really hard one to deal with,
> especially on a multiple CPU system and the filesystem checker is
> running in userspace.

I don't see why fsck should care about this. The notification is
always sent after the write happened, so fsck should just re-read the
data. No problem if it already read the (half-)updated version just
before the notification.

Btw, how about an even simpler method: just watch the journal commits
(changes to jbd needed). This way we can get all actual metadata
updates, without being flooded by the file data updates.

>
> > But you are probably right, this project may not be doable in just
> > three months. The changes on the kernel side probably are, but there
> > is a huge amount of e2fsck work.
>
> Yes, that is the concern.
> And without implementing the user-space
> side, you'll never be sure whether you completely got the kernel side
> changes right!
>
> Regards,
>
> - Ted
>
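
To make the block-to-object association discussed above a bit more
concrete, here is a rough sketch of what such an index might look like.
It is only an illustration under stated assumptions: it uses a small
chained hash table in place of the radix tree mentioned in the reply,
and every name in it (cached_obj, block_index, index_add,
index_invalidate) is hypothetical and does not exist in e2fsprogs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 1024u  /* power of two; arbitrary size for this sketch */

enum obj_kind { OBJ_INODE, OBJ_DIRENT };

/* A cached metadata object (hypothetical; not an e2fsprogs structure). */
struct cached_obj {
    enum obj_kind kind;
    uint32_t id;                      /* e.g. inode number */
    int stale;                        /* set when the backing block changed */
    struct cached_obj *next_in_block; /* next object read from the same block */
};

/* All cached objects derived from one on-disk block. */
struct block_entry {
    uint64_t blk;
    struct cached_obj *objs;
    struct block_entry *next;         /* hash-bucket chain */
};

struct block_index {
    struct block_entry *buckets[NBUCKETS];
};

struct block_entry *index_find(struct block_index *idx, uint64_t blk)
{
    struct block_entry *e;

    assert(idx != NULL);
    for (e = idx->buckets[blk & (NBUCKETS - 1)]; e != NULL; e = e->next)
        if (e->blk == blk)
            return e;
    return NULL;
}

/* Record that obj was parsed out of block blk. Returns 0, or -1 on OOM. */
int index_add(struct block_index *idx, uint64_t blk, struct cached_obj *obj)
{
    struct block_entry *e = index_find(idx, blk);

    if (e == NULL) {
        e = calloc(1, sizeof(*e));
        if (e == NULL)
            return -1;
        e->blk = blk;
        e->next = idx->buckets[blk & (NBUCKETS - 1)];
        idx->buckets[blk & (NBUCKETS - 1)] = e;
    }
    obj->next_in_block = e->objs;
    e->objs = obj;
    return 0;
}

/*
 * Handle a "block blk was written" notification: mark every cached object
 * from that block stale so it gets re-read. A block fsck has not read yet
 * has no entry and is simply ignored. Returns the number of objects marked.
 */
unsigned index_invalidate(struct block_index *idx, uint64_t blk)
{
    struct block_entry *e = index_find(idx, blk);
    struct cached_obj *o;
    unsigned n = 0;

    for (o = e ? e->objs : NULL; o != NULL; o = o->next_in_block) {
        o->stale = 1;
        n++;
    }
    return n;
}
```

In this sketch the extra memory is one chain pointer per cached object
plus one block_entry per block that fsck has read metadata from. It does
not address the larger cost raised in the quoted text, which is keeping
the cached objects themselves in memory.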