On Wed, Mar 14, 2012 at 4:12 PM, Lukas Czerner <lczerner@xxxxxxxxxx> wrote: > On Tue, 13 Mar 2012, Ted Ts'o wrote: > >> On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote: >> > >> > I think a format change would be preferable to runtime sorting. >> >> Are you volunteering to spearhead the design and coding of such a >> thing? Run-time sorting is backwards compatible, and a heck of a lot >> easier to code and test... > > Actually I am in the process of creating some projects for the local > universities and this certainly sounds like something that some student > can do. I have already put that into the proposal, so we might have > someone looking into this in the near future. Since the problem has been > there since forever, I think that it's not that critical to solve it > right away. > > I kind of like the idea about having the separate btree with inode > numbers for the directory reading, just because it does not affect > allocation policy nor the write performance which is a good thing. Also > it has been done before in btrfs and it works very well for them. The > only downside (aside from format change) is the extra step when removing > the directory entry, but the positives outperform the negatives. > > > Maybe this might be even done in a way which does not require format > change. We can have new inode flag (say EXT4_INUM_INDEX_FL) which will > tell us that there is a inode number btree to use on directory reads. > Then the pointer to the root of that tree would be stored at the end of > the root block of the hree (of course if there is still some space for > it) and the rest is easy. dx_root takes 32 bytes from 60 bytes of i_data. The beginning of dx_entries is controlled by info_length, that is, dx_root->info + dx_root->info->info_length. It seems that we can put the new tree root behind dx_root and add the length of the new tree root to info_length. Then the old code can handle new filesystem and new code can handle old filesystem because of the flag you described above. If no one attempts to do it , I can do it as a google summer of code project. Yongqiang. > > So If we mount the old file system with the new code, there will not be > the flag, but with newly added directories it could be used anyway. Also > e2fsck could easily add the inumber tree into the directory if it is > missing. > > If we mount the new file system with the old code (the one which does > not know about this feature), everything would stay the same and we > would not touch the inumber tree at all. Of course there is a > possibility that we would rewrite the end of the root htree, overwriting > the pointer to the inumber tree and effectively losing one block. We > would not actually care that we rewrote this information because we > do not actually need it, so the only downside is the one block lost, > which is not that bad I guess. > > Now, we have rewritten the pointer to the inumber tree and we're going > to mount it with the new code again. It would expect the the inumber > tree to be there, but not blindly. Some king of magic information would > have to be stored in the inumber root pointer so we can be sure that it > really is what we're looking for and if it is not there, well then we do > not care and walk the directory in the old way. And we can alway create > the inumber tree in the new directory. And again e2fsck should be able > to fix that for us as well. > > So that is just an idea, I have not looked at the actual code to see it > it would be really possible, but it certainly seems like so. Am I > missing something ? > > > Anyway, if there is going to be a format change I agree that having > solution for those who are not comfortable with the format change is > equally as important. But maybe there will be better solution which does > not require the format change. > > Thanks! > -Lukas > >> >> The reality is we'd probably want to implement run-time sorting >> *anyway*, for the vast majority of people who don't want to convert to >> a new incompatible file system format. (Even if you can do the >> conversion using e2fsck --- which is possible, but it would be even >> more code to write --- system administrators tend to be very >> conservative about such things, since they might need to boot an older >> kernel, or use a rescue CD that doesn't have an uptodate kernel or >> file system utilities, etc.) >> >> > So the index nodes contain the hash ranges for the leaf block, but >> > the leaf block only contains the regular directory entries, not a >> > hash for each name? That would mean that adding or removing names >> > would require moving around the regular directory entries wouldn't >> > it? >> >> They aren't sorted in the leaf block, so we only need to move around >> regular directory entries when we do a node split (and at the moment >> we don't support shrinking directories), so we don't have to worry the >> reverse case. >> >> > I would think that hash collisions are rare enough that reading a >> > directory block you end up not needing once in a blue moon would be >> > chalked up under "who cares". So just stick with hash, offset pairs >> > to map the hash to the normal directory entry. >> >> With a 64-bit hash, and if we were actually going to implement this as >> a new incompatible feature, you're probably right in terms of >> accepting the extra directory block search. >> >> We would still have to implement the case where hash collisions *do* >> exist, though, and make sure the right thing happens in that case. >> Even if the chance of that happening is 1 in 2**32, with enough >> deployed systems (i.e., every Android handset, etc.) it's going to >> happen in real life. >> >> - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Best Wishes Yongqiang Yang -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html