On Monday 21 July 2008 19:56, Theodore Tso wrote: > Even worse, since if NFS is enabled in any way, shape or form, this > scheme is totally and completely screwed. So we have to the old way > of doing things if NFS is turned on. I didn't realize it would be so difficult to detect telldir coming from NFSD. NFS is the entire reason for all this posturing, so if the strategy fails for NFS then it isn't worth spending any time on it. I wrote generic btree code for ddsnap with arbitrary depth and merge on delete, but was never able to justify the time to port it to HTree during my day job. Props to clusterfs (Andreas!) for stepping up to do this. No doubt a commercial imperative helped achieve focus there. But I will not use HTree for my own filesystem work, I will use a new thing called PHTree, which is a (P)hysically stable variant of HTree, obtained by inserting a new layer of index blocks between the index nodes and the dirent blocks, the "terminal" index blocks. Each terminal index block is populated with { hash : block } pairs, each of which indicates that there is a dirent in <block> with hash <hash>. This requires one extra lookup per operation which is regretable, but it solves the physical stability problem. Directory entries will just stay in the leaf blocks in which they are first created, and only index nodes will ever be split. Because leaf blocks will typically be full instead of 75% full in HTree, the space usage ends up about the same. A more efficient hash algorithm can be used, and deletion and other operations will tend to be faster due to less thrashing of the inode table. Free space management becomes an issue: when dirents are deleted we need to be able to re-use them for new entries. To support efficient detection of available dirents of appropriate size, I will use a lazy method of recording the maximum sized dirent available in any leaf block. The actual largest free direct may be smaller than that, which will be detected when a search fails, causing the lazy limit to be updated. We can safely skip searching for free space in any block for which the lazy max is less than the size we require. One byte is sufficient for the lazy max, so one 4K block is sufficient to keep track of 2^12 * 2^12 bytes worth of directory blocks, a 16 meg directory with about half a million entries. For larger directories this structure becomes a radix tree, with lazy max recorded also at each index pointer for quick free space searching. Mind you, I haven't coded this yet and it will be a good few months before I do, so I don't have performance numbers. I just thought I should mention it now, because I expect it to meet or beat HTree overall in performance while supporting brain damaged NFS semantics in a much nicer way. Hopefully we will find out for sure before Ext4 indexing becomes cast in stone. Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html