On Sep 15, 2009 17:57 +0800, Zhang Huan wrote:
> I'm reading the EXT4 code and have some questions about the readdir
> implementation.
>
> Why traverse the directory in hash order? This brings lots of code to
> build and traverse a red-black tree. Why not just plainly traverse the
> directory's blocks?

Because htree does not maintain a constant ordering of directory entries.
Consider a readdir running at the same time as files are being added:

    readdir proc                        create proc

    read block 0
    read block 1
                                        create new file
                                        hash of filename puts entry
                                            into block 0
                                        block 0 is full, split it
                                        add new block 3
                                        copy 1/2 of block 0 entries
                                            to block 3
                                        add new filename to block 0
    read block 2
    read block 3

When the readdir returns block 3, half of the entries in that block are
the same ones that were already returned with the original block 0 read.
This is a violation of POSIX readdir() semantics, which require that each
existing entry be returned only a single time (it doesn't matter whether
the new filename is returned or not).

> Since the red-black tree is built every time an NFS readdir request
> comes in, in case of hash collision, the NFS client may receive
> duplicate dir entries if the buffer is not large enough to return all
> entries with the same hash value at once.

This is because NFSv2 only has a 32-bit cookie value, and if there is a
whole buffer full of entries with the same 32-bit hash there isn't
anything that can be done about it except drop the extra duplicates (a
very rare case).  Using the plain byte offset in the directory as the
cookie instead would have a much more serious problem: any directory
larger than 2^32 bytes in size would have all entries beyond 2GB lost.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
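
PS: here is a minimal userspace sketch of the split race in the timeline
above.  The block size, entry names, and helper functions are all made
up for illustration -- this is not ext4 code, and the concurrency is
flattened into a single thread with no locking:

/*
 * Toy model of block-order readdir racing with an htree block split.
 * Four entries fit per "block"; splitting a block moves the upper
 * half of its entries into a freshly allocated block.
 */
#include <stdio.h>
#include <string.h>

#define MAX_BLOCKS	4
#define PER_BLOCK	4

static char blocks[MAX_BLOCKS][PER_BLOCK][16];
static int  nentries[MAX_BLOCKS];
static int  nblocks = 3;

static void add_entry(int blk, const char *name)
{
	strcpy(blocks[blk][nentries[blk]++], name);
}

static void split_block(int blk)
{
	int half = nentries[blk] / 2;

	/* Copy the upper half of the entries to a new block. */
	for (int i = half; i < nentries[blk]; i++)
		add_entry(nblocks, blocks[blk][i]);
	nentries[blk] = half;
	nblocks++;
}

static void read_block(int blk)
{
	for (int i = 0; i < nentries[blk]; i++)
		printf("block %d: %s\n", blk, blocks[blk][i]);
}

int main(void)
{
	/* Initial state: block 0 is full, blocks 1 and 2 are not. */
	add_entry(0, "a"); add_entry(0, "b");
	add_entry(0, "c"); add_entry(0, "d");
	add_entry(1, "e");
	add_entry(2, "f");

	read_block(0);		/* readdir returns a, b, c, d       */
	read_block(1);		/* readdir returns e                */

	split_block(0);		/* create races in: c and d move    */
	add_entry(0, "new");	/*   to new block 3, "new" goes     */
				/*   into the half-empty block 0    */

	read_block(2);		/* readdir returns f                */
	read_block(3);		/* readdir returns c and d AGAIN    */
	return 0;
}

Running it prints "c" and "d" a second time when block 3 is read, which
is exactly the duplication the timeline shows.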
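
To make the 32-bit cookie point concrete, here is a sketch of why such
a cookie cannot resume in the middle of a run of colliding entries.  The
packing below only illustrates the idea of encoding the htree
(major, minor) hash pair into a readdir cookie; the actual encoding in
the kernel differs in detail:

#include <stdint.h>
#include <stdio.h>

/* 64-bit cookie: major hash in the top half, minor hash below it. */
static uint64_t cookie64(uint32_t major, uint32_t minor)
{
	return ((uint64_t)major << 32) | minor;
}

/* 32-bit cookie (NFSv2): only room for the major hash. */
static uint32_t cookie32(uint32_t major, uint32_t minor)
{
	(void)minor;		/* the minor hash is simply lost */
	return major;
}

int main(void)
{
	/* Two entries whose names collide in the major hash. */
	uint32_t major = 0xdeadbeef;

	printf("64-bit cookies: %llx %llx\n",
	       (unsigned long long)cookie64(major, 1),
	       (unsigned long long)cookie64(major, 2));
	printf("32-bit cookies: %x %x\n",
	       cookie32(major, 1), cookie32(major, 2));
	return 0;
}

With 64 bits the minor hash still distinguishes the two entries, so a
readdir can resume exactly where it left off.  With 32 bits both entries
produce the same cookie, so if a buffer boundary falls inside the
collision run the server cannot tell how many of those entries were
already sent -- hence the duplicates (or drops) described above.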