On Tue, Sep 15, 2009 at 10:53:37AM -0400, Theodore Tso wrote:
> On Tue, Sep 15, 2009 at 05:57:24PM +0800, Zhang Huan wrote:
> > Hi all,
> >
> > I'm reading the EXT4 code and have some questions about the readdir
> > implementation.
> >
> > Why traverse the directory in hash order? This brings in a lot of
> > code to build and traverse a red-black tree. Why not just plainly
> > traverse the directory's blocks?
> >
> > Since the red-black tree is built every time an NFS readdir request
> > comes in, in the case of a hash collision the NFS client may receive
> > duplicate directory entries if the buffer is not large enough to
> > return all entries with the same hash value at once.
>
> The reason why we have to do this is because of the telldir() and
> seekdir() interfaces. NFS is implicated in this as well, because it
> traverses the directory using a 32-bit offset on the protocol wire.
>
> The fundamental problem is that they presume the directory is stored
> as a linear array. For file systems which use some kind of B-tree or
> other non-linear data structure to speed up directory lookups, the
> problem is what to do when we need to split a B-tree leaf. When this
> happens, half of the directory entries in that B-tree leaf must be
> moved to a new directory block. However, readdir() must still return
> every single entry in the directory once and exactly once; and this
> must be true even if telldir()/seekdir() is used, and even if NFS
> decides to wait for several minutes or hours between reading the
> first set of directory entries and reading the next set, sending the
> NFS server nothing but the 32-bit offset token.
>
> It is for this reason that we must traverse the directory in hash
> order, so that directory entries are returned once and only once even
> in the face of node splits.
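[As a toy illustration of the point above (not the actual ext4 code): if the resume cookie is a hash value rather than a physical position, iteration can continue correctly even after a leaf split reshuffles the physical order of the entries. The `dentry`/`next_after` names below are made up for this sketch.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy directory entry: identified by a hash; the physical order of
 * entries may be reshuffled by a leaf split, but hash order is stable. */
struct dentry {
	uint32_t hash;
	const char *name;
};

/* Return the entry with the smallest hash strictly greater than
 * `cookie`, or NULL when the listing is complete.  This mimics resuming
 * readdir from a hash-based telldir()/NFS cookie: it depends only on
 * the hash, never on where the entry physically lives. */
static const struct dentry *next_after(const struct dentry *d, int n,
				       uint32_t cookie)
{
	const struct dentry *best = NULL;

	for (int i = 0; i < n; i++)
		if (d[i].hash > cookie && (!best || d[i].hash < best->hash))
			best = &d[i];
	return best;
}
```

A linear block offset would break here: after the split moves half the entries to a new block, a saved physical offset could point past entries that moved behind it (skipping them) or before entries already returned (duplicating them). The hash cookie partitions the directory into "already returned" and "not yet returned" regardless of physical layout.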
> We try very hard to avoid hash collisions, including using a keyed
> hash which is kept secret to prevent users from deliberately trying
> to trigger the failure mode.
>
> Looking more closely at what we have done, we should be able to do
> better on architectures where off_t is 64 bits. Internally, we use a
> 64-bit hash, but we only put the high 32 bits of the hash in
> f_offset, because we can't tell whether the telldir() or telldir64()
> call might be used by an application, and we can't tell whether the
> NFS server is speaking NFSv2 (where the readdir cookie is only 32
> bits) or NFSv3 (where the readdir cookie is 64 bits). On platforms
> where off_t is 64 bits, we could store the full 64-bit hash value in
> f_offset, but we would swap the low and high 32-bit values, so that
> the 32 LSB are in the high 32 bits of f_offset and vice versa.
>
> The NFSv2 server would still get a 64-bit f_offset, but it currently
> doesn't do any range checking before dropping it into the 32-bit
> cookie, so the high 32 bits would get truncated. This would result in
> the same behaviour we have today for those poor benighted souls who
> are still using NFSv2, but allow for better behavior for NFSv3, at
> least on 64-bit platforms.

Thanks for replying to explain it in such detail. So there isn't much
we can do before we improve nfsd's offset range checking.

> So this is something we could do in the future. In practice, no one
> has complained about this as far as NFS is concerned, so it's not a
> high priority for me to pursue. Were you worried about this as a
> practical matter, or as a theoretical one?

Oh, I was asking because someone I work with told me he has seen this
once before. He was testing a proprietary Windows NFSv3 client at the
time, using EXT3 as the backend filesystem. I did not know it was a
hash collision that caused the problem until I spent some time reading
the EXT3 code.
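[The high/low swap Ted describes can be sketched as follows. This is a hypothetical encoding of his proposal, not the actual ext4 `hash2pos()` code; the function names are made up. The key property is that truncating the 64-bit offset to a 32-bit NFSv2 cookie still yields the major hash, i.e. exactly today's behaviour.]

```c
#include <assert.h>
#include <stdint.h>

/* Today's 32-bit-offset case: only the major (high) 32 bits of the
 * internal 64-bit hash fit in the cookie; the minor hash is lost,
 * which is what makes collisions visible to NFS clients. */
static uint64_t hash2pos32(uint32_t major, uint32_t minor)
{
	(void)minor;		/* the minor hash cannot be represented */
	return major;
}

/* Proposed 64-bit-offset encoding: store the full hash, but with the
 * halves swapped, so the 32 LSB of the hash (the minor hash) land in
 * the high 32 bits of f_offset and the major hash in the low 32 bits.
 * A 32-bit consumer (e.g. an NFSv2 cookie) truncates the high half
 * away and keeps the major hash -- the same behaviour as today. */
static uint64_t hash2pos64(uint32_t major, uint32_t minor)
{
	return ((uint64_t)minor << 32) | major;
}
```

The design point is backward compatibility: the swap makes 32-bit truncation lossy in exactly the way it already is, while a 64-bit-aware NFSv3 server can round-trip the full hash and never hand back a colliding cookie.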
> Regards,
>
> 					- Ted

-- 
Zhang Huan