On 05/31/2011 07:26 PM, Andreas Dilger wrote:
On 2011-05-31, at 6:35 AM, Ted Ts'o wrote:
On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
Out of interest, did anyone ever benchmark if dirindex provides any
advantages to readdir? And did those benchmarks include the
disadvantages of the present implementation (non-linear inode
numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
'rm -fr $dir')?
The problem is that seekdir/telldir is terminally broken (and so is
NFSv2 for using such a tiny cookie) in that it fundamentally assumes
a linear data structure. If you're going to use any kind of
tree-based data structure, a 32-bit "offset" for seekdir/telldir just
doesn't cut it. We actually play games where we memoize the low
32-bits of the hash and keep track of which cookies we hand out via
seekdir/telldir so that things mostly work --- except for NFSv2, where
with the 32-bit cookie, you're just hosed.
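
To make the cookie problem concrete, here is a rough sketch in C. It is
not ext4 code: cookie32(), cookies_collide() and count_collisions() are
made-up names, and treating the directory position as one 64-bit hash
value is an assumption for illustration. It only shows that two distinct
positions can end up with the same cookie once everything but the low
32 bits is thrown away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: keep only the low 32 bits of a 64-bit hash
 * position, which is all that fits in a 32-bit telldir()/NFSv2 cookie. */
static uint32_t cookie32(uint64_t hash_pos)
{
    return (uint32_t)(hash_pos & 0xffffffffu);
}

/* Two different positions collide if they map to the same cookie. */
static int cookies_collide(uint64_t a, uint64_t b)
{
    return a != b && cookie32(a) == cookie32(b);
}

/* Count the pairs of distinct positions that share a 32-bit cookie. */
static size_t count_collisions(const uint64_t *pos, size_t n)
{
    size_t hits = 0;

    assert(pos != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (cookies_collide(pos[i], pos[j]))
                hits++;
    return hits;
}
```

A reader that resumes from a colliding cookie cannot tell which of the
two positions was meant, which is why the cookies handed out have to be
tracked on the side.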
The reason why we have to iterate over the directory in hash tree
order is that if we have a leaf node split, half the directory
entries get copied to another directory block. Given the promises made
by seekdir() and telldir() about directory entries appearing exactly
once during a readdir() stream, even if you hold the fd open for days
or weeks, you really have to iterate over things in hash
order.
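
A toy model of the point above, as a sketch only: the structures and the
names next_by_hash(), split_block() and walk_with_split() are invented
for illustration and look nothing like the real htree code. It assumes
unique, non-zero hashes, with 0 reserved to mean "end of directory". The
reader remembers only the last hash it returned, so a leaf split in the
middle of the walk moves entries between blocks without causing any
entry to be skipped or returned twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 4
#define MAX_SLOTS  8

struct block { uint32_t hash[MAX_SLOTS]; size_t n; };
struct dir   { struct block blk[MAX_BLOCKS]; size_t nblk; };

/* Hash-order cursor: smallest hash strictly greater than `after`,
 * or 0 if nothing is left. Where the entry lives does not matter. */
static uint32_t next_by_hash(const struct dir *d, uint32_t after)
{
    uint32_t best = 0;

    for (size_t b = 0; b < d->nblk; b++)
        for (size_t i = 0; i < d->blk[b].n; i++) {
            uint32_t h = d->blk[b].hash[i];
            if (h > after && (best == 0 || h < best))
                best = h;
        }
    return best;
}

/* Leaf split: move every entry with hash >= pivot into a new block. */
static void split_block(struct dir *d, size_t src, uint32_t pivot)
{
    assert(src < d->nblk && d->nblk < MAX_BLOCKS);

    struct block *from = &d->blk[src];
    struct block *to = &d->blk[d->nblk++];
    size_t keep = 0;

    to->n = 0;
    for (size_t i = 0; i < from->n; i++) {
        if (from->hash[i] >= pivot)
            to->hash[to->n++] = from->hash[i];
        else
            from->hash[keep++] = from->hash[i];
    }
    from->n = keep;
}

/* Walk in hash order, splitting block 0 once `split_after` entries have
 * been returned. Records what the reader saw in out[]. */
static size_t walk_with_split(struct dir *d, size_t split_after,
                              uint32_t pivot, uint32_t *out, size_t cap)
{
    size_t seen = 0;
    uint32_t pos = 0;

    while (seen < cap) {
        uint32_t h = next_by_hash(d, pos);
        if (h == 0)
            break;
        out[seen++] = h;
        pos = h;
        if (seen == split_after)
            split_block(d, 0, pivot);
    }
    return seen;
}
```

A cursor of the form (block number, slot number) would not survive the
split in this model, since the entries after the cursor change slots
and blocks underneath it.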
I'd have to look, since it's been too many years, but as I recall the
problem was that there is a common path for NFSv2 and NFSv3/v4, so we
don't know whether we can hand back a 32-bit cookie or a 64-bit
cookie, so we're always handing the NFS server a 32-bit "offset", even
though we could do better. Actually, if we had an interface where we
could give you a 128-bit "offset" into the directory, we could
probably eliminate the duplicate cookie problem entirely. We would just
send 64 bits' worth of hash, plus the first two bytes of the file
name.
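
For illustration, a cookie along those lines might be laid out roughly
as below. No such interface exists in this thread; struct dir_cookie128,
make_cookie() and cookie_equal() are hypothetical names, and zero
padding for names shorter than two bytes is an assumption. The 64-bit
hash plus two name bytes use 80 of the 128 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wide directory cookie: 64 bits of hash plus the first
 * two bytes of the file name to tell apart entries whose hashes match. */
struct dir_cookie128 {
    uint64_t hash;
    uint8_t  name[2];   /* zero padded if the name is shorter */
};

static struct dir_cookie128 make_cookie(uint64_t hash, const char *name)
{
    struct dir_cookie128 c = { .hash = hash, .name = { 0, 0 } };
    size_t len;

    assert(name != NULL);
    len = strlen(name);
    if (len > 0)
        c.name[0] = (uint8_t)name[0];
    if (len > 1)
        c.name[1] = (uint8_t)name[1];
    return c;
}

/* Cookies are equal only if both the hash and the name prefix match. */
static int cookie_equal(const struct dir_cookie128 *a,
                        const struct dir_cookie128 *b)
{
    return a->hash == b->hash &&
           memcmp(a->name, b->name, sizeof a->name) == 0;
}
```

Two names that hash to the same 64-bit value would still get distinct
cookies here unless their first two bytes also agree.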
If it's of interest, we've implemented a 64-bit hash mode for ext4 to
solve just this problem for Lustre. The llseek() code will return a
64-bit hash value on 64-bit systems, unless it is running for some
process that needs a 32-bit hash value (only NFSv2, AFAIK).
The attached patch can at least form the basis for being able to return
64-bit hash values for userspace/NFSv3/v4 when usable. The patch
is NOT usable as it stands now, since I've had to modify it from the
version that we are currently using for Lustre (this version hasn't
actually been compiled), but it at least shows the outline of what needs
to be done to get this working. None of the NFS side is implemented.
Thanks Andreas! I haven't tested it yet, but the general idea looks
good. I guess the lower part of the patch (netfilter stuff) got
in accidentally?
Cheers,
Bernd