I was asked to figure out why listing very large directories was slow.
More specifically, why concurrently listing the same large directory is
/very/ slow. It seems that sometimes a user's reaction to waiting for
'ls' to complete is to start a few more... and then their machine takes
a very long time to complete that work.

I can reproduce that finding. As an example:

    time ls -fl /dir/with/200000/entries/ >/dev/null
    real 0m10.766s
    user 0m0.716s
    sys 0m0.827s

But...

    for i in {1..10}; do time ls -fl /dir/with/200000/entries/ >/dev/null & done

Each of these ^^ 'ls' commands takes 4 to 5 minutes to complete.

The problem is that concurrent 'ls' commands stack up in nfs_readdir(),
both waiting on the next page and taking turns filling the next page
with xdr, but only one of them will have desc->plus set, because
setting it clears the flag on the directory. So if a page is filled by
a process that doesn't have desc->plus, the next pass through lookup()
dumps the entire page cache with nfs_force_use_readdirplus(). The next
readdir then starts all over filling the pagecache. Forward progress
happens, but only after many steps back re-filling the pagecache.
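
To make the flag behaviour concrete, here is a toy model in C. It is
not kernel code: advise_rdplus, toy_use_readdirplus() and run_readers()
are made-up names, and it assumes the per-directory hint behaves like
an atomic test-and-clear, as described above. Whichever reader checks
it first gets plus and clears it, so every other concurrent reader
fills its pages without plus.

```c
/* Toy model, not kernel code: every name below is hypothetical. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_READERS 64

/* Stands in for the per-directory "use readdirplus" hint. */
static atomic_bool advise_rdplus;
/* Counts how many readers ended up with plus set. */
static atomic_int readers_with_plus;

/* Test-and-clear: returns the old value and leaves the hint cleared,
 * so only one caller sees true each time the hint is set. */
static bool toy_use_readdirplus(void)
{
    return atomic_exchange(&advise_rdplus, false);
}

/* One concurrent "ls": decides once whether it fills pages with plus. */
static void *toy_reader(void *arg)
{
    (void)arg;
    if (toy_use_readdirplus())
        atomic_fetch_add(&readers_with_plus, 1);
    /* Readers that got false would fill their pages with plain READDIR. */
    return NULL;
}

/* Starts n readers on a directory whose hint is set; returns how many saw plus. */
static int run_readers(int n)
{
    pthread_t t[MAX_READERS];

    if (n > MAX_READERS)
        n = MAX_READERS;
    atomic_store(&advise_rdplus, true);
    atomic_store(&readers_with_plus, 0);
    for (int i = 0; i < n; i++)
        pthread_create(&t[i], NULL, toy_reader, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&readers_with_plus);
}
```

With ten readers, run_readers(10) returns 1: nine of the ten would fill
pages without readdirplus, which is what later triggers the cache dump.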

To me, the most obvious fix would be to serialize nfs_readdir() on the
directory inode, so I'll follow up with a patch that does that with
nfsi->rwsem. With that, each 'ls' in the parallel test above completes
in 12 seconds.

This only works because concurrent 'ls' processes use a consistent
buffer size, so a waiting nfs_readdir() that started in the same place
on an unmodified directory should always hit the cache after waiting.
Serializing nfs_readdir() will not solve this problem for concurrent
callers with differing buffer sizes, or starting at different offsets,
since there's a good chance the waiting readdir() will not see the
readdirplus flag when it resumes and so will not prime the dcache.
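
The limitation can be sketched the same way. This toy model uses
hypothetical names and assumes the callers are already serialized. It
caches one page keyed by its starting offset and consumes the
readdirplus hint on the first server fill. A caller at the same offset
gets the plus-filled page back; a caller at a different offset misses,
goes to the server, and no longer sees the hint.

```c
/* Toy model, not kernel code: every name below is hypothetical. */
#include <stdbool.h>

struct toy_dir {
    bool advise_rdplus; /* readdirplus hint, consumed by the first server fill */
    long cached_offset; /* offset the cached page was filled from, or -1 */
    bool cached_plus;   /* whether that page was filled with readdirplus */
};

/* Assumed to run under the directory lock, one caller at a time.
 * Returns true if the caller's page was filled with readdirplus,
 * i.e. the dcache would be primed for its entries. */
static bool toy_readdir(struct toy_dir *d, long offset)
{
    if (d->cached_offset == offset)
        return d->cached_plus; /* hit: reuse what an earlier caller fetched */

    /* Miss: go to the "server"; the hint is tested and cleared. */
    bool plus = d->advise_rdplus;
    d->advise_rdplus = false;
    d->cached_offset = offset;
    d->cached_plus = plus;
    return plus;
}
```

Starting from { true, -1, false }, toy_readdir(&d, 0) returns true
twice in a row, but a following toy_readdir(&d, 4096) returns false.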

While I think it's an OK fix, it feels bad to serialize. At the same
time, nfs_readdir() is already serialized on the pagecache when
concurrent callers need to go to the server. There might be other
problems I haven't thought about.

Maybe there's another way to fix this, or maybe we can just say "Don't
do ls more than once, you impatient bastards!"

Ben