Re: readdir vs. getattr

On Wed, 29 Jan 2014 07:18:41 -0500 Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
wrote:

> On Wed, 29 Jan 2014 10:21:43 +0100 (CET)
> "Mkrtchyan, Tigran" <tigran.mkrtchyan@xxxxxxx> wrote:
> 
> > (without knowing kernel details)
> > 
> > What about a stat counter that triggers readdirplus instead of the next getattr
> > once some threshold is reached? Let's say the directory has more than 8000 entries
> > and getattr was called on 10% of the files. Of course you have to watch this
> > per process.
> > 
> > Tigran.
> > 
> 
> That might be the only possible solution, but as always with these
> sorts of heuristics, it's bound to help some workloads and hurt others.
> 
> I'm not sure you'd need to watch that per-process. The pagecache data is
> shared after all. The problem though is that you'd need to keep track
> of which (or how many different) entries in the dir have gotten a
> GETATTR since the last READDIRPLUS.
> 
> IOW, if someone is repeatedly hammering a single entry in a directory
> with stat() calls that need the atime, you probably don't want to force
> a READDIRPLUS on the next readdir() call. If however, they're stat'ing
> each entry in the directory in turn for a "ls -l" type workload then
> you probably do want to dump the cache and just go for a READDIRPLUS.
> 
> The real way to fix this is probably for someone to pick up and drive
> the xstat()/readdirplus() work. If 'ls -l' used a readdirplus() syscall
> to do its bidding then you'd have a pretty good idea of what you'd need
> to do. That won't help in the short term though...

I agree that improved syscalls are probably the only way to get a perfect
solution, so I tried thinking of an imperfect solution that is still a net
positive.

The customer issue that started me looking at this involved a directory with
1 million files.  I don't know if this is a genuine use case or just a test to
make sure it all scales nicely, but it is a case where a partial fix could
make a big difference.

"ls -l" first doest a "getdents" with a 32K buffer, then stats all those
files, then does another getdents.
With 2000 file, that is only one getdents call.  With 1,000,000 it would be
hundreds.
It is not possible to know that the first getdents call should force a
readdirplus.  It is quite easy to know that the second and later ones should.
This approach will make no difference to the 2000 file case, but will restore
better than 99% of performance in the 1,000,000 file case.

So I'm suggesting the following patch.  I don't fully understand the cache
rules, so it might be more complicated than needed, and it might even be
wrong.  But it seems to work in some simple testing.
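
One easy way to watch the effect is to compare the GETATTR and READDIRPLUS
counters from /proc/self/mountstats before and after an "ls -l" of the big
directory.  This assumes an NFSv3 mount and a kernel that exposes per-op
statistics there; if I'm reading the format correctly, the first number after
the op name is the operation count.

/* Print the GETATTR and READDIRPLUS per-op lines from
 * /proc/self/mountstats so the counters can be compared before and
 * after running the workload.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/self/mountstats", "r");
	char line[512];

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		char *p = line;

		while (*p == ' ' || *p == '\t')
			p++;			/* per-op lines are indented */
		if (strncmp(p, "GETATTR:", 8) == 0 ||
		    strncmp(p, "READDIRPLUS:", 12) == 0)
			fputs(p, stdout);
	}
	fclose(f);
	return 0;
}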

The change to nfs_update_inode fixes an issue that had me stumped for a
while.  The client was still sending lots of GETATTR requests even after it
had switched to READDIRPLUS, instead of using the cached info.  So that might
be a genuine bug that should be fixed independently of this patch.

The idea is that the first time we find that looking up a file in a directory
will need a GETATTR, we flush the cached data for that directory so that
READDIRPLUS will be used in future.

Comments very welcome.

Thanks,
NeilBrown


diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index be38b573495a..b237fd7f2e0e 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -845,9 +845,12 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	desc->dir_cookie = &dir_ctx->dir_cookie;
 	desc->decode = NFS_PROTO(inode)->decode_dirent;
 	desc->plus = nfs_use_readdirplus(inode, ctx) ? 1 : 0;
+	if (desc->plus && ctx->pos == 0)
+		clear_bit(NFS_INO_DID_FLUSH, &NFS_I(inode)->flags);
 
 	nfs_block_sillyrename(dentry);
-	if (ctx->pos == 0 || nfs_attribute_cache_expired(inode))
+	if (ctx->pos == 0 || nfs_attribute_cache_expired(inode) ||
+	    (NFS_I(inode)->cache_validity & NFS_INO_INVALID_DATA))
 		res = nfs_revalidate_mapping(inode, file->f_mapping);
 	if (res < 0)
 		goto out;
@@ -1080,6 +1083,16 @@ static int nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags)
 
 	/* Force a full look up iff the parent directory has changed */
 	if (!nfs_is_exclusive_create(dir, flags) && nfs_check_verifier(dir, dentry)) {
+		if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS)
+		    && ((NFS_I(inode)->cache_validity &
+			 (NFS_INO_INVALID_ATTR|NFS_INO_INVALID_LABEL))
+			|| nfs_attribute_cache_expired(inode))
+		    && !test_and_set_bit(NFS_INO_DID_FLUSH, &NFS_I(dir)->flags)
+			) {
+			nfs_advise_use_readdirplus(dir);
+			goto out_zap_parent;
+		}
+
 		if (nfs_lookup_verify_inode(inode, flags))
 			goto out_zap_parent;
 		goto out_valid;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 0ae5807480f4..c282ce3e5349 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -218,6 +218,7 @@ struct nfs_inode {
 #define NFS_INO_COMMIT		(7)		/* inode is committing unstable writes */
 #define NFS_INO_LAYOUTCOMMIT	(9)		/* layoutcommit required */
 #define NFS_INO_LAYOUTCOMMITTING (10)		/* layoutcommit inflight */
+#define NFS_INO_DID_FLUSH	(11)		/* dir pages flushed to force READDIRPLUS */
 
 static inline struct nfs_inode *NFS_I(const struct inode *inode)
 {
