On Wed, 21 Sep 2011 15:30:12 -0400
Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> wrote:

> On Wed, 2011-09-21 at 15:10 -0400, Jeff Layton wrote:
> > On Wed, 21 Sep 2011 14:53:12 -0400
> > Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> wrote:
> >
> > > On Wed, 2011-09-21 at 11:58 -0400, Jeff Layton wrote:
> > > > We had a regression reported against RHEL concerning the opening of
> > > > directories and it looks like that same problem is in current mainline
> > > > code too. If you do the following on a directory that is not yet in the
> > > > dcache you get an EISDIR error:
> > > >
> > > >     open("/mnt/nfs/dir1", O_RDONLY) = -1 EISDIR (Is a directory)
> > > >
> > > > If however, you stat the directory first, the open works. The
> > > > difference seems to be that in the first case we're going through the
> > > > lookup codepath, and in the second we go through d_revalidate.
> > > >
> > > > In the first case, we send an OPEN call to the server and it responds
> > > > with NFS4ERR_ISDIR. That gets translated to -EISDIR, and returned to
> > > > userspace. It wasn't always this way though, and I think the regression
> > > > was introduced in commit d953126a2.
> > > >
> > > > That patch was added to fix an oops due to a buggy server, and I'm
> > > > unclear on how best to fix this. It seems like we need to allow the
> > > > server to fall back to doing a normal lookup when we get -EISDIR on the
> > > > OPEN call, but how do we ensure that we don't end up with the same oops
> > > > from that server bug?
> > >
> > > How about returning an error if we get to the file->f_ops->open on a
> > > regular file in NFSv4?
> > >
> >
> > That would probably be reasonable. I'll see if I can come up with a
> > patch. The tricky part of course is ensuring that nothing regresses...
> >
> > I think this is probably safe for the most part. The d_revalidate
> > codepath has always allowed you to end up with an open context with
> > NULL state.
> >
> > Granted the buggy server case here is exceedingly rare, but it seems
> > like the code already assumes that a ctx reached via filp may have a
> > NULL state pointer.
>
> I agree that the buggy server is rare, but you can potentially reproduce
> the problem using something like the following script
>
> mkdir b; touch a
> while true
> do
>     mv a c; mv b a; mv c b
> done
>
> It will probably mostly either succeed or fail with ENOENT, but every
> now and then it should be possible to tickle the above issue.
>

Ok, I sent you a patch that fixes the bug. I ran the above on the
server and a program in a loop that did opens on the client, but was
never able to reproduce the server-side bug. It seemed to be OK in
other testing though.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
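
For anyone who wants to repeat the client side of that test, here is a
minimal sketch of a loop that opens a path read-only without stat()ing it
first and counts the EISDIR failures. It is not the actual program used in
the test above; hammer_open() and struct open_stats are names made up for
illustration, and the path and iteration count are whatever the caller
passes in.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result counters; not taken from any existing test tool. */
struct open_stats {
    long ok;      /* opens that succeeded */
    long eisdir;  /* opens that failed with EISDIR (the regression) */
    long other;   /* opens that failed with any other errno, e.g. ENOENT */
};

/*
 * Hypothetical helper: call open(path, O_RDONLY) `iterations` times with
 * no prior stat() and classify each result by errno.
 */
static struct open_stats hammer_open(const char *path, long iterations)
{
    struct open_stats st = {0, 0, 0};

    for (long i = 0; i < iterations; i++) {
        int fd = open(path, O_RDONLY);

        if (fd >= 0) {
            st.ok++;
            close(fd);
        } else if (errno == EISDIR) {
            st.eisdir++;
        } else {
            st.other++;
        }
    }
    return st;
}
```

Calling hammer_open("/mnt/nfs/dir1", n) from a small main() on a fresh
mount, where the directory is not yet in the dcache, should show a nonzero
eisdir count on an affected kernel. While the rename script is running on
the server, ENOENT results landing in `other` are expected.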