On Fri, 2022-02-25 at 15:41 -0500, Trond Myklebust wrote:
> On Fri, 2022-02-25 at 15:23 -0500, Benjamin Coddington wrote:
> > On 25 Feb 2022, at 10:34, Benjamin Coddington wrote:
> > > Ok, so I'm reading that further proof is required, and I'm happy to do the work. Thanks for the replies here and elsewhere.
> >
> > Here's an example of this problem on a tmpfs export using v8 of your patchset with the fix to set the change_attr in nfs_readdir_page_init_array().
> >
> > I'm using tmpfs, because it reliably orders cookies in reverse order of creation (or perhaps sorted by name).
> >
> > The program drives both the client side and the server side, so on this one system, /exports/tmpfs is:
> >
> > tmpfs /exports/tmpfs tmpfs rw,seclabel,relatime,size=102400k 0 0
> >
> > and /mnt/localhost is:
> >
> > localhost:/exports/tmpfs /mnt/localhost/tmpfs nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0
> >
> > The program creates 256 files on the server, walks through them once on the client, deletes the last 127 on the server, drops the first page from the pagecache, and walks through them again on the client.
> >
> > The second listing produces 124 duplicate entries.
> >
> > I just have to say again: this behavior is _new_ (but not new to me), and it is absolutely going to crop up on our customers' systems that are walking through millions of directory entries on loaded servers under memory pressure. The directory listings as a whole become very likely to be nonsense at random times. I realize they are not /supposed/ to be coherent, but what we're getting here is going to be far, far less coherent, and it's going to be a mess.
> >
> > There are other scenarios that are worse when the cookies aren't ordered: you can end up with a premature EOF, or get into repeating patterns.
> >
> > Please compare this with v3, and with the behavior before this patchset, and tell me if I'm not justified playing chicken little.
> >
> > Here's what I do to run this:
> >
> > mount -t tmpfs -osize=100M tmpfs /exports/tmpfs/
> > exportfs -ofsid=0 *:/exports
> > exportfs -ofsid=1 *:/exports/tmpfs
> > mount -t nfs -ov4.1,sec=sys localhost:/exports /mnt/localhost
> > ./getdents2
> >
> > Compare "Listing 1" with "Listing 2".
> >
> > I would also do a "rm -f /exports/tmpfs/*" between each run.
> >
> > Thanks again for your time and work.
> >
> > Ben
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <fcntl.h>
> > #include <sched.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <sys/syscall.h>
> > #include <string.h>
> >
> > #define NFSDIR "/mnt/localhost/tmpfs"
> > #define LOCDIR "/exports/tmpfs"
> > #define BUF_SIZE 4096
> >
> > int main(int argc, char **argv)
> > {
> >         int i, dir_fd, bpos, total = 0;
> >         size_t nread;
> >         struct linux_dirent {
> >                 long d_ino;
> >                 off_t d_off;
> >                 unsigned short d_reclen;
> >                 char d_name[];
> >         };
> >         struct linux_dirent *d;
> >         char buf[BUF_SIZE];
> >
> >         /* create files: */
> >         for (i = 0; i < 256; i++) {
> >                 sprintf(buf, LOCDIR "/file_%03d", i);
> >                 close(open(buf, O_CREAT, 666));
> >         }
> >
> >         dir_fd = open(NFSDIR, O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
> >         if (dir_fd < 0) {
> >                 perror("cannot open dir");
> >                 return 1;
> >         }
> >
> >         while (1) {
> >                 nread = syscall(SYS_getdents, dir_fd, buf, BUF_SIZE);
> >                 if (nread == 0 || nread == -1)
> >                         break;
> >                 for (bpos = 0; bpos < nread;) {
> >                         d = (struct linux_dirent *) (buf + bpos);
> >                         printf("%s\n", d->d_name);
> >                         total++;
> >                         bpos += d->d_reclen;
> >                 }
> >         }
> >         printf("Listing 1: %d total dirents\n", total);
> >
> >         /* rewind */
> >         lseek(dir_fd, 0, SEEK_SET);
> >
> >         /* drop the first page */
> >         posix_fadvise(dir_fd, 0, 4096, POSIX_FADV_DONTNEED);
> >
> >         /* delete the last 127 files: */
> >         for (i = 127; i < 256; i++) {
> >                 sprintf(buf, LOCDIR "/file_%03d", i);
> >                 unlink(buf);
> >         }
> >
> >         total = 0;
> >         while (1) {
> >                 nread = syscall(SYS_getdents, dir_fd, buf, BUF_SIZE);
> >                 if (nread == 0 || nread == -1)
> >                         break;
> >                 for (bpos = 0; bpos < nread;) {
> >                         d = (struct linux_dirent *) (buf + bpos);
> >                         printf("%s\n", d->d_name);
> >                         total++;
> >                         bpos += d->d_reclen;
> >                 }
> >         }
> >         printf("Listing 2: %d total dirents\n", total);
> >
> >         close(dir_fd);
> >         return 0;
> > }
>
> tmpfs is broken on the server. It doesn't provide stable cookies, and knfsd doesn't use the verifier to tell you that the cookie assignment has changed.
>
> Re-export of tmpfs has never worked reliably.

What I mean is that tmpfs is always a poor choice for NFS because seekdir()/telldir() don't work reliably, and so READDIR cannot work reliably, since it relies on open()+seekdir() to continue reading the directory in successive RPC calls.

Anyhow, to get back to your question about whether we should or should not be detecting that the directory changed when you delete the files on the server: the answer is no. Nothing in the above guarantees that the cache is revalidated. NFS close-to-open cache consistency means that we guarantee to revalidate the cached data on open(), and only then. That guarantee does not extend to lseek() or to the rewinddir()/seekdir() wrappers.

If your application wants stronger cache consistency, then there are tricks to enable that. Now that statx() has the AT_STATX_FORCE_SYNC flag, you could use that to force a revalidation of the directory attributes on the client. You might also use the posix_fadvise() trick to try to clear the cache. However, note that none of these tricks is guaranteed to work. They're not reliable now, and that situation is unlikely to change in the future barring a deliberate (and documented!) change in kernel policy.

So as of now, the only way to reliably introduce a revalidation point in your testcase above is to close() and then open().

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
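A minimal sketch of the two client-side options discussed above: a statx() call with AT_STATX_FORCE_SYNC to force revalidation of the directory attributes, and a close()+open() cycle, which is the reliable revalidation point. It assumes the same /mnt/localhost/tmpfs mount as the test program; error handling is abbreviated, and this is only an illustration of the idea, not a guarantee of stronger consistency.

#define _GNU_SOURCE
#include <fcntl.h>          /* AT_FDCWD, AT_STATX_FORCE_SYNC, open() flags */
#include <stdio.h>
#include <sys/stat.h>       /* statx(), struct statx, STATX_* masks */
#include <unistd.h>

#define NFSDIR "/mnt/localhost/tmpfs"

int main(void)
{
        struct statx stx;
        int dir_fd;

        /* Option 1: force the client to fetch fresh directory attributes
         * from the server instead of trusting its attribute cache.  This
         * may trigger revalidation, but it is not guaranteed to. */
        if (statx(AT_FDCWD, NFSDIR, AT_STATX_FORCE_SYNC,
                  STATX_BASIC_STATS, &stx) != 0)
                perror("statx");

        /* Option 2: close() and re-open() the directory.  Close-to-open
         * consistency guarantees that cached data is revalidated on the
         * open(), so this is the reliable revalidation point. */
        dir_fd = open(NFSDIR, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        if (dir_fd < 0) {
                perror("open");
                return 1;
        }
        /* ... list the directory with getdents() as in the test above ... */
        close(dir_fd);
        return 0;
}

Note that the statx() call only forces revalidation of the directory attributes; dropping already-cached pages still depends on the posix_fadvise() trick, and, as stated above, neither is guaranteed to work.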