On Wed, 2023-05-03 at 00:43 +0000, Chuck Lever III wrote:
>
> > On May 2, 2023, at 8:12 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, 17 Apr 2023 15:23:10 -0400 Chuck Lever <cel@xxxxxxxxxx> wrote:
> >
> > > From: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > >
> > > The current cursor-based directory cookie mechanism doesn't work
> > > when a tmpfs filesystem is exported via NFS. This is because NFS
> > > clients do not open directories: each READDIR operation has to open
> > > the directory on the server, read it, then close it. The cursor
> > > state for that directory, being associated strictly with the opened
> > > struct file, is then discarded.
> > >
> > > Directory cookies are cached not only by NFS clients, but also by
> > > user space libraries on those clients. Essentially there is no way
> > > to invalidate those caches when directory offsets have changed on
> > > an NFS server after the offset-to-dentry mapping changes.
> > >
> > > The solution we've come up with is to make the directory cookie for
> > > each file in a tmpfs filesystem stable for the life of the directory
> > > entry it represents.
> > >
> > > Add a per-directory xarray. shmem_readdir() uses this to map each
> > > directory offset (an loff_t integer) to the memory address of a
> > > struct dentry.
> > >
> >
> > How have people survived for this long with this problem?
>
> It's less of a problem without NFS in the picture; local
> applications can hold the directory open, and that preserves
> the seek cursor. But you can still trigger it.
>
> Also, a plurality of applications are well-behaved in this
> regard. It's just the more complex and more useful ones
> (like git) that seem to trigger issues.
>
> It became less bearable for NFS because of a recent change
> on the Linux NFS client to optimize directory read behavior:
>
> 85aa8ddc3818 ("NFS: Trigger the "ls -l" readdir heuristic sooner")
>
> Trond argued that tmpfs directory cookie behavior has always
> been problematic (eg broken) therefore this commit does not
> count as a regression. However, it does make tmpfs exports
> less usable, breaking some tests that have always worked.
>
>
> > It's a lot of new code -
>
> I don't feel that this is a lot of new code:
>
>  include/linux/shmem_fs.h |    2
>  mm/shmem.c               |  213 +++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 201 insertions(+), 14 deletions(-)
>
> But I agree it might look a little daunting on first review.
> I am happy to try to break this single patch up or consider
> other approaches.
>

I wonder whether you really need an xarray here?

dcache_readdir walks the d_subdirs list. We add things to d_subdirs at
d_alloc time (and in d_move). If you were to assign its dirindex when
the dentry gets added to d_subdirs (maybe in ->d_init?) then you'd have
a list already ordered by index, and could deal with missing indexes
easily.

It's not as efficient as the xarray if you have to seek through a big
dir, but if keeping the changes tiny is a goal then that might be
another way to do this.

> We could, for instance, tuck a little more of this into
> lib/fs. Copying the readdir and directory seeking
> implementation from simplefs to tmpfs is one reason
> the insertion count is worrisome.
>
>
> > can we get away with simply disallowing
> > exports of tmpfs?
>
> I think the bottom line is that you /can/ trigger this
> behavior without NFS, just not as quickly. The threshold
> is high enough that most use cases aren't bothered by
> this right now.
>
> We'd rather not disallow exporting tmpfs. It's a very
> good testing platform for us, and disallowing it would
> be a noticeable regression for some folks.
>

Yeah, I'd not be in favor of that either.
We've had an exportable tmpfs for a long time. It's a good way to do
testing of the entire NFS server stack, without having to deal with
underlying storage.

> > How can we maintain this? Is it possible to come up with a test
> > harness for inclusion in kernel selftests?
>
> There is very little directory cookie testing that I know of
> in the obvious place: fstests. That would be where this stuff
> should be unit tested, IMO.
>

I'd like to see this too. It's easy for programs to get this wrong. In
this case, could we emulate the NFS behavior by doing this in a loop
over a large directory?

    opendir
    seekdir (to result of last telldir)
    readdir
    unlink
    telldir
    closedir

At the end of it, check whether there are any entries left over.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
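
The opendir/seekdir/readdir/unlink/telldir/closedir loop proposed above
could be sketched in userspace C roughly as follows. This is only an
illustration, not an fstests case. The helper names (is_dot, populate,
count_entries, unlink_via_cookies) are made up for the example. It also
assumes a libc, such as glibc on 64-bit Linux, where the value returned
by telldir() is the raw directory offset and is still meaningful to
seekdir() on a freshly opened stream; POSIX only defines seekdir() for
positions obtained from the same stream, so that part is an assumption.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* seekdir, telldir, dirfd, openat, unlinkat */
#endif
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* True for the "." and ".." entries, which the test must never unlink. */
int is_dot(const char *name)
{
    return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}

/* Create nfiles empty files (f00000, f00001, ...) in dir; 0 on success. */
int populate(const char *dir, int nfiles)
{
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);

    if (dfd < 0)
        return -1;
    for (int i = 0; i < nfiles; i++) {
        char name[32];
        int fd;

        snprintf(name, sizeof(name), "f%05d", i);
        fd = openat(dfd, name, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) {
            close(dfd);
            return -1;
        }
        close(fd);
    }
    close(dfd);
    return 0;
}

/* Count entries other than "." and ".."; -1 if dir cannot be opened. */
int count_entries(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *de;
    int n = 0;

    if (!d)
        return -1;
    while ((de = readdir(d)) != NULL)
        if (!is_dot(de->d_name))
            n++;
    closedir(d);
    return n;
}

/*
 * NFS-like pass: every iteration reopens the directory, seeks to the
 * cookie saved by the previous iteration, reads one entry, unlinks it,
 * saves a new cookie and closes the directory again.
 * Returns the number of entries left over afterwards, or -1 on error.
 */
int unlink_via_cookies(const char *dir)
{
    long cookie = 0;
    int have_cookie = 0;

    assert(dir != NULL);
    for (;;) {
        DIR *d = opendir(dir);
        struct dirent *de;

        if (!d)
            return -1;
        if (have_cookie)
            seekdir(d, cookie);   /* assumption: cookie valid across streams */
        do {
            de = readdir(d);
        } while (de && is_dot(de->d_name));
        if (!de) {                /* end of directory (or read error) */
            closedir(d);
            break;
        }
        if (unlinkat(dirfd(d), de->d_name, 0) != 0) {
            closedir(d);
            return -1;
        }
        cookie = telldir(d);
        have_cookie = 1;
        closedir(d);
    }
    return count_entries(dir);
}
```

A caller would create a scratch directory on the filesystem under test,
call populate() on it and then unlink_via_cookies(). A nonzero return
means some entries were skipped, which is the leftover-entries check
described above.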