Re: [PATCH v1] shmem: stable directory cookies

Benjamin Coddington <bcodding@xxxxxxxxxx> · Thu, 04 May 2023 16:21:24 -0400

On 2 May 2023, at 20:43, Chuck Lever III wrote:

>> On May 2, 2023, at 8:12 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Mon, 17 Apr 2023 15:23:10 -0400 Chuck Lever <cel@xxxxxxxxxx> wrote:
>>
>>> From: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>
>>> The current cursor-based directory cookie mechanism doesn't work
>>> when a tmpfs filesystem is exported via NFS. This is because NFS
>>> clients do not open directories: each READDIR operation has to open
>>> the directory on the server, read it, then close it. The cursor
>>> state for that directory, being associated strictly with the opened
>>> struct file, is then discarded.
>>>
>>> Directory cookies are cached not only by NFS clients, but also by
>>> user space libraries on those clients. Essentially there is no way
>>> to invalidate those caches when directory offsets have changed on
>>> an NFS server after the offset-to-dentry mapping changes.
>>>
>>> The solution we've come up with is to make the directory cookie for
>>> each file in a tmpfs filesystem stable for the life of the directory
>>> entry it represents.
>>>
>>> Add a per-directory xarray. shmem_readdir() uses this to map each
>>> directory offset (an loff_t integer) to the memory address of a
>>> struct dentry.
>>>
>>
>> How have people survived for this long with this problem?

They survived this long by not considering their current directory offset to
be a stationary position in the stream after removing chunks of that stream,
as per some POSIX.  However, git does this:

opendir
while getdents
    unlink(dentries)
closedir
assert(directory empty)

This pattern isn't guaranteed to always produce an empty directory, and
filesystems aren't wrong when it doesn't, but they could probably do better.

Libfs, on the other hand, conservatively closes and re-opens the directory
after removing some entries in order to ensure none are skipped.

> It's less of a problem without NFS in the picture; local
> applications can hold the directory open, and that preserves
> the seek cursor. But you can still trigger it.
>
> Also, a plurality of applications are well-behaved in this
> regard. It's just the more complex and more useful ones
> (like git) that seem to trigger issues.
>
> It became less bearable for NFS because of a recent change
> on the Linux NFS client to optimize directory read behavior:
>
> 85aa8ddc3818 ("NFS: Trigger the "ls -l" readdir heuristic sooner")

My ears burn again.

> Trond argued that tmpfs directory cookie behavior has always
> been problematic (eg broken) therefore this commit does not
> count as a regression. However, it does make tmpfs exports
> less usable, breaking some tests that have always worked.

As luck would have it, since on NFS the breakage also depends on the length
of the filenames.

It's also possible to fix git's remove_dir_recurse(), but making tmpfs have
stable directory offsets would be an improvement for everyone, and especially
for NFS.

>> It's a lot of new code -
>
> I don't feel that this is a lot of new code:
>
> include/linux/shmem_fs.h |    2
> mm/shmem.c               |  213 +++++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 201 insertions(+), 14 deletions(-)
>
> But I agree it might look a little daunting on first review.
> I am happy to try to break this single patch up or consider
> other approaches.
>
> We could, for instance, tuck a little more of this into
> lib/fs. Copying the readdir and directory seeking
> implementation from simplefs to tmpfs is one reason
> the insertion count is worrisome.
>
>
>> can we get away with simply disallowing
>> exports of tmpfs?
>
> I think the bottom line is that you /can/ trigger this
> behavior without NFS, just not as quickly. The threshold
> is high enough that most use cases aren't bothered by
> this right now.

Yes, you can run into this problem directly on tmpfs.

> We'd rather not disallow exporting tmpfs. It's a very
> good testing platform for us, and disallowing it would
> be a noticeable regression for some folks.
>
>
>> How can we maintain this?  Is it possible to come up with a test
>> harness for inclusion in kernel selftests?
>
> There is very little directory cookie testing that I know of
> in the obvious place: fstests. That would be where this stuff
> should be unit tested, IMO.

Yes, we could write a test, but a test failure shouldn't mean the
filesystem is wrong or broken.

Ben