> On Nov 22, 2024, at 7:57 AM, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Thu, Nov 21, 2024 at 02:54:16PM +0000, Chuck Lever III wrote: >> >> >>> On Nov 21, 2024, at 3:34 AM, Christian Brauner <brauner@xxxxxxxxxx> wrote: >>> >>> On Wed, Nov 20, 2024 at 10:05:42AM -0500, Chuck Lever wrote: >>>> On Wed, Nov 20, 2024 at 09:59:54AM +0100, Christian Brauner wrote: >>>>> On Mon, Nov 18, 2024 at 03:58:09PM -0500, Chuck Lever wrote: >>>>>> >>>>>> I've been trying to imagine a solution that does not depend on the >>>>>> range of an integer value and has solidly deterministic behavior in >>>>>> the face of multiple child entry creations during one timer tick. >>>>>> >>>>>> We could partially re-use the legacy cursor/list mechanism. >>>>>> >>>>>> * When a child entry is created, it is added at the end of the >>>>>> parent's d_children list. >>>>>> * When a child entry is unlinked, it is removed from the parent's >>>>>> d_children list. >>>>>> >>>>>> This includes creation and removal of entries due to a rename. >>>>>> >>>>>> >>>>>> * When a directory is opened, mark the current end of the d_children >>>>>> list with a cursor dentry. New entries would then be added to this >>>>>> directory following this cursor dentry in the directory's >>>>>> d_children list. >>>>>> * When a directory is closed, its cursor dentry is removed from the >>>>>> d_children list and freed. >>>>>> >>>>>> Each cursor dentry would need to refer to an opendir instance >>>>>> (using, say, a pointer to the "struct file" for that open) so that >>>>>> multiple cursors in the same directory can reside in its d_chilren >>>>>> list and won't interfere with each other. Re-use the cursor dentry's >>>>>> d_fsdata field for that. >>>>>> >>>>>> >>>>>> * offset_readdir gets its starting entry using the mtree/xarray to >>>>>> map ctx->pos to a dentry. >>>>>> * offset_readdir continues iterating by following the .next pointer >>>>>> in the current dentry's d_child field. >>>>>> * offset_readdir returns EOD when it hits the cursor dentry matching >>>>>> this opendir instance. >>>>>> >>>>>> >>>>>> I think all of these operations could be O(1), but it might require >>>>>> some additional locking. >>>>> >>>>> This would be a bigger refactor of the whole stable offset logic. So >>>>> even if we end up doing that I think we should use the jiffies solution >>>>> for now. >>>> >>>> How should I mark patches so they can be posted for discussion and >>>> never applied? This series is marked RFC. >>> >>> There's no reason to not have it tested. >> >> 1/2 is reasonable to apply. >> >> 2/2 is work-in-progress. So, fair enough, if you are applying >> for testing purposes. >> >> >>> Generally I don't apply RFCs >>> but this code has caused various issues over multiple kernel releases so >>> I like to test this early. >> >> The biggest problem has been that I haven't found an >> authoritative and comprehensive explanation of how >> readdir / getdents needs to behave around renames. >> >> The previous cursor-based solution has always been a "best >> effort" implementation, and most of the other file systems >> have interesting gaps and heuristics in this area. It's >> difficult to piece all of these together to get the idealized >> design requirements, and also a get a sense of where >> compromises can be made. >> >> Any advice/guidance is welcome. > > I didn't mean to make it sound like you did anything wrong or I was > blaming you. I was literally just trying to say we had weird behavior in > this area for legitimate reasons. Posix states this: > > If posix_getdents() is called concurrently with an operation that > adds, deletes, or modifies a directory entry, the results from > posix_getdents() shall reflect either all of the effects of the > concurrent operation or none of them. If a sequence of calls to > posix_getdents() is made that reads from offset zero to end-of-file > and a file is removed from or added to the directory between the > first and last of those calls, whether the sequence of calls returns > an entry for that file is unspecified. > > Which to me all reads like we're pretty much free in what to do as long > as we clearly document it. Agreed, it's that documentation that I'm sorely missing. >>>> I am actually half-way through implementing the approach described >>>> here. It is not as big a re-write as you might think, and addresses >>>> some fundamental misunderstandings in the offset_iterate_dir() code. >>> >>> Ok, great then let's see it. >> >> I'm finishing it up now. Unfortunately I have several other >> (NFSD and not) bugs I'm working through. > > No hurry. What is making this difficult or maybe even impossible is commit da549bdd15c2 ("dentry: switch the lists of children to hlist"). When I last worked on libfs, d_subdirs was a normal list_head. Now, as an hlist, d_children has no tail pointer. It's no longer easy or fast to walk the list of children from oldest to newest, so I'll have to find a solution that continues to use the mtree for iterating through directories. -- Chuck Lever