On Sat, Sep 14, 2019 at 03:57:15PM -0700, Linus Torvalds wrote: > But it is also possible that we could avoid some of that O(n**2) > behavior by simply not adding the corsor to the end of the dentry > child list at all. Right now your patch *always* sets the cursor at a > valid point - even if it's at the end of the directory. But we could > skip the "end of the directory" case entirely and just set a flag in > the file for "at eof" instead. Yeah. > That way the cursor at least wouldn't exist for the common cases when > we return to user space (at the beginning of the readdir and at the > end). Which might make the cursors simply not be quite as common, even > when you have a _lot_ of concurrent readdir() users. > > There may be other similar things we could do to minimize the pressure > on the parent dentry lock. For example, maybe we could insert a cursor > only _between_ readdir() calls, but then in the readdir() loop itself, > we know we hold the inode lock shared for the directory, so during > _that_ loop we can use the existing positive dentries that we keep a > refcount to as cursors. > > Because if the only reason to not use existing dentry pointers as > cursors is the concurrent rename() issue, then _that_ is certainly > something that the parent inode shared lock should protect against, > even if the child dentry list can change in other ways). > > So if the main 'reaim' problem was that the dentry child list itself > grew due to the cursor pressure (and that is likely) and that in turn > then caused the O(n**2) bad locking behavior due to having to walk > much longer dentry child chains, then I suspect that we can do a few > tweaks to simply not make that happen in practice. > > Yes, I realize that I'm handwaving a bit on the two above suggestions, > but don't you think that sounds doable? I think I have a stronger solution. It includes the "cursors past the EOF are marked", but that's not all. Look: the obvious problem with adding an hlist of cursors anchored in dentry is that we don't want to blow dentry size. OK, but... we have d_subdirs/d_child. And the same crawl through the tree has shown exactly one place where we do anything to the end of the list - right in dcache_readdir(). So we could bloody well turn that thing into hlist. Voila - a pointer saved. So, fuck using struct dentry for cursors. What we need in there is * pointer to struct dentry we would be about to read * hlist_node linking them together * indicator of pointing before / after the entire directory contents. * some way to get to dentry of directory and/or struct file. Which can be done in much smaller space than struct dentry; details of representation really don't matter. d_subdirs/d_child become an hlist_head/hlist_node list; no cursors in there at any time. d_cursors is a new hlist_head, anchoring the set of cursor that point to this sucker. The list is protected by ->d_lock of that dentry. d_cursors being non-empty contributes 1 to d_count. dcache_readdir()/dcache_dir_lseek() have exclusion with simple_rename() on parent's inode. On per-cursor basis they have exclusion with each other (already; ->f_lock_pos). dcache_dir_close() locks directory shared, giving it exclusion with simple_rename(); per-cursor exclusion is, of course, already there. Under that rwsem it removes the cursor from the hlist it's on (if any), using ->d_lock of whatever it's point on for protection. If that was the last cursor in hlist, drop the target after removal. In any case, cursor gets freed. In simple_rename(): if there are cursors grab the next sibling (if any) take ->d_lock on source rip the list out of it drop ->d_lock on source if there's no sibling go through the list point cursors post-EOF, dissolving hlist else go through the list point cursors to the sibling find the last element of the list, while we are at it if the sibling's list hadn't been empty to start with drop our reference to sibling take ->d_lock on the sibling splice the list in drop ->d_lock on sibling drop the reference to source (from now-empty ->d_cursors) Note that it's O(1) under any ->d_lock and O(cursors moved) without holding any ->d_lock. walk_cursor(cursor, count): while count && cursor is not post-EOF drop_old = NULL if cursor points to anything take ->d_lock on the target remove cursor from hlist if target's ->d_cursor is now empty drop_old = target drop ->d_lock on the target p = target's ->d_child else p = parent's ->d_subdirs take ->d_lock on directory drop_new = NULL eof = true for d in list, starting after p if d is positive && !--count || need_resched eof = false drop_new = dget(d) point cursor to d take ->d_lock on d if its ->d_cursors is empty drop_new = NULL insert cursor into its ->d_cursors drop ->d_lock on d break drop ->d_lock on directory dput(drop_new) dput(drop_old) if eof mark cursor post-EOF else if count cond_resched() dcache_dir_lseek(), "changing position" case: walk_cursor(cursor, where) dcache_readdir() loop: while cursor not post-EOF if dir_emit the cursor's target fails break walk_cursor(cursor, 1) ctx->pos++ dentry_unlink(): none of the cursor shit in the list, TYVM, and we can't be called with cursors attached - d_count would've been positive. What it should, AFAICS, give: * no loops under ->d_lock in dentry_unlist() * no cursors polluting the lists * dentry size unchanged * cursors considerably smaller than now * no looping for hell knows how long under ->d_lock And it might be possible to be clever enough to get lockless walk_cursor() (lockless in terms of directory ->d_lock, that is), but even without that this scheme looks interesting, IMO. I haven't even started trying to implement that, so I might very well have missed obvious problems. Comments?