On Wed 10-01-24 16:41:46, Amir Goldstein wrote: > On Wed, Jan 10, 2024 at 3:56 PM Jan Kara <jack@xxxxxxx> wrote: > > On Tue 09-01-24 21:48:18, Amir Goldstein wrote: > > > Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type") > > > optimized the case where there are no fsnotify watchers on any of the > > > filesystem's objects. > > > > > > It is quite common for a system to have a single local filesystem and > > > it is quite common for the system to have some inotify watches on some > > > config files or directories, so the optimization of no marks at all is > > > often not in effect. > > > > I agree. > > > > > Access event watchers are far less common, so optimizing the case of > > > no marks with access events could improve performance for more systems, > > > especially for the performance sensitive hot io path. > > > > > > Maintain a per-sb counter of objects that have marks with access > > > events in their mask and use that counter to optimize out the call to > > > fsnotify() in fsnotify access hooks. > > > > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx> > > > > I'm not saying no to this but let's discuss first before hacking in some > > partial solution :). AFAIU what Jens sees is something similar as was > > reported for example here [1]. In these cases we go through > > fsnotify_parent() down to fsnotify() until the check: > > > > if (!(test_mask & marks_mask)) > > return 0; > > > > there. And this is all relatively cheap (pure fetches like > > sb->s_fsnotify_marks, mnt->mnt_fsnotify_marks, inode->i_fsnotify_marks, > > sb->s_fsnotify_mask, mnt->mnt_fsnotify_mask, inode->i_fsnotify_mask) except > > for parent handling in __fsnotify_parent(). That requires multiple atomic > > operations - take_dentry_name_snapshot() is lock & unlock of d_lock, dget() > > is cmpxchg on d_lockref, dput() is another cmpxchg on d_lockref - and this > > gets really expensive, even more so if multiple threads race for the same > > parent dentry. > > Sorry, I forgot to link the Jens regression report [1] which included this > partial perf report: > > 3.36% [kernel.vmlinux] [k] fsnotify > 2.32% [kernel.vmlinux] [k] __fsnotify_paren > > Your analysis about __fsnotify_parent() may be correct, but what would be > the explanation to time spent in fsnotify() in this report? OK, I don't have a good explanation for why fsnotify() itself is so high in Jens' profile. Maybe cacheline fetches caused by inode->i_fsnotify_mask and inode->i_fsnotify_marks checks are causing the overhead but it would be good to have it confirmed. > In general, I think that previous optimization work as this commit by Mel: > 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead when > there is no watcher") showed improvements when checks were inlined. Well, I believe that commit helped exactly because the check for DCACHE_FSNOTIFY_PARENT_WATCHED was moved ahead of all the work now in __fsnotify_parent(). > [1] https://lore.kernel.org/linux-fsdevel/53682ece-f0e7-48de-9a1c-879ee34b0449@xxxxxxxxx/ > > > So I think what we ideally need is a way to avoid this expensive "stabilize > > parent & its name" game unless we are pretty sure we are going to generate > > the event. There's no way to avoid fetching the parent early if > > dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED (we can still postpone the > > take_dentry_name_snapshot() cost though). However for the case where > > dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED == 0 (and this is the case > > here AFAICT) we need the parent only after the check of 'test_mask & > > marks_mask'. So in that case we could completely avoid the extra cost of > > parent dentry handling. > > > > So in principle I believe we could make fsnotify noticeably lighter for the > > case where no event is generated. Just the question is how to refactor the > > current set of functions to achieve this without creating an unmaintainable > > mess. I suspect if we lifted creation & some prefilling of iter_info into > > __fsnotify_parent() and then fill in the parent in case we need it for > > reporting only in fsnotify(), the code could be reasonably readable. We'd > > need to always propagate the dentry down to fsnotify() though, currently we > > often propagate only the inode because the dentry may be (still in case of > > create or already in case of unlink) negative. Similarly we'd somehow need > > to communicate down into fsnotify() whether the passed name needs > > snapshotting or not... > > > > What do you think? > > I am not saying no ;) > but it sound a bit complicated so if the goal is to reduce the overhead > of fsnotify_access() and fsnotify_perm(), which I don't think any application > cares about, then I'd rather go with a much simpler solution even if it > does not cover all the corner cases. OK, let's figure out what exactly causes slowdown in Jens' case first. I agree your solution helps mitigate the cost of fsnotify_access() for reads but I forsee people complaining about fsnotify_modify() cost for writes in short order :) and there it is not so simple to solve as there's likely some watch for FS_MODIFY event somewhere. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR