Oh, Al, I saw you mentioned the reoder issue at https://groups.google.com/g/syzkaller/c/0SW33jMcrXQ/m/lHJUsWHVBwAJ and with a follow-up patch https://gist.githubusercontent.com/dvyukov/67fe363d5ce2e2b06c71/raw/4d1b6c23f8dff7e0f8e2e3cab7e50208fddb0570/gistfile1.txt However, it looks it would not work in some previous version https://github.com/torvalds/linux/blob/v4.16/fs/dcache.c#L496 Because we set `d_hahs.prev` to `NULL` at ``` void __d_drop(struct dentry *dentry) { ___d_drop(dentry); dentry->d_hash.pprev = NULL; } ``` then in `dentry_unlink_inode`, `raw_write_seqcount_begin` would be skipped due to `if (hashed)` condition is false. ``` static void dentry_unlink_inode(struct dentry * dentry) __releases(dentry->d_lock) __releases(dentry->d_inode->i_lock) { struct inode *inode = dentry->d_inode; bool hashed = !d_unhashed(dentry); if (hashed) raw_write_seqcount_begin(&dentry->d_seq); <--- Looks would skip because hashed is false. __d_clear_type_and_inode(dentry); hlist_del_init(&dentry->d_u.d_alias); if (hashed) raw_write_seqcount_end(&dentry->d_seq); <--- Looks would skip because hashed is false. ... ``` Should we backport https://github.com/torvalds/linux/commit/4c0d7cd5c8416b1ef41534d19163cb07ffaa03ab and https://github.com/torvalds/linux/commit/0632a9ac7bc0a32f8251a53b3925775f0a7c4da6 to previous versions? On Mon, May 3, 2021 at 11:31 PM haosdent <haosdent@xxxxxxxxx> wrote: > > Hi, Alexander, thanks a lot for your detailed explanations. I take a > look at the code again and the thread and I realize there are some > incorrect representations in my previous word. > > >> struct inode *d_inode = 0x0 <======= d_inode is NULL and cause Oops! > > Actually it is Oops at `inode->i_flags` directly instead of `d_inode`, > so the code would not further to change the access time. > > ``` > bool __atime_needs_update(const struct path *path, struct inode *inode, > bool rcu) > { > struct vfsmount *mnt = path->mnt; > struct timespec now; > > if (inode->i_flags & S_NOATIME) <======= Oops at here according to > `[19521409.372016] IP: __atime_needs_update+0x5/0x190`. > return false; > ``` > > But it looks impossible if we take a look at "walk_component > lookup_fast". > > Let me introduce why it goes through "walk_component > lookup_fast" > instead of "walk_component > lookup_slow" first, > > in "walk_component > step_into > pick_link", the code would set > `nameidata->stack->seq` to `seq` which is comes from the passed in > parameters. > If the code goes through "walk_component > lookup_slow", `seq` would > be 0 and then `nameidata->stack->seq` would be 0. > If the code goes through "walk_component > lookup_slow", `seq` would > be `dentry->d_seq->sequence`. > According to the contents in attachment files "nameidata.txt" and > "dentry.txt", `dentry->d_seq->sequence` is 4, and > `nameidata->stack->seq` is 4 as well. > So looks like the code goes through "walk_component > lookup_fast" and > "walk_component > step_into > pick_link". > > The `inode` parameter in `__atime_needs_update` comes from > `nameidata->link_inode`. But in attachment file "nameidata.txt", we > could found `nameidata->link_inode` is NULL already. > Because the code goes through "walk_component > lookup_fast" and > "walk_component > step_into > pick_link", the `inode` assign to > `nameidata->link_inode` must comes from `lookup_fast`. > > So it looks like something wrong in `lookup_fast`. Let me continue to > explain why this looks impossible. > > In `walk_component`, `lookup_fast` have to return 1 (> 0), otherwise > it would fallback to `lookup_slow`. > > ``` > err = lookup_fast(nd, &path, &inode, &seq); > if (unlikely(err <= 0)) { > if (err < 0) > return err; > path.dentry = lookup_slow(&nd->last, nd->path.dentry, > nd->flags); > } > > return step_into > ``` > > Because for our case, it looks like the code goes through > "walk_component > lookup_fast" and "walk_component > step_into > > pick_link", > This infers `lookup_fast` return 1 in this Oops. > > Because `lookup_fast` return 1, it looks like the code goes through > the following path. > > ``` > if (nd->flags & LOOKUP_RCU) { > ... > > *inode = d_backing_inode(dentry); > negative = d_is_negative(dentry); > > ... > ... > if (negative) > return -ENOENT; > path->mnt = mnt; > path->dentry = dentry; > if (likely(__follow_mount_rcu(nd, path, inode, seqp))) > return 1; > ``` > > As we see, if `*inode` is NULL, it should return `-ENOENT` because `if > (negative)` would be true, which is conflict with "`lookup_fast` > return 1". > > And in `__d_clear_type_and_inode`, it always sets the dentry to > negative first and then sets d_inode to NULL. > > ``` > static inline void __d_clear_type_and_inode(struct dentry *dentry) > {` > unsigned flags = READ_ONCE(dentry->d_flags); > > flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); // Set dentry to > negative first. > WRITE_ONCE(dentry->d_flags, flags); > // memory barrier > dentry->d_inode = NULL; > // Then set d_inode to NULL. > } > ``` > > So looks like `inode` in `lookup_fast` should not be NULL if it could > skip `if (negative)` check even in the RCU case. Unless > > ``` > # in lookup_fast method > *inode = d_backing_inode(dentry); > negative = d_is_negative(dentry); > ``` > > is reorder to > > ``` > # in lookup_fast method > negative = d_is_negative(dentry); > *inode = d_backing_inode(dentry); > ``` > > when CPU executing the code. But is this possible in RCU? > > I diff my local ubuntu's code with upstream tag v4.15.18, it looks > like no different in `fs/namei.c`, `fs/dcache.c`, `fs/proc`. > So possible the problem may happen to upstream tag v4.15.18 as well, > sadly my script still could not reproduce the issue on the server so > far, > would like to see if any insights from you then I could continue to > check what's wrong in this Oops, thank you in advance! > > On Tue, Apr 27, 2021 at 1:30 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > > > On Tue, Apr 27, 2021 at 01:16:44AM +0800, haosdent wrote: > > > > really should not assume ->d_inode stable > > > > > > Hi, Alexander, sorry to disturb you again. Today I try to check what > > > `dentry->d_inode` and `nd->link_inode` looks like when `dentry` is > > > already been killed in `__dentry_kill`. > > > > > > ``` > > > nd->last.name: net/sockstat, dentry->d_lockref.count: -128, > > > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966 > > > nd->last.name: net/sockstat, dentry->d_lockref.count: -128, > > > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966 > > > nd->last.name: net/sockstat, dentry->d_lockref.count: -128, > > > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966 > > > ``` > > > > > > It looks like `dentry->d_inode` could be NULL while `nd->link_inode` > > > is always has value. > > > But this make me confuse, by right `nd->link_inode` is get from > > > `dentry->d_inode`, right? > > > > It's sampled from there, yes. And in RCU mode there's nothing to > > prevent a previously positive dentry from getting negative and/or > > killed. ->link_inode (used to - it's gone these days) go with > > ->seq, which had been sampled from dentry->d_seq before fetching > > ->d_inode and then verified to have ->d_seq remain unchanged. > > That gives you "dentry used to have this inode at the time it > > had this d_seq", and that's what gets used to validate the sucker > > when we switch to non-RCU mode (look at legitimize_links()). > > > > IOW, we know that > > * at some point during the pathwalk that sucker had this inode > > * the inode won't get freed until we drop out of RCU mode > > * if we need to go to non-RCU (and thus grab dentry references) > > while we still need that inode, we will verify that nothing has happened > > to that link (same ->d_seq, so it still refers to the same inode) and > > grab dentry reference, making sure it won't go away or become negative > > under us. Or we'll fail (in case something _has_ happened to dentry) > > and repeat the entire thing in non-RCU mode. > > > > -- > Best Regards, > Haosdent Huang -- Best Regards, Haosdent Huang