Hi, Alexander, thanks a lot for your detailed explanations. I take a look at the code again and the thread and I realize there are some incorrect representations in my previous word. >> struct inode *d_inode = 0x0 <======= d_inode is NULL and cause Oops! Actually it is Oops at `inode->i_flags` directly instead of `d_inode`, so the code would not further to change the access time. ``` bool __atime_needs_update(const struct path *path, struct inode *inode, bool rcu) { struct vfsmount *mnt = path->mnt; struct timespec now; if (inode->i_flags & S_NOATIME) <======= Oops at here according to `[19521409.372016] IP: __atime_needs_update+0x5/0x190`. return false; ``` But it looks impossible if we take a look at "walk_component > lookup_fast". Let me introduce why it goes through "walk_component > lookup_fast" instead of "walk_component > lookup_slow" first, in "walk_component > step_into > pick_link", the code would set `nameidata->stack->seq` to `seq` which is comes from the passed in parameters. If the code goes through "walk_component > lookup_slow", `seq` would be 0 and then `nameidata->stack->seq` would be 0. If the code goes through "walk_component > lookup_slow", `seq` would be `dentry->d_seq->sequence`. According to the contents in attachment files "nameidata.txt" and "dentry.txt", `dentry->d_seq->sequence` is 4, and `nameidata->stack->seq` is 4 as well. So looks like the code goes through "walk_component > lookup_fast" and "walk_component > step_into > pick_link". The `inode` parameter in `__atime_needs_update` comes from `nameidata->link_inode`. But in attachment file "nameidata.txt", we could found `nameidata->link_inode` is NULL already. Because the code goes through "walk_component > lookup_fast" and "walk_component > step_into > pick_link", the `inode` assign to `nameidata->link_inode` must comes from `lookup_fast`. So it looks like something wrong in `lookup_fast`. Let me continue to explain why this looks impossible. In `walk_component`, `lookup_fast` have to return 1 (> 0), otherwise it would fallback to `lookup_slow`. ``` err = lookup_fast(nd, &path, &inode, &seq); if (unlikely(err <= 0)) { if (err < 0) return err; path.dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags); } return step_into ``` Because for our case, it looks like the code goes through "walk_component > lookup_fast" and "walk_component > step_into > pick_link", This infers `lookup_fast` return 1 in this Oops. Because `lookup_fast` return 1, it looks like the code goes through the following path. ``` if (nd->flags & LOOKUP_RCU) { ... *inode = d_backing_inode(dentry); negative = d_is_negative(dentry); ... ... if (negative) return -ENOENT; path->mnt = mnt; path->dentry = dentry; if (likely(__follow_mount_rcu(nd, path, inode, seqp))) return 1; ``` As we see, if `*inode` is NULL, it should return `-ENOENT` because `if (negative)` would be true, which is conflict with "`lookup_fast` return 1". And in `__d_clear_type_and_inode`, it always sets the dentry to negative first and then sets d_inode to NULL. ``` static inline void __d_clear_type_and_inode(struct dentry *dentry) {` unsigned flags = READ_ONCE(dentry->d_flags); flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); // Set dentry to negative first. WRITE_ONCE(dentry->d_flags, flags); // memory barrier dentry->d_inode = NULL; // Then set d_inode to NULL. } ``` So looks like `inode` in `lookup_fast` should not be NULL if it could skip `if (negative)` check even in the RCU case. Unless ``` # in lookup_fast method *inode = d_backing_inode(dentry); negative = d_is_negative(dentry); ``` is reorder to ``` # in lookup_fast method negative = d_is_negative(dentry); *inode = d_backing_inode(dentry); ``` when CPU executing the code. But is this possible in RCU? I diff my local ubuntu's code with upstream tag v4.15.18, it looks like no different in `fs/namei.c`, `fs/dcache.c`, `fs/proc`. So possible the problem may happen to upstream tag v4.15.18 as well, sadly my script still could not reproduce the issue on the server so far, would like to see if any insights from you then I could continue to check what's wrong in this Oops, thank you in advance! On Tue, Apr 27, 2021 at 1:30 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > On Tue, Apr 27, 2021 at 01:16:44AM +0800, haosdent wrote: > > > really should not assume ->d_inode stable > > > > Hi, Alexander, sorry to disturb you again. Today I try to check what > > `dentry->d_inode` and `nd->link_inode` looks like when `dentry` is > > already been killed in `__dentry_kill`. > > > > ``` > > nd->last.name: net/sockstat, dentry->d_lockref.count: -128, > > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966 > > nd->last.name: net/sockstat, dentry->d_lockref.count: -128, > > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966 > > nd->last.name: net/sockstat, dentry->d_lockref.count: -128, > > dentry->d_inode: (nil), nd->link_inode: 0xffffffffab299966 > > ``` > > > > It looks like `dentry->d_inode` could be NULL while `nd->link_inode` > > is always has value. > > But this make me confuse, by right `nd->link_inode` is get from > > `dentry->d_inode`, right? > > It's sampled from there, yes. And in RCU mode there's nothing to > prevent a previously positive dentry from getting negative and/or > killed. ->link_inode (used to - it's gone these days) go with > ->seq, which had been sampled from dentry->d_seq before fetching > ->d_inode and then verified to have ->d_seq remain unchanged. > That gives you "dentry used to have this inode at the time it > had this d_seq", and that's what gets used to validate the sucker > when we switch to non-RCU mode (look at legitimize_links()). > > IOW, we know that > * at some point during the pathwalk that sucker had this inode > * the inode won't get freed until we drop out of RCU mode > * if we need to go to non-RCU (and thus grab dentry references) > while we still need that inode, we will verify that nothing has happened > to that link (same ->d_seq, so it still refers to the same inode) and > grab dentry reference, making sure it won't go away or become negative > under us. Or we'll fail (in case something _has_ happened to dentry) > and repeat the entire thing in non-RCU mode. -- Best Regards, Haosdent Huang