On Sun, 17 Jul 2011, Linus Torvalds wrote: > On Sun, Jul 17, 2011 at 2:03 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote: > > > > That -ENOENT in walk_component: isn't it assuming we found a negative > > dentry, before reaching the read_seqcount_retry which complete_walk > > (or nameidata_drop_rcu_last before 3.0) would use to confirm a successful > > lookup? > > Hmm. I think you're right. The ENOENT will basically short-circuit the > full proper checks. > > > And can't memory pressure prune a dentry, coming to dentry_kill > > which __d_drops to unhash before dentry_iput resets d_inode to NULL, but > > the dentry_rcuwalk_barrier between those is ineffective if the other end > > ignores the seqcount? > > Yes. However, looking at it, I'm not very happy with your patch. It > doesn't really make sense to me to special-case the NULL inode and > only do a seq_retry for that case. > > I kind of see why you do it for that particular bug, but at the same > time, it just makes me go "Eww". If that inode isn't NULL yet, you > then return the dentry that can get a NULL d_inode later. So the only > special thing there is that we happen to check for a NULL inode there. > What protects *later* checks for a NULL d_inode? I was imagining that all the later uses of the inode were using walk_component()'s local struct inode *inode, or nd->inode which it sets on success. Until complete_walk(), or the next level down of lookup, has validated that stage by checking nd->seq. If any does dereference dentry->d_inode in between, then it would already be oopsing in this situation, which I've not seen. > > So my gut feel is that we should instead > > - either remove the -ENOENT return at that point entirely, and move > it to after we have re-verified the dentry lookup for other reasons. > That looks pretty involved, though, and those paths do end up > accessing inode data structures etc, so it looks less than trivial. > > OR > > - simply just not clear d_inode at all in d_kill(), so that when we > prune a dentry due to memory pressure, it doesn't actually change the > state of the dentry. But whether my imagining was right or wrong, your -1 line patch looks a much nicer solution. > > and I think the second solution is the right one. It's kind of odd: > we'll have called down to the iput() routine, and the inode will be > "gone", but that is already true for the *normal* race of actually > deleting a file too, and we have that whole "inodes are RCU-released", > so the inode allocation will still exist. Yes, the inodes are RCU-released too, so that side of it fits okay. > > So my gut feel is that we should instead just do this: > > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -187,7 +187,6 @@ static void dentry_iput(struct dentry * dentry) > { > struct inode *inode = dentry->d_inode; > if (inode) { > - dentry->d_inode = NULL; > list_del_init(&dentry->d_alias); > spin_unlock(&dentry->d_lock); > spin_unlock(&inode->i_lock); > > and see what the fall-out from that would be. Nobody should then *use* > the stale inode, because __d_drop has done that > dentry_rcuwalk_barrier(). So we avoid the NULL inode special case > entirely. > > Comments? At first it looked worrying to interfere with the sequence "inode = dentry->d_inode; if (inode) { dentry->d_inode = NULL;" but seeing as dentry_iput() is only called from the one place, I think the test is merely about negative dentries, and setting d_inode to NULL is just long-standing tidiness, nothing vital. > > The above (whitespace-damaged) patch may look trivial, but it is > *entirely* untested, and maybe my gut feel that the above is the right > way to solve the problem is just wrong. > > Al, any reactions? Hugh, does the above patch work for your > stress-test case? Or, indeed, at all? Well, my stress tests don't blow up in the first half hour. But I was not successful in re-reproducing the issue (I got yesterday in 2.5 hours) within 7 hours, so it's probably going to take days to be sure that your minus-1-liner fixes it. I certainly like your patch, though I'm not quite as confident with it as I was with mine (until you pointed out its inconsistency). Just a nagging doubt that leaving d_inode set may be wrong for somewhere. Hugh