On Mon, Jul 18, 2011 at 12:47 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > Huh? We do __d_drop() in there, and do that before we start messing > with ->d_inode. Hmm. Yes, looking at it, the ordering all seems correct. But then what did Hugh see at all? The inode thing he got from d_inode is re-verified by __d_lookup_rcu(). So if inode is NULL, that means that the other CPU has done dentry_iput(), which means that __d_drop has already happened, which means that the dentry has been removed from the hash list *and* the count has been incremented. So just judging by Hugh's thing, something is wrong there. Some missing barrier, perhaps. But write_seqcount_barrier() does seem to have the write barrier, and the __d_lookup_rcu() read barrier check is using the proper "full" read barrier too. So in order for __d_lookup_rcu() to return a NULL inode, we have the following requirements: - it needs to find the dentry (duh) - the dentry sequence count gets read (proper read barrier after this) - the dentry must still look hashed - the dentry->d_inode must be NULL - and the dentry sequence count gets re-read (proper read barrier before this) and must match. And right now I don't see how that can happen, exactly because __d_lookup_rcu does the sequence point check. But that's what Hugh's patch depends on: seeing a NULL d_inode race. If we see d_inode being NULL, that means that the first sequence number read must happen *after* we've set d_inode to NULL in dentry_iput(), which happens *after* we've done the sequence number increment. That part is fine: that means that the sequence numbers will match (because both of them are the "later" one). No inconsistency so far. d_inode being NULL is perfectly compatible with no sequence number change, and that was what I thought was a bug in this area at first. But if the first sequence number read (the read_seqcount_begin() in __d_lookup_rcu()) sees the later sequence number, that means that __d_drop has already happened, and it should *also* see the dentry->d_hash.pprev = NULL; that happened in __d_drop before the dentry_rcuwalk_barrier(). We have both the read barrier in the read_seqcount_begin() and the write barrier in the dentry_rcuwalk_barrier() to guarantee that. But if the __d_lookup_rcu() sees that, then the d_unhashed() should have seen the NULL pprev pointer, and not ever returned the dentry, because that __d_lookup_rcu() loop does if (d_unhashed(dentry)) continue; after reading the sequence count. So how could Hugh's NULL inode ever happen in the first place? Even with the current sources? It all looks solid to me now that I look at all the details. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html