On Sat, 1 Aug 2015, Linus Torvalds wrote:
> On Sat, Aug 1, 2015 at 9:06 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > (I don't actually understand why the clearing of DCACHE_ENTRY_TYPE in
> > dentry_iput() is not of continuing concern; but don't worry, there's
> > plenty I don't understand - so long as you're both satisfied that
> > it's not a concern, no need to persuade me.)
>
> So dentry_iput() is only called as the dentry is being thrown away,
> and is stale.
>
> Yes, such a stale dentry can be seen by an RCU lookup, but the RCU
> lookups should always revalidate things after the lookup, so it
> shouldn't matter. The problem here was that there was a missing
> revalidate of the RCU lookup for an error case, so the error that
> _should_ have been a harmless race that got handled later by the
> proper validation instead turned into a real user-visible error.

Thank you both for leading me through that: I really should have
rechecked the sequence count invalidation in the source for myself
(I had a wrong picture of it in my head), before inserting that
parenthesis and taking your time over it; but had been in a hurry
to get a response back.

>
> But we didn't use to clear the flags in dentry_iput, so before things
> generally "happened to work" anyway, because this rare error case
> didn't actually ever trigger in the first place.
>
> (And I still don't think we necessarily *should* clear the flags in
> dentry_iput(), but it really shouldn't be a correctness issue)
>
> > Do we have any idea why a bug introduced in v3.13 should only now
> > stand out, both for Dominique and for me? Has the RCU lookup somehow
> > become much more effective recently?
>
> So I do think that the clearing of the dentry flags exposed a
> situation that was harder to hit before.

Right, that does indeed make sense of why it appeared now.
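For anyone following along, the "revalidate after the lookup" rule
discussed above can be modelled in userspace roughly as below. This is
only a toy sketch loosely patterned on a sequence-count read/retry
scheme, not kernel code: struct toy_dentry, TOY_TYPE_DIR, the toy_*()
helpers and the simulate_race switch are all hypothetical names made up
for illustration, and it uses C11 atomics with default ordering rather
than the kernel's own primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag value standing in for the entry-type bits. */
#define TOY_TYPE_DIR 0x1u

/* Hypothetical stand-in for a dentry: a sequence count plus the flags it guards. */
struct toy_dentry {
    atomic_uint seq;     /* even: stable, odd: teardown in progress */
    unsigned int flags;
};

/* Reader: sample the count before looking at the protected fields. */
static unsigned int toy_read_begin(struct toy_dentry *d)
{
    return atomic_load(&d->seq);
}

/* Reader: true if anything read since 'start' must not be trusted. */
static bool toy_read_retry(struct toy_dentry *d, unsigned int start)
{
    return (start & 1u) || atomic_load(&d->seq) != start;
}

/* Writer: throw the entry away, bumping the count around the change. */
static void toy_kill(struct toy_dentry *d)
{
    unsigned int before = atomic_fetch_add(&d->seq, 1); /* now odd */

    assert((before & 1u) == 0);  /* this sketch assumes a single writer */
    d->flags = 0;                /* the flag clearing under discussion */
    atomic_fetch_add(&d->seq, 1);                       /* even again */
}

/*
 * Lockless lookup. Returns 0 and stores the flags on success, or -1 to
 * tell the caller to fall back to a slower, locked lookup. The point is
 * that nothing derived from 'snapshot' (including an error) is reported
 * until the count has been rechecked.
 */
static int toy_lookup_flags(struct toy_dentry *d, bool simulate_race,
                            unsigned int *out)
{
    unsigned int start = toy_read_begin(d);
    unsigned int snapshot = d->flags;

    if (simulate_race)
        toy_kill(d);  /* stands in for a concurrent teardown */

    if (toy_read_retry(d, start))
        return -1;    /* stale entry seen: a harmless race, retry elsewhere */

    *out = snapshot;
    return 0;
}
```

With simulate_race false, the lookup succeeds and hands back the flags;
with it true, the recheck notices the count has moved and the cleared
flags never reach the caller. Skipping that recheck on an error path is
the kind of hole described above: a cleared-flags snapshot would be
turned straight into a user-visible error.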

I cannot actually report success from yesterday's testing, since it
hung after 20 hours for, I believe, the same unrelated reason that
I ran into before. I mentioned jbd2 last time, but I doubt that's
at fault: it's almost certainly an issue with recent vmscan changes
and/or recent loop changes - the business of page reclaim waiting on
page writeback has always been tricky and fragile and deadlock-prone,
the more so when loop is involved: probably the balance has got
shifted slightly by recent changes, I'll look into it (but definitely
not rc5 material).

Thanks,
Hugh