On Wed, Apr 06, 2016 at 07:22:49PM +1000, Dave Chinner wrote:
> Hi folks,
>
> There is a problem that RHEL QE tripped over on a long-running
> fsstress test on RHEL 6.6. Brian did all the hard work of working
> out the initial cause of the GPF that was being tripped, but had
> not yet got past how to fix the issues around xfs_free_inode. The
> code is the same as the current upstream code, so the problem
> still exists....
>
> The first patch fixes an obvious (now!) bug in xfs_iflush_cluster
> where it checks the wrong inode for validity after the lookup. It
> still kind-of works, because the correct inode number is used for
> the "are we still in the right cluster" check, so it's not quite a
> hole the size of a truck. Still, something that should not have
> slipped through 6 years ago and not been discovered until now...
>
> The most important patch (#4) addresses the use-after-free issues
> that the xfs inode has w.r.t. RCU freeing and the lookup that
> xfs_iflush_cluster is doing under the rcu_read_lock. All the
> structures accessed under the RCU context need to be freed after
> the current RCU grace period expires, as RCU lookups may attempt
> to access them at any time during the grace period. Hence we have
> to move those frees into the RCU callback so that we don't free
> the structures prematurely.
>
> The rest of the patches are defensive in that they make
> xfs_iflush_cluster only act on relevant inodes and guarantee
> detection of inodes that are in the process of being freed. While
> these aren't absolutely necessary, it seems silly to ignore these
> obvious issues while I'm fixing up other problems with the same
> code.
>
> There's some more detail on the fixes in the commit descriptions.
>
> Brian, I've only run this through xfstests, so I have no real idea
> if it fixes the problem fsstress has uncovered. AIUI it takes 3 or
> 4 days to reproduce the issue, so this is kind of a pre-emptive
> strike on what I think is the underlying issue based on your
> description and commentary. I figured having code to explain the
> problems would save some time while you sleep....
>
> Comments, thoughts, testing and flames all welcome...
>

Thanks... the series looks mostly clean and logical to me on a first
pass through. I've reviewed the first three patches, which look
straightforward, and I'll follow up on the last couple once I've run
through testing and had a closer look.

To migrate some of the discussion over to the list... what I'm
testing now is an XFS_IRECLAIM flag check (and inode skip) on each
inode looked up in the cluster flush path. This has run clean for the
better part of yesterday through overnight. While the test requires a
few days to verify, the hacked-up kernel I've been testing with
usually detects this a bit sooner (within 6-8 hours or so), so I'm
fairly confident that test at least corners the problem.

The concern raised with the XFS_IRECLAIM check is performance
(skipping an inode flush when we don't need to). I'm not totally
convinced that's much of a problem considering how rare this
situation is: the best/only reproducer we have takes a couple of days
on a debug kernel, for a bug that's been present for several years.
My concern with that approach was more that both codepaths use
trylocks or checks that cause the reclaim or the flush to "back off,"
and I didn't want to create a situation where neither path would
actually do its job (potentially repeatedly) for a particular inode.
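For reference, the skip I've been testing looks roughly like the
fragment below, inside the rcu_read_lock()'d radix tree lookup loop
in xfs_iflush_cluster(). This is paraphrased from memory rather than
pasted from the actual diff, so the local names (ilist, nr_found, iq)
may not match the tree exactly:

	for (i = 0; i < nr_found; i++) {
		struct xfs_inode	*iq = ilist[i];

		if (iq == ip)
			continue;

		/*
		 * Reclaim already owns this inode; skip it rather
		 * than race with xfs_reclaim_inode() tearing it down.
		 */
		if (xfs_iflags_test(iq, XFS_IRECLAIM))
			continue;

		/* ... existing same-cluster and flush lock checks ... */
	}

The idea is simply that once XFS_IRECLAIM is set, the inode is
reclaim's problem and the cluster flush leaves it alone.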
So that approach seemed cleaner than the alternative I was
considering, but the thought of pushing the i_itemp deallocation out
to the RCU callback also crossed my mind later yesterday, after
writing up my last report on the bug. This series appears to move all
of the deallocation (fork destruction included) to the RCU callback,
which seems reasonable to me as well (rough sketch of my reading of
that change at the end of this mail). I think relying on RCU is
probably the cleanest/most correct approach so far, so I'll kill the
currently running test, pull these patches back to the RHEL kernel
and see how that goes.

Note that I haven't even attempted to reproduce this on an upstream
kernel, simply because it's so difficult to reproduce; it's not even
consistently reproducible with the original kernel on different
hardware. I agree that the problem likely exists upstream, but I'm
going to continue to use the original kernel for verification because
that hw/kernel combination has provided a consistent reproducer...

Brian

> -Dave.
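P.S. For reference, my rough reading of where patch 4 ends up: all of
the inode teardown happens in the RCU callback rather than before
call_rcu(). The fragment below is my paraphrase from memory of the
upstream code, not the actual diff, so take the exact function and
zone names with a grain of salt:

static void
xfs_inode_free_callback(
	struct rcu_head		*head)
{
	struct inode		*inode = container_of(head, struct inode, i_rcu);
	struct xfs_inode	*ip = XFS_I(inode);

	/*
	 * Fork and inode log item teardown now happens here, after
	 * the RCU grace period has expired, so an rcu_read_lock()
	 * side lookup (e.g. the radix tree walk in
	 * xfs_iflush_cluster) can no longer trip over memory freed
	 * out from underneath it. The existing "which forks does this
	 * inode have" checks are elided from this sketch.
	 */
	xfs_idestroy_fork(ip, XFS_DATA_FORK);
	if (ip->i_afp)
		xfs_idestroy_fork(ip, XFS_ATTR_FORK);
	if (ip->i_itemp)
		xfs_inode_item_destroy(ip);

	kmem_zone_free(xfs_inode_zone, ip);
}

void
xfs_inode_free(
	struct xfs_inode	*ip)
{
	/* ... existing bookkeeping omitted ... */

	/* Defer all freeing until after the current RCU grace period. */
	call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
}

The key point is that nothing the rcu_read_lock() side can reach is
freed before the grace period expires.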