Hi Dave,

On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> On Sat, Mar 08, 2014 at 09:58:01PM -0500, Michael L. Semon wrote:
> > Hi!  I was playing a shell game with some precious backup data.  It
> > went like this:
> >
> > open a 36-GB source partition (DM linear, partitions from 2 drives,
> > v5-superblock XFS)
> >
> > open a 32-GB aes-xts crypt (v4-superblock XFS)
> >
> > `cp -av` from holding partition to crypt
> >
> > It was during the cp operation that I got this multi-CPU version of
> > the harmless lockdep:
> >
> > =========================================================
> > [ INFO: possible irq lock inversion dependency detected ]
> > 3.14.0-rc5+ #6 Not tainted
> > ---------------------------------------------------------
> > kswapd0/25 just changed the state of lock:
> >  (&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
> > but this lock took another, RECLAIM_FS-unsafe lock in the past:
> >  (&mm->mmap_sem){++++++}
> >
> > and interrupts could create inverse lock ordering between them.
> >
> > other info that might help us debug this:
> >  Possible interrupt unsafe locking scenario:
> >
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(&mm->mmap_sem);
> >                                local_irq_disable();
> >                                lock(&xfs_dir_ilock_class);
> >                                lock(&mm->mmap_sem);
> >   <Interrupt>
> >     lock(&xfs_dir_ilock_class);
>
> Well, that's rather interesting. I'd love to know where we take the
> directory ilock with interrupts disabled - and then take a page
> fault... :/
>
> ....
>
> > }
> >  ... key at: [<795f1eec>] __key.44037+0x0/0x8
> >  ... acquired at:
> >    [<79064360>] __lock_acquire+0x397/0xa6c
> >    [<7906510f>] lock_acquire+0x8b/0x101
> >    [<790d1718>] might_fault+0x81/0xa9
> >    [<790f379a>] filldir64+0x92/0xe2
> >    [<7916f079>] xfs_dir2_sf_getdents+0x1a0/0x44c
> >    [<7917009e>] xfs_readdir+0xc4/0x126
> >    [<79171b8b>] xfs_file_readdir+0x25/0x3e
> >    [<790f385a>] iterate_dir+0x70/0x9b
> >    [<790f3a31>] SyS_getdents64+0x6d/0xcc
> >    [<792f85b8>] sysenter_do_call+0x12/0x36
>
> All of these paths are from the syscall path, except for one in
> kswapd context. I don't see any stack trace here that could result
> in the above "deadlock". It looks to be yet another false
> positive...
>
> > I call it harmless because its single-CPU variant can be reproduced
> > as far back as I could bisect in earlier testing (way before
> > kernel 2.6.20).  However, such lockdep splats have never manifested
> > in a noticeable problem on production kernels on x86.  Either
> > down_write_nested or down_read_nested is in all of them, depending
> > on which kernel version is in use.  At least one reclaim-related
> > function is in there as well.  vm_map_ram used to be in there, but
> > Dave took care of that (thanks!).
> >
> > This particular splat has been showing up more often, though.  It's
> > not tied to one particular commit, event, or change; just something
> > that has crept in gradually since maybe kernel 3.11.  It's like
> > watching grass grow or watching paint dry.
>
> The above reports all indicate that taking a page fault while
> holding the directory ilock is problematic. So have all the other
> new lockdep reports we've got that have been caused by that change.
>
> Realistically, I think that the readdir locking change was
> fundamentally broken. Why? Because taking page faults while holding
> the ilock violates the lock order we have for inodes. For regular
> files, the lock hierarchy is iolock -> page fault -> ilock, and we
> got rid of the need for the iolock in the inode reclaim path because
> of all the problems it caused lockdep. Now we've gone and
> reintroduced that same set of problems - only worse - by allowing
> the readdir code to take page faults while holding the ilock....
>
> I'd like to revert the change that introduced the ilock into the
> readdir path, but I suspect that would be an unpopular direction to
> take. However, we need a solution that doesn't drive us insane with
> lockdep false positives.
>
> IOWs, we need to do this differently - readdir needs to be protected
> by the iolock, not the ilock, and only the block mapping routines in
> the readdir code should be protected by the ilock. i.e. the same
> locking strategy that is used for regular files: the iolock protects
> the contents of the file, the ilock protects the inode metadata.
>
> What I think we really need to do is drive the ilock all the way
> into the block mapping functions of the directory code and to the
> places where the inode metadata is actually going to be modified.
> The iolock protects access to the directory data, and logging of
> changes to those blocks is serialised by the buffer locks, so we
> don't actually need the inode ilock to serialise access to the
> directory data - we only need the ilock for modifications to the
> inode and its block maps themselves, same as for regular files.
>
> Changing the directory code to handle this sort of locking is going
> to require a bit of surgery. However, I can see advantages to moving
> directory data to the same locking strategy as regular file data -
> locking hierarchies are identical, directory ilock hold times are
> much reduced, we don't get lockdep whining about taking page faults
> with the ilock held, etc.
>
> A quick hack to demonstrate the high-level, initial step of using
> the IOLOCK for readdir serialisation. I've done a little smoke
> testing on it, so it won't die immediately. It should get rid of all
> the nasty lockdep issues, but it doesn't start to address the deeper
> restructuring that is needed.
>
> Thoughts, comments? I'm especially interested in what SGI people
> have to say here because this locking change was needed for CXFS,
> and changing the directory locking hierarchy is probably going to
> upset CXFS a lot...

Protecting directory contents with the iolock and pushing the ilock
down to protect the directory block mapping functions seems like a
good strategy to me.  In this area I think it's good to treat
directories the same as regular files.

I have not had a very close look yet, and I couldn't find where we
take a fault with irqs disabled.  Maybe something out of a workqueue?

Thanks,
Ben
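
For concreteness, the first step Dave describes above, serialising
readdir with the iolock at the top level, could look something like
the sketch below.  This is not the patch from Dave's mail (which is
not quoted here), just an illustration based loosely on the 3.14-era
xfs_file_readdir(); it assumes xfs_readdir() and the getdents helpers
have been changed to stop taking the ilock themselves.

STATIC int
xfs_file_readdir(
	struct file		*file,
	struct dir_context	*ctx)
{
	struct inode		*inode = file_inode(file);
	struct xfs_inode	*ip = XFS_I(inode);
	size_t			bufsize;
	int			error;

	/*
	 * The VFS doesn't pass the caller's buffer size down, so give
	 * the dir2 code an estimate for sizing its readahead window,
	 * as the existing code does.
	 */
	bufsize = (size_t)min_t(loff_t, 32768, ip->i_d.di_size);

	/*
	 * Serialise access to the directory contents with the iolock,
	 * the same way regular file data is serialised.  The ilock is
	 * then never held across filldir()'s copy to userspace, so the
	 * page faults that copy can take no longer nest inside it.
	 */
	xfs_ilock(ip, XFS_IOLOCK_SHARED);
	error = xfs_readdir(ip, ctx, bufsize);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);

	if (error)
		return -error;	/* pre-3.17 XFS uses positive errnos internally */
	return 0;
}

Holding the iolock across the copy-out is fine because the iolock sits
above the page fault in the existing iolock -> page fault -> ilock
ordering for regular files; it only has to keep the directory contents
stable against concurrent modification.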
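
The deeper restructuring, driving the ilock down into the directory
block mapping code, might then reduce to small wrappers of roughly the
shape below.  The name xfs_dir2_map_blocks() is made up for
illustration; the real mapping is done in functions like
xfs_dir2_leaf_readbuf().

/*
 * Hypothetical helper: hold the ilock only for the duration of the
 * bmap lookup, not across the copy-out to userspace, mirroring how
 * regular file reads map blocks under XFS_ILOCK_SHARED.
 */
STATIC int
xfs_dir2_map_blocks(
	struct xfs_inode	*dp,
	xfs_fileoff_t		bno,
	xfs_filblks_t		len,
	struct xfs_bmbt_irec	*map,
	int			*nmap)
{
	int			error;

	xfs_ilock(dp, XFS_ILOCK_SHARED);
	error = xfs_bmapi_read(dp, bno, len, map, nmap, 0);
	xfs_iunlock(dp, XFS_ILOCK_SHARED);

	return error;
}

With something along those lines, the readdir paths would hold the
iolock for the whole operation and only dip into the ilock for the
moments they actually need the inode's extent map, which is the
"iolock protects file contents, ilock protects inode metadata" split
Dave describes for regular files.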