Hi Dave,

On Tue, Mar 11, 2014 at 08:20:27AM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2014 at 03:52:49PM -0500, Ben Myers wrote:
> > Hi Dave,
> >
> > On Mon, Mar 10, 2014 at 01:55:23PM +1100, Dave Chinner wrote:
> > > On Sat, Mar 08, 2014 at 09:58:01PM -0500, Michael L. Semon wrote:
> > > > Hi!  I was playing a shell game with some precious backup data.  It
> > > > went like this:
> > > >
> > > > open a 36-GB source partition (DM linear, partitions from 2 drives,
> > > > v5-superblock XFS)
> > > >
> > > > open a 32-GB aes-xts crypt (v4-superblock XFS)
> > > >
> > > > `cp -av` from holding partition to crypt
> > > >
> > > > It was during the cp operation that I got this multi-CPU version of
> > > > the harmless lockdep report:
> > > >
> > > > =========================================================
> > > > [ INFO: possible irq lock inversion dependency detected ]
> > > > 3.14.0-rc5+ #6 Not tainted
> > > > ---------------------------------------------------------
> > > > kswapd0/25 just changed the state of lock:
> > > >  (&xfs_dir_ilock_class){++++-.}, at: [<791c09fa>] xfs_ilock+0xff/0x16c
> > > > but this lock took another, RECLAIM_FS-unsafe lock in the past:
> > > >  (&mm->mmap_sem){++++++}
> > > >
> > > > and interrupts could create inverse lock ordering between them.
> > > >
> > > >
> > > > other info that might help us debug this:
> > > >  Possible interrupt unsafe locking scenario:
> > > >
> > > >        CPU0                    CPU1
> > > >        ----                    ----
> > > >   lock(&mm->mmap_sem);
> > > >                                local_irq_disable();
> > > >                                lock(&xfs_dir_ilock_class);
> > > >                                lock(&mm->mmap_sem);
> > > >   <Interrupt>
> > > >     lock(&xfs_dir_ilock_class);
> > >
> > > Well, that's rather interesting. I'd love to know where we take the
> > > directory ilock with interrupts disabled - and then take a page
> > > fault... :/
> .....
> > > Realistically, I think that the readdir locking change was
> > > fundamentally broken. Why? Because taking page faults while holding
> > > the ilock violates the lock order we have for inodes.
> > > For regular
> > > files, the lock hierarchy is iolock -> page fault -> ilock, and we
> > > got rid of the need for the iolock in the inode reclaim path because
> > > of all the problems it caused lockdep. Now we've gone and
> > > reintroduced that same set of problems - only worse - by allowing
> > > the readdir code to take page faults while holding the ilock....
> > >
> > > I'd like to revert the change that introduced the ilock into the
> > > readdir path, but I suspect that would be an unpopular direction to
> > > take. However, we need a solution that doesn't drive us insane with
> > > lockdep false positives.
> > >
> > > IOWs, we need to do this differently - readdir needs to be protected
> > > by the iolock, not the ilock, and only the block mapping routines in
> > > the readdir code should be protected by the ilock. i.e. the same
> > > locking strategy that is used for regular files: the iolock protects
> > > the contents of the file, the ilock protects the inode metadata.
> > >
> > > What I think we really need to do is drive the ilock all the way
> > > into the block mapping functions of the directory code and to the
> > > places where the inode metadata is actually going to be modified.
> > > The iolock protects access to the directory data, and logging of
> > > changes to those blocks is serialised by the buffer locks, so we
> > > don't actually need the inode ilock to serialise access to the
> > > directory data - we only need the ilock for modifications to the
> > > inode and its block maps themselves, same as for regular files.
> > >
> > > Changing the directory code to handle this sort of locking is going
> > > to require a bit of surgery. However, I can see advantages to moving
> > > directory data to the same locking strategy as regular file data -
> > > locking hierarchies are identical, directory ilock hold times are
> > > much reduced, we don't get lockdep whining about taking page faults
> > > with the ilock held, etc.
> > >
> > > A quick hack to demonstrate the high-level, initial step of using
> > > the IOLOCK for readdir serialisation. I've done a little smoke
> > > testing on it, so it won't die immediately. It should get rid of all
> > > the nasty lockdep issues, but it doesn't start to address the deeper
> > > restructuring that is needed.
> > >
> > > Thoughts, comments? I'm especially interested in what SGI people
> > > have to say here because this locking change was needed for CXFS,
> > > and changing the directory locking hierarchy is probably going to
> > > upset CXFS a lot...
> >
> > Protecting directory contents with the iolock and pushing the ilock down
> > to protect the directory block mapping functions seems like a good
> > strategy to me.
>
> Ok, it seems like a good strategy, but does it work for CXFS?

Yes, it should work for CXFS.

> What about the other option of a combination of ilock for mapping and
> read, and buffer locks for stabilisation?

I think this option probably won't work because of those cases where you
modify more than one buffer for a single directory modification, e.g. in
a node directory you might modify a buffer each for the data, freeindex,
node, and leaf blocks.

> > In this area I think it's good to treat directories the
> > same as regular files.  I have not had a very close look yet... and
> > couldn't find where we take a fault with irqs disabled.  Maybe
> > something out of a workqueue?
>
> Workqueues don't run with interrupts disabled. Yes, we take inode
> locks in workqueue context (e.g. AIL tail pushing), but that's not
> what is being warned about here. And the page fault only comes from
> the filldir callback which runs copy_to_user(), and that most
> definitely does not occur in interrupt context.
>
> FWIW, it's invalid to take the ilock in interrupt context because
> it's a sleeping lock without using trylock semantics. If you use
> "trylock" semantics, then lockdep doesn't complain about that
> because it can't deadlock.
> I *think* that lockdep is complaining
> about reclaim contexts, not interrupts, and it's reporting these
> "interrupts disabled" deadlocks because the memory reclaim state
> tracking is piggy-backed on the interrupt state tracking within
> lockdep. Either way, though, lockdep is wrong.

Yeah, that sounds plausible.

-Ben

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs