On Tue 11-08-15 15:16:26, Oleg Nesterov wrote:
> On 08/11, Dave Chinner wrote:
> >
> > On Mon, Aug 10, 2015 at 04:59:42PM +0200, Jan Kara wrote:
> > >
> > > One would like to construct the lock chain as:
> > >
> > > CPU0 (chown foo dir)    CPU1 (readdir dir)      CPU2 (page fault)
> > > process Y               process X, thread 0     process X, thread 1
> > >
> > >                         get ILOCK for dir
> > > gets freeze protection
> > > starts transaction in xfs_setattr_nonsize
> > > waits to get ILOCK on 'dir'
> > >                                                 get mmap_sem for X
> > >                         wait for mmap_sem for process X
> > >                           in filldir()
> > >                                                 wait for freeze protection in
> > >                                                   xfs_page_mkwrite
> > >
> > > and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
> > > finish its freeze-protected section. But this cannot happen. The reason is
> > > that we block writers level-by-level and thus while there are writers at
> > > level X, we do not block writers at level X+1. So in this particular case
> > > freeze_super() will block waiting for CPU0 to finish its freeze protected
> > > section while CPU2 is free to continue.
> > >
> > > In general we have a chain like
> > >
> > > freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
> > >                 A                                          |
> > >                 \------------------------------------------/
> > >
> > > But since ILOCK is always acquired with freeze protection at L0 and we can
> > > block at L1 only after there are no writers at L0, this loop can never
> > > happen.
> > >
> > > Note that if we use the property of freezing that lock at level X+1 cannot
> > > block when we hold lock at level X, we can as well simplify the dependency
> > > graph and track in it only the lowest level of freeze lock that is
> > > currently acquired (since the levels above it cannot block and do not in
> > > any way influence blocking of other processes either and thus are
> > > irrelevant for the purpose of deadlock detection).
> > > Then the dependency graph we'd get would be:
> > >
> > > freeze L0 -> ILOCK -> mmap_sem -> freeze L1
> > >
> > > and we have a nice acyclic graph we like to see... So probably we have to
> > > hack the lockdep instrumentation some more and just don't tell lockdep
> > > about freeze locks at higher levels if we already hold a lock at lower
> > > level. Thoughts?
> >
> > The XFS directory ilock->filldir->might_fault locking path has been
> > generating false positives in quite a lot of places because of
> > things we do on one side of the mmap_sem in filesystem paths vs
> > things we do on the other side of the mmap_sem in the page fault
> > path.
>
> OK. Dave, Jan, thanks a lot.
>
> I was also confused because I didn't know that "Chain exists of" part
> of print_circular_bug() only prints the _partial_ chain, and I have
> to admit that I do not even understand which part it actually shows...
>
> I'll drop
>
>	move rwsem_release() from sb_wait_write() to freeze_super()
>	change thaw_super() to re-acquire s_writers.lock_map
>
> from the previous series and resend everything. Let's change sb_writers to
> use percpu_rw_semaphore first, then try to improve the lockdep annotations.

Yeah, that sounds like a good plan.

> See the interdiff below. With this change I have
>
>	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
>	./check `grep -il freeze tests/*/???`
>
>	...
>
>	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
>	Passed all 7 tests
>
> anything else I should test?

The diff looks good and if these tests pass without a warning then we can
be reasonably confident things are fine.

> this needs a comment in sb_wait_write() to explain that this is not what
> we want.

Yup, would be nice.
Honza

> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1215,27 +1215,15 @@ EXPORT_SYMBOL(__sb_start_write);
>  static void sb_wait_write(struct super_block *sb, int level)
>  {
>  	percpu_down_write(sb->s_writers.rw_sem + level-1);
> +	percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
>  }
>
> -static void sb_freeze_release(struct super_block *sb)
> -{
> -	int level;
> -	/* Avoid the warning from lockdep_sys_exit() */
> -	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> -		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> -}
> -
> -static void sb_freeze_acquire(struct super_block *sb)
> +static void sb_freeze_unlock(struct super_block *sb)
>  {
>  	int level;
>
>  	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
>  		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> -}
> -
> -static void sb_freeze_unlock(struct super_block *sb)
> -{
> -	int level;
>
>  	for (level = SB_FREEZE_LEVELS; --level >= 0; )
>  		percpu_up_write(sb->s_writers.rw_sem + level);
> @@ -1331,7 +1319,6 @@ int freeze_super(struct super_block *sb)
>  	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
>  	 */
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> -	sb_freeze_release(sb);
>  	up_write(&sb->s_umount);
>  	return 0;
>  }
> @@ -1358,14 +1345,11 @@ int thaw_super(struct super_block *sb)
>  		goto out;
>  	}
>
> -	sb_freeze_acquire(sb);
> -
>  	if (sb->s_op->unfreeze_fs) {
>  		error = sb->s_op->unfreeze_fs(sb);
>  		if (error) {
>  			printk(KERN_ERR
>  				"VFS:Filesystem thaw failed\n");
> -			sb_freeze_release(sb);
>  			up_write(&sb->s_umount);
>  			return error;
>  		}
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
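
For illustration only, the two dependency graphs discussed in the quoted text can be compared with a small userspace sketch. This is not kernel code and not how lockdep is implemented: the enum names, struct dep_graph and all helpers below are made up for this example. It only shows that the full chain (with the mmap_sem -> freeze L1 edge) contains a cycle, while the graph that tracks just the lowest held freeze level does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical node names for this sketch; these are not kernel identifiers. */
enum lock_node { FREEZE_L0, FREEZE_L1, FREEZE_L2, ILOCK, MMAP_SEM, NR_NODES };

struct dep_graph {
	/* edge[a][b] is true when b is acquired while a is held */
	bool edge[NR_NODES][NR_NODES];
};

static void add_dep(struct dep_graph *g, enum lock_node held, enum lock_node next)
{
	assert(held < NR_NODES && next < NR_NODES);
	g->edge[held][next] = true;
}

/* Depth-first search; state 0 = unvisited, 1 = on the current path, 2 = done. */
static bool visit(const struct dep_graph *g, int node, int *state)
{
	state[node] = 1;
	for (int next = 0; next < NR_NODES; next++) {
		if (!g->edge[node][next])
			continue;
		if (state[next] == 1)
			return true;	/* back edge found, so there is a cycle */
		if (state[next] == 0 && visit(g, next, state))
			return true;
	}
	state[node] = 2;
	return false;
}

static bool has_cycle(const struct dep_graph *g)
{
	int state[NR_NODES] = { 0 };

	for (int node = 0; node < NR_NODES; node++)
		if (state[node] == 0 && visit(g, node, state))
			return true;
	return false;
}

/* freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem -> freeze L1 */
static struct dep_graph full_chain(void)
{
	struct dep_graph g;

	memset(&g, 0, sizeof(g));
	add_dep(&g, FREEZE_L0, FREEZE_L1);
	add_dep(&g, FREEZE_L1, FREEZE_L2);
	add_dep(&g, FREEZE_L2, ILOCK);
	add_dep(&g, ILOCK, MMAP_SEM);
	add_dep(&g, MMAP_SEM, FREEZE_L1);
	return g;
}

/* freeze L0 -> ILOCK -> mmap_sem -> freeze L1 (lowest held level only) */
static struct dep_graph lowest_level_only(void)
{
	struct dep_graph g;

	memset(&g, 0, sizeof(g));
	add_dep(&g, FREEZE_L0, ILOCK);
	add_dep(&g, ILOCK, MMAP_SEM);
	add_dep(&g, MMAP_SEM, FREEZE_L1);
	return g;
}
```

Calling has_cycle() on the result of full_chain() reports a cycle, which corresponds to the false positive; calling it on lowest_level_only() reports none, which is the acyclic graph the quoted proposal is after.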