Re: [PATCH] deadlock with suspend and quotas

Jan Kara <jack@xxxxxxx> · Tue, 29 Nov 2011 11:21:56 +0100

  Sorry for replying once more - forgot to add those CCs - so please reply
to this email...

On Tue 29-11-11 11:19:01, Jan Kara wrote:
>   Hi,
> 
> On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > Hi
> > > 
> > > Where can I get that patch set?
> > > 
> > > We are experiencing other similar deadlocks on RHEL-6, caused by sync or 
> > > background writeback (these code paths take s_umount and wait trying to do 
> > > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel? 
> > > Are there other known deadlock possibilities?
> > 
> > I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write 
> > deadlock" (I couldn't find the next two parts of the patch in the 
> > archives). And the patch looks wrong:
>   Yes, that seems to be the series. I generally agree with you that the
> last iteration still had some problems and some changes were requested.
> That's why it's not merged yet after all...
> 
> > - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not 
> > held when the filesystem is frozen and it is taken for write when thawing. 
> > Consequently, any task can succeed with down_read_trylock(&sb->s_umount) 
> > on a frozen filesystem and if this tasks attempts to do an I/O that is 
> > waiting for thaw, it may still deadlock.
>   Agreed.
> 
> > - skipping sync on frozen filesystem violates sync semantics. 
> > Applications, such as databases, assume that when sync finishes, data were 
> > written to stable storage. If we skip sync when the filesystem is frozen, 
> > we can cause data corruption in these applications (if the system crashes 
> > after we skipped a sync).
>   Here I don't agree. Filesystem must guarantee there are no dirty data on
> a frozen filesystem. Ext4 and XFS do this, ext3 would need proper
> page_mkwrite() implementation for this but that's the problem of ext3, not
> freezing code in general. If there are no dirty data, sync code (and also
> flusher thread) is free to return without doing anything.
> 
> That being said, it is hard to implement freeze handling in page_mkwrite()
> in such a way that there would be no dirty pages (but we know there cannot
> be dirty data in such pages). Currently we mark the page dirty during page
> fault and wait for frozen filesystem only after that so that we are
> guaranteed that either freezing code will wait for page fault to finish and
> will write the page or page fault code notices that freezing is in progress
> and blocks (see fs/buffer.c:block_page_mkwrite()).
> 
> So I believe the consensus was that we should not block sync or flusher
> thread on frozen filesystem. Firstly, it's kind of ugly from user
> perspective (you cannot sync filesystems on your system while one
> filesystem is frozen???), secondly, in case of flusher thread it has some
> serious implications if there are more filesystems on the same device - you
> would effectively stop any writeback to the device possibly hanging the
> whole system due to dirty limit being exceeded. So at least in these two
> cases we should just ignore frozen filesystem.
> 
> > - I'm not sure what userspace quota subsystem will do if we start 
> > returning -EBUSY spuriously.
>   Quota tools will complain to the user which would be fine I think. But
> blocking is fine as well. In this particular case I don't care much but it
> should be consistent with what happens to sync. So probably Q_SYNC command
> should ignore frozen filesystem, Q_SETQUOTA or Q_SETINFO should block.
> 
> > There is another thing --- I wasn't able to reproduce these sync-related 
> > deadlocks at all. Did anyone succeeded in reproducing them? Are there any 
> > reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel 
> > prevents creating new dirty data, syncs all data, and freezes the 
> > filesystem. Consequently, the sync function never finds any dirty data and 
> > so it doesn't block (sync doesn't writeback ATIME change, I don't know 
> > why).
>   See above why sync can actually spot some dirty inodes/page (although
> there is not any dirty data). Surbhi (added to CC) from Canonical could
> actually trigger these races and consequent deadlocks (and I belive some of
> their customers as well). Also some RH customers were hitting these
> deadlocks (Eric Sandeen was handling those bugs AFAIK) but those were made
> happy by my changes to block_page_mkwrite() which made the race window much
> narrower.
> 
> > I made this patch that fixes the quota issue and possible sync issues. It 
> > introduces a function down_read_s_umount(sb); --- this function takes 
> > s_umount lock for read, but it waits if the filesystem is suspended.
> > 
> > There are three more potentially unsafe users: fs/cachefiles/interface.c, 
> > fs/nilfs2/ioctl.c, fs/ubifs/budget.c - I didn't fix them because I can't 
> > test them.
> > 
> > Mikulas
> > 
> > ---
> > 
> > Fix a s_umount deadlock
> > 
> > The lock s_umount is taken for write and dropped when we freeze and resume
> > a block device.
> > 
> > Consequently, if some code holds s_umount and performs an I/O, a deadlock may
> > happen if the device is suspended.
> > 
> > Unfortunatelly, it seems that developers are not aware of this restriction
> > that I/O must not be peformed with s_umount held.
> > 
> > These deadlocks were observed:
> > * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> >   ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> >   call_rwsem_down_write_failed (the process is trying to resume frozen device, 
> >   but it is waiting trying to acquire s_umount)
> > * quota: sys_quotactl -> do_quotactl -> vfs_get_dqblk -> dqget ->
> >   ext4_acquire_dquot -> ext4_journal_start_sb -> jbd2_journal_start ->
> >   start_this_handle (the process is holding s_umount and trying to perform
> >   a transaction, waiting for the device being unfrozen)
> > 
> > * iozone process: sys_sync -> sync_filesystems -> __sync_filesystem ->
> >   writeback_inodes_sb -> writeback_inodes_sb_nr -> wait_for_completion
> >   (the process is holding s_umount for read and trying to perform some I/O)
> > * flush-253:3: kthread -> bdi_start_fn -> bdi_writeback_task ->
> >   wb_do_writeback -> wb_writeback -> writeback_sb_inodes ->
> >   writeback_single_inode -> do_writepages -> ext4_da_writepages ->
> >   ext4_journal_start_sb -> jbd2_journal_start -> start_this_handle
> >   (holding s_umount for read too, acquired in writeback_inodes_wb,
> >   and trying to perform I/O)
> > * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> >   ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> >   call_rwsem_down_write_failed (just like in a previous case: trying to
> >   resume frozen device, waiting on s_umount)
> > 
> > This patch fixes these observed code paths:
> > * introduce a function to safely take s_unlock - down_read_s_umount. It takes
> >   the lock, check if a filesystem is frozen, if it is, drops the lock and
> >   waits for unfreeze.
> > * get_super function has a parameter, if the parameter is true, it waits for
> >   the device to unfreeze (it fixes quota deadlock and a possible deadlock in
> >   fsync_bdev and __invalidate_device)
> > * the same for iterate_supers (it fixes the sync deadlock and a possible
> >   deadlock in drop_caches_sysctl_handler)
> > * grab_super_passive fails if the device is suspended (fixes the inode writeback
> >   deadlock and a possible deadlock in prune_super)
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> > CC: stable@xxxxxxxxxx
> > 
> > ---
> >  fs/block_dev.c           |    6 +++---
> >  fs/buffer.c              |    2 +-
> >  fs/drop_caches.c         |    2 +-
> >  fs/fs-writeback.c        |    8 ++++++++
> >  fs/quota/quota.c         |    4 ++--
> >  fs/super.c               |   29 ++++++++++++++++++++---------
> >  fs/sync.c                |    4 ++--
> >  include/linux/fs.h       |   28 +++++++++++++++++++++++++---
> >  security/selinux/hooks.c |    2 +-
> >  9 files changed, 63 insertions(+), 22 deletions(-)
> > 
> > Index: linux-3.2-rc3-fast/fs/quota/quota.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/quota/quota.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/quota/quota.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -59,7 +59,7 @@ static int quota_sync_all(int type)
> >  		return -EINVAL;
> >  	ret = security_quotactl(Q_SYNC, type, 0, NULL);
> >  	if (!ret)
> > -		iterate_supers(quota_sync_one, &type);
> > +		iterate_supers(quota_sync_one, &type, true);
> >  	return ret;
> >  }
> >  
> > @@ -310,7 +310,7 @@ static struct super_block *quotactl_bloc
> >  	putname(tmp);
> >  	if (IS_ERR(bdev))
> >  		return ERR_CAST(bdev);
> > -	sb = get_super(bdev);
> > +	sb = get_super(bdev, true);
> >  	bdput(bdev);
> >  	if (!sb)
> >  		return ERR_PTR(-ENODEV);
> > Index: linux-3.2-rc3-fast/include/linux/fs.h
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/include/linux/fs.h	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/include/linux/fs.h	2011-11-25 05:56:03.000000000 +0100
> > @@ -10,6 +10,7 @@
> >  #include <linux/ioctl.h>
> >  #include <linux/blk_types.h>
> >  #include <linux/types.h>
> > +#include <linux/sched.h>
> >  
> >  /*
> >   * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> > @@ -1502,6 +1503,26 @@ enum {
> >  	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
> >  
> >  /*
> > + * Take s_umount and make sure that the filesystem is not frozen.
> > + * This function must be used if we intend to perform any I/O while
> > + * holding s_umount.
> > + *
> > + * s_umount is taken for write when resuming a frozen filesystem,
> > + * thus if someone calls down_read(&s->s_umount) and performs any I/O,
> > + * it may deadlock.
> > + */
> > +static inline void down_read_s_umount(struct super_block *sb)
> > +{
> > +retry:
> > +	down_read(&sb->s_umount);
> > +	if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > +		up_write(&sb->s_umount);
> > +		vfs_check_frozen(sb, SB_FREEZE_WRITE);
> > +		goto retry;
> > +	}
> > +}
> > +
> > +/*
> >   * until VFS tracks user namespaces for inodes, just make all files
> >   * belong to init_user_ns
> >   */
> > @@ -2528,13 +2549,14 @@ extern int generic_block_fiemap(struct i
> >  extern void get_filesystem(struct file_system_type *fs);
> >  extern void put_filesystem(struct file_system_type *fs);
> >  extern struct file_system_type *get_fs_type(const char *name);
> > -extern struct super_block *get_super(struct block_device *);
> > +extern struct super_block *get_super(struct block_device *, bool);
> >  extern struct super_block *get_active_super(struct block_device *bdev);
> >  extern struct super_block *user_get_super(dev_t);
> >  extern void drop_super(struct super_block *sb);
> > -extern void iterate_supers(void (*)(struct super_block *, void *), void *);
> > +extern void iterate_supers(void (*)(struct super_block *, void *), void *, bool);
> >  extern void iterate_supers_type(struct file_system_type *,
> > -			        void (*)(struct super_block *, void *), void *);
> > +			        void (*)(struct super_block *, void *), void *,
> > +				bool);
> >  
> >  extern int dcache_dir_open(struct inode *, struct file *);
> >  extern int dcache_dir_close(struct inode *, struct file *);
> > Index: linux-3.2-rc3-fast/fs/buffer.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/buffer.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/buffer.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -571,7 +571,7 @@ static void do_thaw_one(struct super_blo
> >  
> >  static void do_thaw_all(struct work_struct *work)
> >  {
> > -	iterate_supers(do_thaw_one, NULL);
> > +	iterate_supers(do_thaw_one, NULL, false);
> >  	kfree(work);
> >  	printk(KERN_WARNING "Emergency Thaw complete\n");
> >  }
> > Index: linux-3.2-rc3-fast/fs/super.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/super.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/super.c	2011-11-29 00:16:01.000000000 +0100
> > @@ -337,7 +337,7 @@ bool grab_super_passive(struct super_blo
> >  	spin_unlock(&sb_lock);
> >  
> >  	if (down_read_trylock(&sb->s_umount)) {
> > -		if (sb->s_root)
> > +		if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> >  			return true;
> >  		up_read(&sb->s_umount);
> >  	}
> > @@ -503,7 +503,7 @@ void sync_supers(void)
> >  			sb->s_count++;
> >  			spin_unlock(&sb_lock);
> >  
> > -			down_read(&sb->s_umount);
> > +			down_read_s_umount(sb);
> >  			if (sb->s_root && sb->s_dirt)
> >  				sb->s_op->write_super(sb);
> >  			up_read(&sb->s_umount);
> > @@ -527,7 +527,8 @@ void sync_supers(void)
> >   *	Scans the superblock list and calls given function, passing it
> >   *	locked superblock and given argument.
> >   */
> > -void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> > +void iterate_supers(void (*f)(struct super_block *, void *), void *arg,
> > +		    bool wait_for_unfreeze)
> >  {
> >  	struct super_block *sb, *p = NULL;
> >  
> > @@ -538,7 +539,10 @@ void iterate_supers(void (*f)(struct sup
> >  		sb->s_count++;
> >  		spin_unlock(&sb_lock);
> >  
> > -		down_read(&sb->s_umount);
> > +		if (!wait_for_unfreeze)
> > +			down_read(&sb->s_umount);
> > +		else
> > +			down_read_s_umount(sb);
> >  		if (sb->s_root)
> >  			f(sb, arg);
> >  		up_read(&sb->s_umount);
> > @@ -563,7 +567,8 @@ void iterate_supers(void (*f)(struct sup
> >   *	locked superblock and given argument.
> >   */
> >  void iterate_supers_type(struct file_system_type *type,
> > -	void (*f)(struct super_block *, void *), void *arg)
> > +	void (*f)(struct super_block *, void *), void *arg,
> > +	bool wait_for_unfreeze)
> >  {
> >  	struct super_block *sb, *p = NULL;
> >  
> > @@ -572,7 +577,10 @@ void iterate_supers_type(struct file_sys
> >  		sb->s_count++;
> >  		spin_unlock(&sb_lock);
> >  
> > -		down_read(&sb->s_umount);
> > +		if (!wait_for_unfreeze)
> > +			down_read(&sb->s_umount);
> > +		else
> > +			down_read_s_umount(sb);
> >  		if (sb->s_root)
> >  			f(sb, arg);
> >  		up_read(&sb->s_umount);
> > @@ -597,7 +605,7 @@ EXPORT_SYMBOL(iterate_supers_type);
> >   *	mounted on the device given. %NULL is returned if no match is found.
> >   */
> >  
> > -struct super_block *get_super(struct block_device *bdev)
> > +struct super_block *get_super(struct block_device *bdev, bool wait_for_unfreeze)
> >  {
> >  	struct super_block *sb;
> >  
> > @@ -612,7 +620,10 @@ rescan:
> >  		if (sb->s_bdev == bdev) {
> >  			sb->s_count++;
> >  			spin_unlock(&sb_lock);
> > -			down_read(&sb->s_umount);
> > +			if (wait_for_unfreeze)
> > +				down_read_s_umount(sb);
> > +			else
> > +				down_read(&sb->s_umount);
> >  			/* still alive? */
> >  			if (sb->s_root)
> >  				return sb;
> > @@ -672,7 +683,7 @@ rescan:
> >  		if (sb->s_dev ==  dev) {
> >  			sb->s_count++;
> >  			spin_unlock(&sb_lock);
> > -			down_read(&sb->s_umount);
> > +			down_read_s_umount(sb);
> >  			/* still alive? */
> >  			if (sb->s_root)
> >  				return sb;
> > Index: linux-3.2-rc3-fast/fs/sync.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/sync.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/sync.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -89,7 +89,7 @@ static void sync_one_sb(struct super_blo
> >   */
> >  static void sync_filesystems(int wait)
> >  {
> > -	iterate_supers(sync_one_sb, &wait);
> > +	iterate_supers(sync_one_sb, &wait, true);
> >  }
> >  
> >  /*
> > @@ -144,7 +144,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
> >  		return -EBADF;
> >  	sb = file->f_dentry->d_sb;
> >  
> > -	down_read(&sb->s_umount);
> > +	down_read_s_umount(sb);
> >  	ret = sync_filesystem(sb);
> >  	up_read(&sb->s_umount);
> >  
> > Index: linux-3.2-rc3-fast/fs/block_dev.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/block_dev.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/block_dev.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -222,7 +222,7 @@ EXPORT_SYMBOL(sync_blockdev);
> >   */
> >  int fsync_bdev(struct block_device *bdev)
> >  {
> > -	struct super_block *sb = get_super(bdev);
> > +	struct super_block *sb = get_super(bdev, true);
> >  	if (sb) {
> >  		int res = sync_filesystem(sb);
> >  		drop_super(sb);
> > @@ -256,7 +256,7 @@ struct super_block *freeze_bdev(struct b
> >  		 * to freeze_bdev grab an active reference and only the last
> >  		 * thaw_bdev drops it.
> >  		 */
> > -		sb = get_super(bdev);
> > +		sb = get_super(bdev, false);
> >  		drop_super(sb);
> >  		mutex_unlock(&bdev->bd_fsfreeze_mutex);
> >  		return sb;
> > @@ -1658,7 +1658,7 @@ EXPORT_SYMBOL(lookup_bdev);
> >  
> >  int __invalidate_device(struct block_device *bdev, bool kill_dirty)
> >  {
> > -	struct super_block *sb = get_super(bdev);
> > +	struct super_block *sb = get_super(bdev, true);
> >  	int res = 0;
> >  
> >  	if (sb) {
> > Index: linux-3.2-rc3-fast/fs/drop_caches.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/drop_caches.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/drop_caches.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -59,7 +59,7 @@ int drop_caches_sysctl_handler(ctl_table
> >  		return ret;
> >  	if (write) {
> >  		if (sysctl_drop_caches & 1)
> > -			iterate_supers(drop_pagecache_sb, NULL);
> > +			iterate_supers(drop_pagecache_sb, NULL, true);
> >  		if (sysctl_drop_caches & 2)
> >  			drop_slab();
> >  	}
> > Index: linux-3.2-rc3-fast/security/selinux/hooks.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/security/selinux/hooks.c	2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/security/selinux/hooks.c	2011-11-25 05:51:36.000000000 +0100
> > @@ -5693,7 +5693,7 @@ void selinux_complete_init(void)
> >  
> >  	/* Set up any superblocks initialized prior to the policy load. */
> >  	printk(KERN_DEBUG "SELinux:  Setting up existing superblocks.\n");
> > -	iterate_supers(delayed_superblock_init, NULL);
> > +	iterate_supers(delayed_superblock_init, NULL, true);
> >  }
> >  
> >  /* SELinux requires early initialization in order to label
> > Index: linux-3.2-rc3-fast/fs/fs-writeback.c
> > ===================================================================
> > --- linux-3.2-rc3-fast.orig/fs/fs-writeback.c	2011-11-29 00:09:30.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/fs-writeback.c	2011-11-29 00:12:49.000000000 +0100
> > @@ -1273,6 +1273,10 @@ int writeback_inodes_sb_if_idle(struct s
> >  {
> >  	if (!writeback_in_progress(sb->s_bdi)) {
> >  		down_read(&sb->s_umount);
> > +		if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > +			up_read(&sb->s_umount);
> > +			return 0;
> > +		}
> >  		writeback_inodes_sb(sb, reason);
> >  		up_read(&sb->s_umount);
> >  		return 1;
> > @@ -1295,6 +1299,10 @@ int writeback_inodes_sb_nr_if_idle(struc
> >  {
> >  	if (!writeback_in_progress(sb->s_bdi)) {
> >  		down_read(&sb->s_umount);
> > +		if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > +			up_read(&sb->s_umount);
> > +			return 0;
> > +		}
> >  		writeback_inodes_sb_nr(sb, nr, reason);
> >  		up_read(&sb->s_umount);
> >  		return 1;
> -- 
> Jan Kara <jack@xxxxxxx>
> SUSE Labs, CR
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel