Re: [PATCH] fs: only sync() superblocks reachable from the current namespace

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Fri, 26 Jan 2018 23:13:51 +0000

On Fri, Jan 26, 2018 at 02:58:39PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@xxxxxx>
> 
> Currently, the sync() syscall is system-wide, so any process in a
> container can cause significant I/O stalls across the system by calling
> sync(). This is even true for filesystems which are not accessible in
> the process' mount namespace. This patch scopes sync() to only write out
> filesystems reachable in the current mount namespace, except for the
> initial mount namespace, which still syncs everything to avoid
> surprises. This fixes the broken isolation we were seeing here.

> +static int sb_reachable(struct super_block *sb, struct mnt_namespace *mnt_ns)
> +{
> +	struct mount *mnt;
> +
> +	if (!mnt_ns)
> +		return 1;
> +
> +	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> +		if (mnt->mnt_ns == mnt_ns)
> +			return 1;
> +	}
> +	return 0;
> +}

Erm...  And just what is protecting the list here?
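For illustration only, here is a userspace model of what the question is getting at: the mount list is walked only while holding the same lock that writers take when they add or remove entries. The structures, the `mounts_lock` mutex and the `sb_add_mount()` helper are hypothetical stand-ins, not kernel code. Which kernel lock actually guards `sb->s_mounts` is an assumption this sketch does not try to answer.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the structures used by the patch. */
struct mnt_namespace { int id; };

struct mount {
	struct mnt_namespace *mnt_ns;	/* namespace this mount belongs to */
	struct mount *next;		/* next mount of the same superblock */
};

struct super_block {
	pthread_mutex_t mounts_lock;	/* assumed lock guarding the mount list */
	struct mount *mounts;		/* all mounts of this superblock */
};

/* Reader: walk the list only while holding the writers' lock, so a
 * concurrent mount/umount cannot unlink or free an entry under us. */
static int sb_reachable(struct super_block *sb, struct mnt_namespace *ns)
{
	int found = 0;

	if (!ns)
		return 1;	/* NULL filter: initial namespace, sync everything */

	pthread_mutex_lock(&sb->mounts_lock);
	for (struct mount *m = sb->mounts; m; m = m->next) {
		if (m->mnt_ns == ns) {
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&sb->mounts_lock);
	return found;
}

/* Writer: list changes must take the same lock as the reader above. */
static void sb_add_mount(struct super_block *sb, struct mount *m)
{
	pthread_mutex_lock(&sb->mounts_lock);
	m->next = sb->mounts;
	sb->mounts = m;
	pthread_mutex_unlock(&sb->mounts_lock);
}
```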

>  static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
> @@ -107,12 +138,18 @@ static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
>   */
>  SYSCALL_DEFINE0(sync)
>  {
> -	int nowait = 0, wait = 1;
> +	struct sb_sync arg = {
> +		.mnt_ns = current->nsproxy->mnt_ns,
> +	};
> +
> +	if (arg.mnt_ns == init_task.nsproxy->mnt_ns)
> +		arg.mnt_ns = NULL;
>  
>  	wakeup_flusher_threads(WB_REASON_SYNC);
> -	iterate_supers(sync_inodes_one_sb, NULL);
> -	iterate_supers(sync_fs_one_sb, &nowait);
> -	iterate_supers(sync_fs_one_sb, &wait);
> +	iterate_supers(sync_inodes_one_sb, &arg);
> +	iterate_supers(sync_fs_one_sb, &arg);
> +	arg.wait = 1;
> +	iterate_supers(sync_fs_one_sb, &arg);

So now sync() walks lists whose combined length is O(total vfsmounts on the
system), no matter what, *and* when a lazy-unmounted filesystem is held active
by an open file, sync(2) won't touch that filesystem at all, unless it's called
from the magical namespace init(8) happens to run in.
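Both objections can be seen in a small userspace model. It rests on two assumptions: a lazily unmounted mount is represented as one that no longer belongs to any namespace (`mnt_ns == NULL`), and a NULL filter stands for the initial namespace, as in the patch. `model_sync()` and the `steps` counter are hypothetical. The counter only shows that each call walks every superblock's mount list.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins, not kernel structures. */
struct mnt_namespace { int id; };
struct mount { struct mnt_namespace *mnt_ns; struct mount *next; };
struct super_block { const char *name; struct mount *mounts; int dirty; };

static size_t steps;	/* list entries visited across all calls */

/* Same filter as the patch: NULL namespace means "reach everything". */
static int sb_reachable(const struct super_block *sb,
			const struct mnt_namespace *ns)
{
	if (!ns)
		return 1;
	for (const struct mount *m = sb->mounts; m; m = m->next) {
		steps++;	/* cost grows with the number of mounts */
		if (m->mnt_ns == ns)
			return 1;
	}
	return 0;
}

/* Model of one filtered pass: "write out" (clear dirty on) every
 * superblock reachable from ns; return how many were written. */
static int model_sync(struct super_block *sbs, size_t n,
		      const struct mnt_namespace *ns)
{
	int synced = 0;

	for (size_t i = 0; i < n; i++) {
		if (!sb_reachable(&sbs[i], ns))
			continue;
		sbs[i].dirty = 0;
		synced++;
	}
	return synced;
}
```

Called with a container's namespace, the model skips the dirty, lazily unmounted superblock; only the NULL (init namespace) filter reaches it. Since the patched syscall makes three iterate_supers() passes, the list walk would be repeated on each pass.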