On Fri, Jan 26, 2018 at 02:58:39PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@xxxxxx>
>
> Currently, the sync() syscall is system-wide, so any process in a
> container can cause significant I/O stalls across the system by calling
> sync(). This is even true for filesystems which are not accessible in
> the process' mount namespace. This patch scopes sync() to only write out
> filesystems reachable in the current mount namespace, except for the
> initial mount namespace, which still syncs everything to avoid
> surprises. This fixes the broken isolation we were seeing here.

> +static int sb_reachable(struct super_block *sb, struct mnt_namespace *mnt_ns)
> +{
> +	struct mount *mnt;
> +
> +	if (!mnt_ns)
> +		return 1;
> +
> +	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> +		if (mnt->mnt_ns == mnt_ns)
> +			return 1;
> +	}
> +	return 0;
> +}

Erm... And just what is protecting the list here?

>  static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
> @@ -107,12 +138,18 @@ static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
>   */
>  SYSCALL_DEFINE0(sync)
>  {
> -	int nowait = 0, wait = 1;
> +	struct sb_sync arg = {
> +		.mnt_ns = current->nsproxy->mnt_ns,
> +	};
> +
> +	if (arg.mnt_ns == init_task.nsproxy->mnt_ns)
> +		arg.mnt_ns = NULL;
>
>  	wakeup_flusher_threads(WB_REASON_SYNC);
> -	iterate_supers(sync_inodes_one_sb, NULL);
> -	iterate_supers(sync_fs_one_sb, &nowait);
> -	iterate_supers(sync_fs_one_sb, &wait);
> +	iterate_supers(sync_inodes_one_sb, &arg);
> +	iterate_supers(sync_fs_one_sb, &arg);
> +	arg.wait = 1;
> +	iterate_supers(sync_fs_one_sb, &arg);

So now sync() includes O(total vfsmounts on the system) walking the lists,
no matter what *and* in a situation when a lazy-unmounted filesystem is held
active by an opened file sync(2) won't touch that filesystem.  Unless done
in the magical namespace init(8) happens to run in.
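
To make the lazy-unmount case concrete, here is a rough userspace model of
the reachability check. It is only a sketch: the structs are simplified
stand-ins rather than the kernel definitions, the singly linked `next`
pointer replaces the real list linkage, and lazy_unmount() and
reachable_after_lazy_unmount() are hypothetical helpers. It assumes that a
detached mount no longer points at any namespace while an open file keeps it
on the superblock's list. It says nothing about locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption, not real layouts). */
struct mnt_namespace { int id; };

struct mount {
	struct mnt_namespace *mnt_ns;	/* NULL once detached from its namespace */
	struct mount *next;		/* stand-in for the mnt_instance linkage */
};

struct super_block {
	struct mount *s_mounts;		/* stand-in for sb->s_mounts */
};

/* Same decision logic as the quoted sb_reachable(); NULL means "sync everything". */
static bool sb_reachable(const struct super_block *sb,
			 const struct mnt_namespace *mnt_ns)
{
	if (!mnt_ns)
		return true;

	for (const struct mount *m = sb->s_mounts; m; m = m->next) {
		if (m->mnt_ns == mnt_ns)
			return true;
	}
	return false;
}

/* Hypothetical model of umount -l: the mount leaves its namespace but
 * stays on the superblock's list because an open file still pins it. */
static void lazy_unmount(struct mount *m)
{
	m->mnt_ns = NULL;
}

/* Returns whether a container still reaches the superblock after it
 * lazily unmounted the only mount of that filesystem. */
static bool reachable_after_lazy_unmount(void)
{
	struct mnt_namespace container = { 1 };
	struct mount m = { &container, NULL };
	struct super_block sb = { &m };

	assert(sb_reachable(&sb, &container));	/* mounted: gets synced */
	lazy_unmount(&m);
	return sb_reachable(&sb, &container);	/* detached: skipped */
}
```

In this model the filesystem is still active and may still have dirty data,
yet a sync() from the container's namespace would skip it; only the NULL
("initial namespace") case still covers it.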