Re: [POC/RFC PATCH] overlayfs: fix data inconsistency at copy up

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 12, 2016 at 03:33:26PM +0200, Miklos Szeredi wrote:
> This is a proof of concept patch to fix the following.
> 
> /ovl is in overlay mount and /ovl/foo exists on the lower layer only.
> 
>  rofd = open("/ovl/foo", O_RDONLY);
>  rwfd = open("/ovl/foo", O_WRONLY); /* this causes copy up */
>  write(rwfd, "bar", 3);
>  read(rofd, buf, 3); 
>  assert(memcmp(buf, "bar", 3) == 0);
> 
> Similar problem exists with an MAP_SHARED mmap created from rofd.
> 
> While this has only caused few problems (yum/dnf failure is the only one I know
> of) and easily worked around in userspace, many see it as a proof that overlayfs
> can never be a proper "POSIX" filesystem.
> 
> To quell those worries, here's a simple patch that should address the above.
> 
> The only VFS change is that f_op is initialized from f_path.dentry->d_inode
> instead of file_inode(filp) in open.  The effect of this is that overlayfs can
> intercept open and other file operations, while the file still effectively
> belongs to the underlying fs.
> 
> The patch does not give up on the nice properties of overlayfs, like sharing the
> page cache with the underlying files.  It does cause copy up in one case where
> previously there wasn't one and that's the O_RDONLY/MAP_SHARED case.  I haven't
> done much research into this, but running some tests in chroot didn't trigger
> this.
> 
> Comments, testing are welcome.

Hi Miklos, 

This looks like a very interesting idea. In fact once file has been copied
up and writen to, and if I do fstat(rofd), it shows the size of copied up
file but one can read the contents. So fixing that anomaly would be nice.

Hopefully O_RDONLY/MAP_SHARED is not a common case and we get away with
this forced copy up penalty.

[..]
> +static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	bool isupper = OVL_TYPE_UPPER(ovl_path_type(file->f_path.dentry));
> +	ssize_t ret = -EINVAL;
> +
> +	if (likely(!isupper)) {
> +		const struct file_operations *fop = ovl_real_fop(file);
> +
> +		if (likely(fop->read_iter))
> +			ret = fop->read_iter(iocb, to);
> +	} else {
> +		struct file *upperfile = filp_clone_open(file);
> +

IIUC, every read of lower file will call filp_clone_open(). Looking at the
code of filp_clone_open(), I am concerned about the overhead of this call.
Is it significant? Don't want to be paying too much of penalty for read
operation on lower files. That would be a common case for containers.

BTW, I did a quick testing. Using docker launched a fedora container and
called "dnf update" inside that. And later I noticed following on serial
console.

Thanks
Vivek

[  309.075885] ======================================================
[  309.076841] [ INFO: possible circular locking dependency detected ]
[  309.077818] 4.9.0-rc1+ #197 Not tainted
[  309.078411] -------------------------------------------------------
[  309.079377] dnf/2468 is trying to acquire lock:
[  309.080082]  ([  309.080324] &type->s_vfs_rename_key
#2[  309.080942] ){+.+.+.}
, at: [  309.081435] [<ffffffff8129f652>] lock_rename+0x32/0x100
[  309.082261] 
[  309.082261] but task is already holding lock:
[  309.083158]  ([  309.083399] &mm->mmap_sem
){++++++}[  309.083974] , at: 
[  309.084316] [<ffffffff8121df0c>] vm_mmap_pgoff+0x8c/0x100
[  309.085150] 
[  309.085150] which lock already depends on the new lock.
[  309.085150] 
[  309.086393] 
[  309.086393] the existing dependency chain (in reverse order) is:
[  309.088279] 
-> #3[  309.088612]  (
&mm->mmap_sem[  309.089091] ){++++++}
[  309.089470] :
[  309.089735]        [  309.090046] [<ffffffff81112786>] lock_acquire+0xf6/0x1f0
[  309.090884]        [  309.091197] [<ffffffff812316c0>] __might_fault+0x70/0xa0
[  309.092047]        [  309.092357] [<ffffffff812aa585>] filldir+0xb5/0x140
[  309.093128]        [  309.093434] [<ffffffff8132f235>] call_filldir+0x65/0x130
[  309.094273]        [  309.094590] [<ffffffff8132fc0f>] ext4_readdir+0x6cf/0x8a0
[  309.095425]        [  309.095742] [<ffffffff812aa24b>] iterate_dir+0x17b/0x1b0
[  309.096572]        [  309.096878] [<ffffffff812aa76c>] SyS_getdents+0x9c/0x130
[  309.097716]        [  309.098026] [<ffffffff818a4bc1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  309.099005] 
-> #2[  309.099304]  (
&type->i_mutex_dir_key[  309.099888] #3
){++++++}[  309.100301] :
[  309.100576]        [  309.100881] [<ffffffff81112786>] lock_acquire+0xf6/0x1f0
[  309.101711]        [  309.102017] [<ffffffff818a1279>] down_write+0x49/0x80
[  309.102798]        [  309.103098] [<ffffffff8129fea5>] vfs_rmdir+0x55/0x140
[  309.103878]        [  309.104179] [<ffffffff812a5bdd>] do_rmdir+0x1bd/0x230
[  309.104958]        [  309.105256] [<ffffffff812a6a12>] SyS_unlinkat+0x22/0x30
[  309.106063]        [  309.106364] [<ffffffff818a4bc1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  309.107345] 
-> #1[  309.107661]  (
&type->i_mutex_dir_key[  309.108256] #3
/1[  309.108597] ){+.+.+.}
[  309.108971] :
[  309.109225]        [  309.109532] [<ffffffff81112786>] lock_acquire+0xf6/0x1f0
[  309.110370]        [  309.110686] [<ffffffff8110c8df>] down_write_nested+0x4f/0x80
[  309.111606]        [  309.111916] [<ffffffff8129f701>] lock_rename+0xe1/0x100
[  309.112752]        [  309.113062] [<ffffffff812a7842>] SyS_renameat+0x212/0x3f0
[  309.113918]        [  309.114224] [<ffffffff818a4bc1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  309.115218] 
-> #0[  309.115525]  (
&type->s_vfs_rename_key[  309.116138] #2
){+.+.+.}[  309.116564] :
[  309.116837]        [  309.117146] [<ffffffff811121a0>] __lock_acquire+0x1110/0x12a0
[  309.118047]        [  309.118354] [<ffffffff81112786>] lock_acquire+0xf6/0x1f0
[  309.119193]        [  309.119500] [<ffffffff818a0159>] mutex_lock_nested+0x79/0x3c0
[  309.120406]        [  309.120723] [<ffffffff8129f652>] lock_rename+0x32/0x100
[  309.121564]        [  309.121875] [<ffffffffa02d1617>] ovl_copy_up_one+0xf7/0x6a0 [overlay]
[  309.122866]        [  309.123170] [<ffffffffa02d1ccb>] ovl_copy_up+0x10b/0x13d [overlay]
[  309.124136]        [  309.124442] [<ffffffffa02cd7a5>] ovl_mmap+0x55/0x90 [overlay]
[  309.125348]        [  309.125677] [<ffffffff8123f3a4>] mmap_region+0x394/0x630
[  309.126510]        [  309.126829] [<ffffffff8123fa86>] do_mmap+0x446/0x530
[  309.127616]        [  309.127928] [<ffffffff8121df3d>] vm_mmap_pgoff+0xbd/0x100
[  309.128783]        [  309.129092] [<ffffffff8123d5c1>] SyS_mmap_pgoff+0x1c1/0x290
[  309.129973]        [  309.130284] [<ffffffff8103b71b>] SyS_mmap+0x1b/0x30
[  309.131058]        [  309.131368] [<ffffffff818a4bc1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  309.132377] 
[  309.132377] other info that might help us debug this:
[  309.132377] 
[  309.133608] Chain exists of:
  [  309.134084] &type->s_vfs_rename_key
#2[  309.134694]  --> 
&type->i_mutex_dir_key[  309.135324] #3
 --> [  309.135705] &mm->mmap_sem
[  309.136131] 
[  309.136131] 
[  309.136599]  Possible unsafe locking scenario:
[  309.136599] 
[  309.137499]        CPU0                    CPU1
[  309.138202]        ----                    ----
[  309.138905]   lock([  309.139211] &mm->mmap_sem
[  309.139646] );
[  309.139914]                                lock([  309.140602] &type->i_mutex_dir_key
#3[  309.141190] );
[  309.141472]                                lock([  309.142175] &mm->mmap_sem
[  309.142610] );
[  309.142877]   lock([  309.143186] &type->s_vfs_rename_key
#2[  309.143796] );
[  309.144084] 
[  309.144084]  *** DEADLOCK ***
[  309.144084] 
[  309.144991] 1 lock held by dnf/2468:
[  309.145543]  #0: [  309.145827]  (
&mm->mmap_sem[  309.146299] ){++++++}
, at: [  309.146782] [<ffffffff8121df0c>] vm_mmap_pgoff+0x8c/0x100
[  309.147634] 
[  309.147634] stack backtrace:
[  309.148310] CPU: 4 PID: 2468 Comm: dnf Not tainted 4.9.0-rc1+ #197
[  309.149257] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[  309.150717]  ffffc900033cf900 ffffffff8144b253 ffffffff828b29f0 ffffffff828e8d50
[  309.151942]  ffffc900033cf940 ffffffff8110f83e 00000000f7ac0000 ffff8801f7ac0860
[  309.153660]  ffff8801f7ac0000 ffff8801f7ac0838 0000000000000001 0000000000000000
[  309.154877] Call Trace:
[  309.155266]  [<ffffffff8144b253>] dump_stack+0x86/0xc3
[  309.156062]  [<ffffffff8110f83e>] print_circular_bug+0x1be/0x210
[  309.156991]  [<ffffffff811121a0>] __lock_acquire+0x1110/0x12a0
[  309.157896]  [<ffffffff8112ee3d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[  309.158912]  [<ffffffff813c3ed4>] ? avc_has_perm+0x34/0x290
[  309.159768]  [<ffffffff81112786>] lock_acquire+0xf6/0x1f0
[  309.160597]  [<ffffffff8129f652>] ? lock_rename+0x32/0x100
[  309.161438]  [<ffffffff818a0159>] mutex_lock_nested+0x79/0x3c0
[  309.162351]  [<ffffffff8129f652>] ? lock_rename+0x32/0x100
[  309.163202]  [<ffffffff813c92fb>] ? selinux_inode_getattr+0x8b/0xb0
[  309.164162]  [<ffffffff8129f652>] lock_rename+0x32/0x100
[  309.164981]  [<ffffffffa02d1617>] ovl_copy_up_one+0xf7/0x6a0 [overlay]
[  309.165987]  [<ffffffff813c3fd3>] ? avc_has_perm+0x133/0x290
[  309.166864]  [<ffffffff813c3ed4>] ? avc_has_perm+0x34/0x290
[  309.167723]  [<ffffffff810e348a>] ? __might_sleep+0x4a/0x80
[  309.168580]  [<ffffffffa02d1ccb>] ovl_copy_up+0x10b/0x13d [overlay]
[  309.169539]  [<ffffffffa02cd7a5>] ovl_mmap+0x55/0x90 [overlay]
[  309.170436]  [<ffffffff8123f3a4>] mmap_region+0x394/0x630
[  309.171269]  [<ffffffff8123fa86>] do_mmap+0x446/0x530
[  309.172064]  [<ffffffff8121df3d>] vm_mmap_pgoff+0xbd/0x100
[  309.172908]  [<ffffffff8123d5c1>] SyS_mmap_pgoff+0x1c1/0x290
[  309.173777]  [<ffffffff8103b71b>] SyS_mmap+0x1b/0x30
[  309.174539]  [<ffffffff818a4bc1>] entry_SYSCALL_64_fastpath+0x1f/0xc2

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux