Re: [PATCH] ceph: only dirty ITER_IOVEC pages for direct read

On Sun, Apr 01, 2018 at 06:33:31PM +0200, Ilya Dryomov wrote:
> On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
> >> (CC'ing Jeff)
> >>
> >> Ilya Dryomov <idryomov@xxxxxxxxx> writes:
> >>
> >> > On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
> >> > >
> >> > >
> >> > > > On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >> > > >
> >> > > > On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
> >> > > > >
> >> > > > >
> >> > > > > > On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >> > > > > >
> >> > > > > > On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
> >> > > > > >
> >> > > > > > Hi Zheng,
> >> > > > > >
> >> > > > > > Can you explain how this fixes the invalid memory access reported in
> >> > > > > > "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
> >> > > > > > with a different reproducer?
> >> > > > > >
> >> > > > >
> >> > > > > I don't know why KASAN reports a use-after-free (it looks like a false
> >> > > > > alarm) in this case. I compared the code with other filesystems and found
> >> > > > > recent commits that make them only dirty ITER_IOVEC pages. The change
> >> > > > > should also make KASAN silent.
> >> > > >
> >> > > > Did you manage to reproduce that KASAN report?
> >> > > >
> >> > >
> >> > > no
> >> > >
> >> > > > If it is a false positive, the patch shouldn't have been marked for
> >> > > > stable.
> >> > > >
> >> > >
> >> > > I CCed stable because fuse did the same:
> >> > >
> >> > > commit 8fba54aebbdf1f999738121922e74bf796ad60ee
> >> > > Author: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> >> > > Date:   Wed Aug 24 18:17:04 2016 +0200
> >> > >
> >> > >     fuse: direct-io: don't dirty ITER_BVEC pages
> >> > >
> >> > >     When reading from a loop device backed by a fuse file it deadlocks on
> >> > >     lock_page().
> >> > >
> >> > >     This is because the page is already locked by the read() operation done on
> >> > >     the loop device.  In this case we don't want to either lock the page or
> >> > >     dirty it.
> >> > >
> >> > >     So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
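
For context, the pattern that commit describes reduces to a type check on
the iterator before dirtying.  A minimal sketch, assuming the pages came
from iov_iter_get_pages_alloc() and using the iter_is_iovec() helper from
this era; put_read_pages() is a made-up name, not code from any of the
patches:

#include <linux/mm.h>
#include <linux/uio.h>

static void put_read_pages(struct iov_iter *iter, struct page **pages,
			   int num_pages)
{
	/*
	 * ITER_BVEC pages may already be locked by the caller (e.g. the
	 * loop driver's read()), so set_page_dirty_lock() on them can
	 * deadlock on lock_page().  Only dirty userland (ITER_IOVEC)
	 * pages.
	 */
	bool should_dirty = iter_is_iovec(iter);
	int i;

	for (i = 0; i < num_pages; i++) {
		if (should_dirty)
			set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}
}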
> >> >
> >> > OK, so it is a potential deadlock which has nothing to do with that
> >> > KASAN report.  I have rewritten the changelog to reflect that, please
> >> > take a look:
> >> >
> >> >   https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
> >>
> >> Just as a side note, it's still trivial to get a cephfs-related KASAN
> >> report by running a fuzzer (trinity in my case) against a mainline
> >> 4.16.0-rc7 kernel with this fix backported.
> >>
> >
> > I suspect trinity is clobbering the array
> >> As Jeff mentioned in a different thread, the splice syscall is broken
> >> on cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
> >> ceph-client testing branch.  I think there's some bug in that fix, as
> >> I still see an occasional crash, but I haven't yet finished debugging
> >> it.  Unfortunately, Jeff's fix is probably not appropriate for stable
> >> kernels (but maybe I'm wrong).
> >>
> >> Cheers,
> >
> > (cc'ing Al)
> >
> > Yeah, I should have revisited this a while ago. So the deal is that I
> > posted this patch upstream around a year ago, but Al didn't really like
> > it:
> >
> >     https://patchwork.kernel.org/patch/9541829/
> >
> > He wanted to add some iov_iter_for_each_page infrastructure and base
> > the new function on that. I had thought he had something queued up
> > along those lines at one point, but I don't see anything in mainline so
> > far.
> >
> > There is an iov_iter_for_each_range(), but it doesn't handle userland
> > iovecs and doesn't seem to care about alignment. I think we'll need
> > both here.
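
For reference, the v4.16 prototype of that helper (from include/linux/uio.h):

int iov_iter_for_each_range(struct iov_iter *i, size_t bytes,
			    int (*f)(struct kvec *vec, void *context),
			    void *context);

As far as I can tell from lib/iov_iter.c, the callback only ever sees
kernel kvecs (bvec pages are kmap'ed into a temporary kvec) and the
ITER_IOVEC case falls through to -EINVAL, which is why userland iovecs
aren't covered.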
> >
> > I'm not sure, however, that a per-page callback is really that helpful
> > for callers like ceph_direct_read_write. We'll just end up having to do
> > more work in the fs to batch up the pages before we add the op to the
> > req.
> >
> > iov_iter_get_pages_alloc is really sort of what we want in this case,
> > and we want that function to stitch together longer arrays when it can.
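
For reference, the v4.16 prototype of that primitive (from
include/linux/uio.h; the comment is a gloss, not from the header):

/* Returns the number of bytes covered; the page array is allocated by
 * the function itself and *start is the offset into the first page. */
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
				 size_t maxsize, size_t *start);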
> >
> > Al, do you have any thoughts on what we should do here?
> 
> I think this can be fixed entirely in ceph by walking away from page
> vectors (barely tested):
> 
>   https://github.com/ceph/ceph-client/commits/wip-splice
> 
> If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
> in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
> and maxsize parameters -- I didn't need maxbvecs and maxsize is just
> "make a copy of the iterator and truncate it").  Otherwise we will
> probably merge it into ceph, as this code has remained broken for way
> too long...
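
To make that concrete, here is a rough sketch of what such an
iter_get_bvecs-style helper could look like, built on iov_iter_get_pages()
(which is what the __iter_get_bvecs frame in the trace below ends up in).
The helper name, the batch size, and the caller-allocated bvec array are
assumptions, not the actual wip-splice code:

#include <linux/bvec.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/uio.h>

#define ITER_GET_BVECS_PAGES	64	/* batch size: an assumption */

/*
 * Sketch: convert up to @maxsize bytes of @iter into @bvecs, grabbing
 * page references with iov_iter_get_pages().  Returns the number of
 * bytes covered, or a negative error if nothing was converted.  The
 * caller must size @bvecs for iov_iter_npages() entries and drop the
 * page references when done.
 */
static ssize_t iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
			      struct bio_vec *bvecs)
{
	size_t size = 0;
	int bvec_idx = 0;

	while (size < maxsize && iov_iter_count(iter)) {
		struct page *pages[ITER_GET_BVECS_PAGES];
		ssize_t bytes;
		size_t start;
		int idx = 0;

		bytes = iov_iter_get_pages(iter, pages, maxsize - size,
					   ITER_GET_BVECS_PAGES, &start);
		if (bytes < 0)
			return size ?: bytes;

		iov_iter_advance(iter, bytes);
		size += bytes;

		for ( ; bytes; idx++, bvec_idx++) {
			int len = min_t(int, bytes, PAGE_SIZE - start);

			bvecs[bvec_idx] = (struct bio_vec) {
				.bv_page = pages[idx],
				.bv_len = len,
				.bv_offset = start,
			};
			bytes -= len;
			start = 0;	/* only the first page is offset */
		}
	}

	return size;
}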
> 
> Attached is the main patch from wip-splice.
> 
> Thanks,
> 
>                 Ilya

I haven't (yet) spent much time looking at the code, but a quick test
shows me the failure below in the sanity() checks (the ITER_PIPE
consistency checks) from lib/iov_iter.c: the iterator's idx = 7 doesn't
match what the pipe's curbuf = 8, nrbufs = 1 imply.

Cheers,
--
Luís


[  235.134588] idx = 7, offset = 4096
[  235.134917] curbuf = 8, nrbufs = 1, buffers = 16
[  235.135388] [          (null) 00000000a5376b82 0 4096]
[  235.135914] [          (null) 00000000fb52416a 0 4096]
[  235.136421] [          (null) 0000000009c7c0fc 0 4096]
[  235.136901] [          (null) 000000008b9911f0 0 4096]
[  235.137411] [          (null) 0000000042e4a3e2 0 4096]
[  235.137879] [          (null) 00000000110f4fc8 0 4096]
[  235.138358] [          (null) 00000000e16bdd11 0 4096]
[  235.138910] [          (null) 000000007d61a8f6 0 4096]
[  235.139399] [00000000af9d005b 000000003b50ae93 0 4096]
[  235.139879] [          (null) 00000000c55231a4 0 4096]
[  235.140378] [          (null) 00000000eba243d2 0 4096]
[  235.140851] [          (null) 00000000f5617952 0 4096]
[  235.141344] [          (null) 00000000fec2c691 0 4096]
[  235.141836] [          (null) 00000000ca09f9a4 0 4096]
[  235.142360] [          (null) 0000000087cfbb92 0 4096]
[  235.142886] [          (null) 000000009432c839 0 4096]
[  235.143441] WARNING: CPU: 1 PID: 261 at lib/iov_iter.c:350 sanity+0x162/0x210
[  235.144208] CPU: 1 PID: 261 Comm: trinity-c5 Not tainted 4.16.0-rc7+ #52
[  235.144942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[  235.146057] RIP: 0010:sanity+0x162/0x210
[  235.146509] RSP: 0018:ffff8800656970e8 EFLAGS: 00010246
[  235.147078] RAX: 0000000000000000 RBX: ffff88006852c868 RCX: ffffffff81589cb3
[  235.147841] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88006852c8a8
[  235.148636] RBP: 0000000000000010 R08: fffffbfff059aa08 R09: ffff88006852c8e0
[  235.149470] R10: 0000000000000003 R11: fffffbfff059aa07 R12: ffff88006406dd38
[  235.150340] R13: ffff88006852c8a8 R14: 0000000000000000 R15: 0000000000001000
[  235.151184] FS:  00007fd75163eb40(0000) GS:ffff88006d700000(0000) knlGS:0000000000000000
[  235.152028] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  235.152562] CR2: 00007fd751501000 CR3: 000000006a940000 CR4: 00000000000006a0
[  235.153232] DR0: 00007fd7514fe000 DR1: 0000000000000000 DR2: 0000000000000000
[  235.153887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000f0602
[  235.154644] Call Trace:
[  235.154914]  iov_iter_get_pages+0x2a3/0x680
[  235.155377]  ? iov_iter_gap_alignment+0x490/0x490
[  235.155882]  ? iov_iter_advance+0x360/0x780
[  235.156334]  ? generic_file_splice_read+0x21b/0x320
[  235.156993]  ? SyS_splice+0x914/0x9a0
[  235.157498]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  235.158177]  ? iov_iter_copy_from_user_atomic+0x6e0/0x6e0
[  235.158957]  ? kmemleak_disable+0x70/0x70
[  235.159456]  __iter_get_bvecs+0x117/0x290
[  235.159955]  ? ceph_fallocate+0xc10/0xc10
[  235.160455]  ? iov_iter_npages+0x309/0x540
[  235.160974]  ? __kmalloc+0xf9/0x1f0
[  235.161410]  ceph_direct_read_write+0x4e7/0x1380
[  235.161981]  ? ceph_aio_complete_req+0x750/0x750
[  235.162503]  ? ceph_write_iter+0x103b/0x1530
[  235.162986]  ? __rb_erase_color+0xd50/0xd50
[  235.163449]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  235.164043]  ? ceph_fscache_revalidate_cookie+0x80/0x1d0
[  235.164630]  ? ceph_fscache_unregister_fs+0x1f0/0x1f0
[  235.165201]  ? __ceph_mark_dirty_caps+0x212/0x610
[  235.165713]  ? ceph_get_caps+0x2bc/0x700
[  235.166147]  ? _raw_spin_unlock_bh+0xc0/0xc0
[  235.166681]  ? ceph_put_cap_refs+0x5a0/0x5a0
[  235.167182]  ? __delete_object+0xcd/0x100
[  235.167636]  ? put_object+0x40/0x40
[  235.168033]  ? up_read+0x20/0x20
[  235.168396]  ? ceph_write_iter+0x103b/0x1530
[  235.168865]  ? kmem_cache_free+0x80/0x1d0
[  235.169252]  ? ceph_write_iter+0x1047/0x1530
[  235.169692]  ? ceph_read_iter+0xc3e/0x10b0
[  235.170097]  ceph_read_iter+0xc3e/0x10b0
[  235.170519]  ? __tty_buffer_request_room+0xd0/0x320
[  235.171138]  ? ceph_write_iter+0x1530/0x1530
[  235.171667]  ? timerqueue_add+0xd2/0x100
[  235.172239]  ? enqueue_hrtimer+0xfb/0x1d0
[  235.172789]  ? hrtimer_reprogram+0x130/0x130
[  235.173361]  ? _raw_spin_unlock_bh+0xc0/0xc0
[  235.173913]  ? _raw_spin_unlock_bh+0xc0/0xc0
[  235.174464]  ? hrtimer_start_range_ns+0x32d/0x440
[  235.174979]  ? hrtimer_forward+0x120/0x120
[  235.175394]  ? hrtimer_try_to_cancel+0x83/0x250
[  235.176033]  ? hrtimer_active+0x240/0x240
[  235.176630]  ? __hrtimer_get_remaining+0xc6/0x100
[  235.177285]  ? hrtimer_start_range_ns+0x440/0x440
[  235.177980]  ? generic_file_splice_read+0x21b/0x320
[  235.178694]  ? ceph_write_iter+0x1530/0x1530
[  235.179325]  generic_file_splice_read+0x21b/0x320
[  235.180011]  ? splice_shrink_spd+0x40/0x40
[  235.180619]  ? schedule+0x50/0x280
[  235.181126]  ? rw_verify_area+0x1f/0x70
[  235.181641]  ? do_splice_to+0x6a/0xc0
[  235.182075]  SyS_splice+0x914/0x9a0
[  235.182405]  ? __task_pid_nr_ns+0x1a9/0x1c0
[  235.182825]  ? compat_SyS_vmsplice+0x100/0x100
[  235.183266]  ? do_syscall_64+0x8d/0x300
[  235.183701]  ? compat_SyS_vmsplice+0x100/0x100
[  235.184223]  do_syscall_64+0x164/0x300
[  235.184662]  ? syscall_return_slowpath+0x1c0/0x1c0
[  235.185213]  ? page_fault+0x25/0x50
[  235.185616]  ? syscall_return_slowpath+0x13c/0x1c0
[  235.186124]  ? prepare_exit_to_usermode+0xdb/0x140
[  235.186692]  ? syscall_trace_enter+0x320/0x320
[  235.187206]  ? __put_user_4+0x1c/0x30
[  235.187858]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  235.188726] RIP: 0033:0x7fd750f61229
[  235.189312] RSP: 002b:00007fff1ad745c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
[  235.190568] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd750f61229
[  235.191687] RDX: 000000000000003d RSI: 0000000000000000 RDI: 0000000000000098
[  235.192598] RBP: 00007fd751614000 R08: 0000000000fffff7 R09: 0000000000000001
[  235.193528] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000113
[  235.194443] R13: 00007fd751614048 R14: 00007fd75163ead8 R15: 00007fd751614000
[  235.195364] Code: 4c 89 ff e8 f1 43 cf ff 8b 73 3c 44 89 e1 48 c7 c7 80 13 11 82 89 ea e8 87 8d b8 ff 4c 89 ef e8 d5 43 cf ff 8b 43 40 85 c0 75 13 <0f> 0b 48 83 c4 10 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 8d 
[  235.197769] ---[ end trace 0228543143d962c4 ]---