iter_get_bvecs_alloc() copies the iov_iter. iov_iter_advance() is coded for both the old and the new iov_iter. I think pipe_advance() did something bad.

Regards
Yan, Zheng

> On Apr 3, 2018, at 20:46, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Tue, 2018-04-03 at 09:16 +0100, Luis Henriques wrote:
>> On Mon, Apr 02, 2018 at 12:27:28PM -0400, Jeff Layton wrote:
>>> On Mon, 2018-04-02 at 15:36 +0100, Luis Henriques wrote:
>>>> On Sun, Apr 01, 2018 at 06:33:31PM +0200, Ilya Dryomov wrote:
>>>>> On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>>> On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
>>>>>>> (CC'ing Jeff)
>>>>>>>
>>>>>>> Ilya Dryomov <idryomov@xxxxxxxxx> writes:
>>>>>>>
>>>>>>>> On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Zheng,
>>>>>>>>>>>>
>>>>>>>>>>>> Can you explain how this fixes the invalid memory access reported in
>>>>>>>>>>>> "KASAN: use-after-free Read in set_page_dirty_lock"? Did you come up
>>>>>>>>>>>> with a different reproducer?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I don't know why KASAN reports a use-after-free (it looks like a false alarm) in
>>>>>>>>>>> this case. I compared the code with other filesystems and found that recent commits
>>>>>>>>>>> make other filesystems only dirty ITER_IOVEC pages. The change should also
>>>>>>>>>>> make KASAN silent.
>>>>>>>>>>
>>>>>>>>>> Did you manage to reproduce that KASAN report?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> no
>>>>>>>>>
>>>>>>>>>> If it is a false positive, the patch shouldn't have been marked for
>>>>>>>>>> stable.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I CCed stable because fuse did the same
>>>>>>>>>
>>>>>>>>> commit 8fba54aebbdf1f999738121922e74bf796ad60ee
>>>>>>>>> Author: Miklos Szeredi <mszeredi@xxxxxxxxxx>
>>>>>>>>> Date: Wed Aug 24 18:17:04 2016 +0200
>>>>>>>>>
>>>>>>>>> fuse: direct-io: don't dirty ITER_BVEC pages
>>>>>>>>>
>>>>>>>>> When reading from a loop device backed by a fuse file it deadlocks on
>>>>>>>>> lock_page().
>>>>>>>>>
>>>>>>>>> This is because the page is already locked by the read() operation done on
>>>>>>>>> the loop device. In this case we don't want to either lock the page or
>>>>>>>>> dirty it.
>>>>>>>>>
>>>>>>>>> So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
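
(The pattern that commit describes boils down to something like the sketch
below. This is only an illustration -- not the actual fuse or ceph code --
and the helper name and parameters are made up for the example.)

#include <linux/uio.h>
#include <linux/mm.h>

/*
 * Sketch: after a direct read has completed into pinned pages, dirty
 * them only when they came from a userspace iovec (ITER_IOVEC).
 * ITER_BVEC pages (e.g. a loop device read coming in through splice)
 * may already be locked by the caller, so calling set_page_dirty_lock()
 * on them can deadlock.
 */
static void put_dio_read_pages(struct iov_iter *iter, struct page **pages,
                               int num_pages)
{
        int i;

        for (i = 0; i < num_pages; i++) {
                if (iter_is_iovec(iter))        /* skip ITER_BVEC/ITER_KVEC */
                        set_page_dirty_lock(pages[i]);
                put_page(pages[i]);
        }
}
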
>>>>>>>>
>>>>>>>> OK, so it is a potential deadlock which has nothing to do with that
>>>>>>>> KASAN report. I have rewritten the changelog to reflect that, please
>>>>>>>> take a look:
>>>>>>>>
>>>>>>>> https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
>>>>>>>
>>>>>>> Just as a side note, it's still trivial to get a cephfs-related KASAN
>>>>>>> report by running a fuzzer (trinity in my case) against a mainline
>>>>>>> kernel 4.16.0-rc7 with this fix backported.
>>>>>>>
>>>>>>
>>>>>> I suspect trinity is clobbering the array
>>>>>>> As Jeff mentioned in a different thread, splice syscall is broken on
>>>>>>> cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
>>>>>>> ceph-client testing branch. I still think there's some bug in Jeff's
>>>>>>> fix as I still see a crash occasionally, but I haven't yet finished
>>>>>>> debugging it. Unfortunately, Jeff's fix is probably not appropriate for
>>>>>>> stable kernels (but maybe I'm wrong).
>>>>>>>
>>>>>>> Cheers,
>>>>>>
>>>>>> (cc'ing Al)
>>>>>>
>>>>>> Yeah, I should have revisited this a while ago. So the deal is that I
>>>>>> posted this patch upstream around a year ago, but Al didn't really like
>>>>>> it:
>>>>>>
>>>>>> https://patchwork.kernel.org/patch/9541829/
>>>>>>
>>>>>> He wanted to add some iov_iter_for_each_page infrastructure and base
>>>>>> the new function on that. I had thought he had something queued up
>>>>>> along those lines at one point, but I don't see anything in mainline so
>>>>>> far.
>>>>>>
>>>>>> There is an iov_iter_for_each_range, but it doesn't handle userland
>>>>>> iovecs and doesn't seem to care about alignment. I think we'll need
>>>>>> both here.
>>>>>>
>>>>>> I'm not sure however that a per-page callback is really that helpful
>>>>>> for callers like ceph_direct_read_write. We'll just end up having to do
>>>>>> more work in the fs to batch up the pages before we add the op to the
>>>>>> req.
>>>>>>
>>>>>> iov_iter_get_pages_alloc is really sort of what we want in this case,
>>>>>> and we want that function to stitch together longer arrays when it can.
>>>>>>
>>>>>> Al, do you have any thoughts on what we should do here?
>>>>>
>>>>> I think this can be fixed entirely in ceph by walking away from page
>>>>> vectors (barely tested):
>>>>>
>>>>> https://github.com/ceph/ceph-client/commits/wip-splice
>>>>>
>>>>> If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
>>>>> in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
>>>>> and maxsize parameters -- I didn't need maxbvecs, and maxsize is just
>>>>> "make a copy of the iterator and truncate it"). Otherwise we will
>>>>> probably merge it into ceph, as this code has remained broken for way
>>>>> too long...
>>>>>
>>>>> Attached is the main patch from wip-splice.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Ilya
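
(To make the "copy the iterator and truncate it" idea concrete, a helper of
that shape could look roughly like the sketch below. This is only a sketch,
with made-up maxbvecs/maxsize handling -- it is not the code from wip-splice.)

#include <linux/uio.h>
#include <linux/bvec.h>
#include <linux/mm.h>
#include <linux/kernel.h>

/*
 * Sketch: fill up to @maxbvecs bio_vecs from at most @maxsize bytes of
 * @iter, working on a private copy so the caller's iterator position is
 * untouched.  Pages are pinned by iov_iter_get_pages(); the caller is
 * expected to put_page() each bvec page once it is done with them.
 */
static ssize_t iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
                              struct bio_vec *bvecs, int maxbvecs,
                              int *num_bvecs)
{
        struct iov_iter i = *iter;      /* private copy of the iterator */
        ssize_t total = 0;
        int n = 0;

        iov_iter_truncate(&i, maxsize);

        while (iov_iter_count(&i) && n < maxbvecs) {
                struct page *pages[16];
                unsigned int maxpages = min_t(int, ARRAY_SIZE(pages),
                                              maxbvecs - n);
                size_t start;
                ssize_t bytes;
                int j = 0;

                bytes = iov_iter_get_pages(&i, pages, SIZE_MAX, maxpages,
                                           &start);
                if (bytes < 0) {
                        while (n > 0)
                                put_page(bvecs[--n].bv_page);
                        return bytes;
                }
                if (!bytes)
                        break;
                iov_iter_advance(&i, bytes);
                total += bytes;

                /* carve the contiguous range into page-sized bvecs */
                while (bytes > 0) {
                        size_t len = min_t(size_t, bytes, PAGE_SIZE - start);

                        bvecs[n].bv_page = pages[j++];
                        bvecs[n].bv_offset = start;
                        bvecs[n].bv_len = len;
                        n++;
                        bytes -= len;
                        start = 0;
                }
        }

        *num_bvecs = n;
        return total;
}

(The stack copy is what makes "maxsize is just a copy of the iterator,
truncated" work without disturbing the caller's position in the iterator.)
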
>>>>
>>>> I haven't (yet) spent much time looking at the code, but a quick test
>>>> shows me the below failure in the sanity() tests from lib/iov_iter.c.
>>>>
>>>> Cheers,
>>>> --
>>>> Luís
>>>>
>>>>
>>>> [ 235.134588] idx = 7, offset = 4096
>>>> [ 235.134917] curbuf = 8, nrbufs = 1, buffers = 16
>>>> [ 235.135388] [ (null) 00000000a5376b82 0 4096]
>>>> [ 235.135914] [ (null) 00000000fb52416a 0 4096]
>>>> [ 235.136421] [ (null) 0000000009c7c0fc 0 4096]
>>>> [ 235.136901] [ (null) 000000008b9911f0 0 4096]
>>>> [ 235.137411] [ (null) 0000000042e4a3e2 0 4096]
>>>> [ 235.137879] [ (null) 00000000110f4fc8 0 4096]
>>>> [ 235.138358] [ (null) 00000000e16bdd11 0 4096]
>>>> [ 235.138910] [ (null) 000000007d61a8f6 0 4096]
>>>> [ 235.139399] [00000000af9d005b 000000003b50ae93 0 4096]
>>>> [ 235.139879] [ (null) 00000000c55231a4 0 4096]
>>>> [ 235.140378] [ (null) 00000000eba243d2 0 4096]
>>>> [ 235.140851] [ (null) 00000000f5617952 0 4096]
>>>> [ 235.141344] [ (null) 00000000fec2c691 0 4096]
>>>> [ 235.141836] [ (null) 00000000ca09f9a4 0 4096]
>>>> [ 235.142360] [ (null) 0000000087cfbb92 0 4096]
>>>> [ 235.142886] [ (null) 000000009432c839 0 4096]
>>>> [ 235.143441] WARNING: CPU: 1 PID: 261 at lib/iov_iter.c:350 sanity+0x162/0x210
>>>> [ 235.144208] CPU: 1 PID: 261 Comm: trinity-c5 Not tainted 4.16.0-rc7+ #52
>>>> [ 235.144942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>>> [ 235.146057] RIP: 0010:sanity+0x162/0x210
>>>> [ 235.146509] RSP: 0018:ffff8800656970e8 EFLAGS: 00010246
>>>> [ 235.147078] RAX: 0000000000000000 RBX: ffff88006852c868 RCX: ffffffff81589cb3
>>>> [ 235.147841] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88006852c8a8
>>>> [ 235.148636] RBP: 0000000000000010 R08: fffffbfff059aa08 R09: ffff88006852c8e0
>>>> [ 235.149470] R10: 0000000000000003 R11: fffffbfff059aa07 R12: ffff88006406dd38
>>>> [ 235.150340] R13: ffff88006852c8a8 R14: 0000000000000000 R15: 0000000000001000
>>>> [ 235.151184] FS: 00007fd75163eb40(0000) GS:ffff88006d700000(0000) knlGS:0000000000000000
>>>> [ 235.152028] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 235.152562] CR2: 00007fd751501000 CR3: 000000006a940000 CR4: 00000000000006a0
>>>> [ 235.153232] DR0: 00007fd7514fe000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [ 235.153887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000f0602
>>>> [ 235.154644] Call Trace:
>>>> [ 235.154914] iov_iter_get_pages+0x2a3/0x680
>>>> [ 235.155377] ? iov_iter_gap_alignment+0x490/0x490
>>>> [ 235.155882] ? iov_iter_advance+0x360/0x780
>>>> [ 235.156334] ? generic_file_splice_read+0x21b/0x320
>>>> [ 235.156993] ? SyS_splice+0x914/0x9a0
>>>> [ 235.157498] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>> [ 235.158177] ? iov_iter_copy_from_user_atomic+0x6e0/0x6e0
>>>> [ 235.158957] ? kmemleak_disable+0x70/0x70
>>>> [ 235.159456] __iter_get_bvecs+0x117/0x290
>>>> [ 235.159955] ? ceph_fallocate+0xc10/0xc10
>>>> [ 235.160455] ? iov_iter_npages+0x309/0x540
>>>> [ 235.160974] ? __kmalloc+0xf9/0x1f0
>>>> [ 235.161410] ceph_direct_read_write+0x4e7/0x1380
>>>> [ 235.161981] ? ceph_aio_complete_req+0x750/0x750
>>>> [ 235.162503] ? ceph_write_iter+0x103b/0x1530
>>>> [ 235.162986] ? __rb_erase_color+0xd50/0xd50
>>>> [ 235.163449] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>> [ 235.164043] ? ceph_fscache_revalidate_cookie+0x80/0x1d0
>>>> [ 235.164630] ? ceph_fscache_unregister_fs+0x1f0/0x1f0
>>>> [ 235.165201] ? __ceph_mark_dirty_caps+0x212/0x610
>>>> [ 235.165713] ? ceph_get_caps+0x2bc/0x700
>>>> [ 235.166147] ? _raw_spin_unlock_bh+0xc0/0xc0
>>>> [ 235.166681] ? ceph_put_cap_refs+0x5a0/0x5a0
>>>> [ 235.167182] ? __delete_object+0xcd/0x100
>>>> [ 235.167636] ? put_object+0x40/0x40
>>>> [ 235.168033] ? up_read+0x20/0x20
>>>> [ 235.168396] ? ceph_write_iter+0x103b/0x1530
>>>> [ 235.168865] ? kmem_cache_free+0x80/0x1d0
>>>> [ 235.169252] ? ceph_write_iter+0x1047/0x1530
>>>> [ 235.169692] ? ceph_read_iter+0xc3e/0x10b0
>>>> [ 235.170097] ceph_read_iter+0xc3e/0x10b0
>>>> [ 235.170519] ? __tty_buffer_request_room+0xd0/0x320
>>>> [ 235.171138] ? ceph_write_iter+0x1530/0x1530
>>>> [ 235.171667] ? timerqueue_add+0xd2/0x100
>>>> [ 235.172239] ? enqueue_hrtimer+0xfb/0x1d0
>>>> [ 235.172789] ? hrtimer_reprogram+0x130/0x130
>>>> [ 235.173361] ? _raw_spin_unlock_bh+0xc0/0xc0
>>>> [ 235.173913] ? _raw_spin_unlock_bh+0xc0/0xc0
>>>> [ 235.174464] ? hrtimer_start_range_ns+0x32d/0x440
>>>> [ 235.174979] ? hrtimer_forward+0x120/0x120
>>>> [ 235.175394] ? hrtimer_try_to_cancel+0x83/0x250
>>>> [ 235.176033] ? hrtimer_active+0x240/0x240
>>>> [ 235.176630] ? __hrtimer_get_remaining+0xc6/0x100
>>>> [ 235.177285] ? hrtimer_start_range_ns+0x440/0x440
>>>> [ 235.177980] ? generic_file_splice_read+0x21b/0x320
>>>> [ 235.178694] ? ceph_write_iter+0x1530/0x1530
>>>> [ 235.179325] generic_file_splice_read+0x21b/0x320
>>>> [ 235.180011] ? splice_shrink_spd+0x40/0x40
>>>> [ 235.180619] ? schedule+0x50/0x280
>>>> [ 235.181126] ? rw_verify_area+0x1f/0x70
>>>> [ 235.181641] ? do_splice_to+0x6a/0xc0
>>>> [ 235.182075] SyS_splice+0x914/0x9a0
>>>> [ 235.182405] ? __task_pid_nr_ns+0x1a9/0x1c0
>>>> [ 235.182825] ? compat_SyS_vmsplice+0x100/0x100
>>>> [ 235.183266] ? do_syscall_64+0x8d/0x300
>>>> [ 235.183701] ? compat_SyS_vmsplice+0x100/0x100
>>>> [ 235.184223] do_syscall_64+0x164/0x300
>>>> [ 235.184662] ? syscall_return_slowpath+0x1c0/0x1c0
>>>> [ 235.185213] ? page_fault+0x25/0x50
>>>> [ 235.185616] ? syscall_return_slowpath+0x13c/0x1c0
>>>> [ 235.186124] ? prepare_exit_to_usermode+0xdb/0x140
>>>> [ 235.186692] ? syscall_trace_enter+0x320/0x320
>>>> [ 235.187206] ? __put_user_4+0x1c/0x30
>>>> [ 235.187858] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>> [ 235.188726] RIP: 0033:0x7fd750f61229
>>>> [ 235.189312] RSP: 002b:00007fff1ad745c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
>>>> [ 235.190568] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd750f61229
>>>> [ 235.191687] RDX: 000000000000003d RSI: 0000000000000000 RDI: 0000000000000098
>>>> [ 235.192598] RBP: 00007fd751614000 R08: 0000000000fffff7 R09: 0000000000000001
>>>> [ 235.193528] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000113
>>>> [ 235.194443] R13: 00007fd751614048 R14: 00007fd75163ead8 R15: 00007fd751614000
>>>> [ 235.195364] Code: 4c 89 ff e8 f1 43 cf ff 8b 73 3c 44 89 e1 48 c7 c7 80 13 11 82 89 ea e8 87 8d b8 ff 4c 89 ef e8 d5 43 cf ff 8b 43 40 85 c0 75 13 <0f> 0b 48 83 c4 10 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 8d
>>>> [ 235.197769] ---[ end trace 0228543143d962c4 ]---
>>>
>>> Interesting. It really looks like trinity is thrashing the pipe in such
>>> a way to create races while we're trying to iterate over it? ceph is
>>> just calling pipe_get_pages, but something is hosed down inside the pipe
>>> tracking and it fails before we even get to doing anything with the
>>> pipe.
>>>
>>> I'm not that familiar with the pipe code, so I'm not sure what prevents
>>> that info in the pipe inode from changing while we're trying to sanity
>>> check it here.
>>>
>
> Looks like the pipe_lock is what protects it, and the splice code takes
> that at a fairly high level. I think the problem is elsewhere.
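
For what it's worth, the numbers in that warning look inconsistent all by
themselves. As I read the pipe branch of sanity() in lib/iov_iter.c, a
non-zero iov_offset means idx has to name the last occupied slot of the
pipe ring. A stand-alone toy version of that check (a paraphrase, not the
kernel code) comes out false for the reported values:

#include <stdio.h>
#include <stdbool.h>

/*
 * Toy userspace paraphrase of the consistency check that the pipe branch
 * of sanity() appears to enforce (not a copy of lib/iov_iter.c): with a
 * non-zero iov_offset, idx must point at the last occupied slot of the
 * ring, i.e. (curbuf + nrbufs - 1) masked by the ring size.
 */
static bool pipe_iter_consistent(unsigned idx, unsigned iov_offset,
                                 unsigned curbuf, unsigned nrbufs,
                                 unsigned buffers)
{
        if (!iov_offset)
                return true;    /* the iov_offset == 0 case is checked differently */
        if (!nrbufs)
                return false;   /* an offset into an empty pipe can't be right */
        return idx == ((curbuf + nrbufs - 1) & (buffers - 1));
}

int main(void)
{
        /* values from the warning: idx = 7, offset = 4096,
         * curbuf = 8, nrbufs = 1, buffers = 16 */
        printf("%d\n", pipe_iter_consistent(7, 4096, 8, 1, 16));  /* prints 0 */
        return 0;
}

In other words, the iterator thinks it stopped in slot 7 while the pipe says
its only occupied slot is 8, which would fit the iterator and the pipe
getting out of sync somewhere in the advance path.
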
>
>>> Does this testcase work properly on other filesystems?
>>
>> I can't reproduce it on btrfs or ext4, but it's possible that it's just
>> more difficult to trigger.
>>
>
> I can reproduce what you're seeing with trinity. It fails almost
> immediately for me on cephfs, but runs more or less indefinitely on xfs.
> That suggests that there is something wrong with what's currently in
> ceph-client.
>
> What's there _does_ seem to fix xfstests generic/095, which implies that
> there is some raciness involved.
>
> One interesting bit:
>
> [ 235.134588] idx = 7, offset = 4096
> [ 235.134917] curbuf = 8, nrbufs = 1, buffers = 16
>
> The offset always seems to be 4096 when I see this problem. I'm not sure
> if that's significant yet.
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>