On Mon, Apr 02, 2018 at 12:27:28PM -0400, Jeff Layton wrote:
> On Mon, 2018-04-02 at 15:36 +0100, Luis Henriques wrote:
> > On Sun, Apr 01, 2018 at 06:33:31PM +0200, Ilya Dryomov wrote:
> > > On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
> > > > > (CC'ing Jeff)
> > > > >
> > > > > Ilya Dryomov <idryomov@xxxxxxxxx> writes:
> > > > > >
> > > > > > On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > > On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Zheng,
> > > > > > > > > >
> > > > > > > > > > Can you explain how this fixes the invalid memory access reported in "KASAN: use-after-free Read in set_page_dirty_lock"? Did you come up with a different reproducer?
> > > > > > > > >
> > > > > > > > > I don't know why KASAN reports use-after-free (like false alarm) in this case. I compared the code with other filesystems, found recent commits make other filesystems only dirty ITER_IOVEC pages. The change should also make KASAN silent.
> > > > > > > >
> > > > > > > > Did you manage to reproduce that KASAN report?
> > > > > > >
> > > > > > > no
> > > > > > > >
> > > > > > > > If it is a false positive, the patch shouldn't have been marked for stable.
> > > > > > >
> > > > > > > I CCed stable because fuse did the same
> > > > > > >
> > > > > > > commit 8fba54aebbdf1f999738121922e74bf796ad60ee
> > > > > > > Author: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> > > > > > > Date: Wed Aug 24 18:17:04 2016 +0200
> > > > > > >
> > > > > > > fuse: direct-io: don't dirty ITER_BVEC pages
> > > > > > >
> > > > > > > When reading from a loop device backed by a fuse file it deadlocks on lock_page().
> > > > > > >
> > > > > > > This is because the page is already locked by the read() operation done on the loop device. In this case we don't want to either lock the page or dirty it.
> > > > > > >
> > > > > > > So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
> > > > > >
> > > > > > OK, so it is a potential deadlock which has nothing to do with that KASAN report. I have rewritten the changelog to reflect that, please take a look:
> > > > > >
> > > > > > https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
> > > > >
> > > > > Just as a side note, it's still trivial to get a cephfs-related KASAN report by running a fuzzer (trinity in my case) against a mainline kernel 4.16.0-rc7 with this fix backported.
> > > >
> > > > I suspect trinity is clobbering the array
> > > > >
> > > > > As Jeff mentioned in a different thread, splice syscall is broken on cephfs and the fix for it is still tagged as "DO NOT MERGE" in the ceph-client testing branch.
> > > > > I still think there's some bug in Jeff's fix as I still see a crash occasionally, but I haven't yet finished debugging it. Unfortunately, Jeff's fix is probably not appropriate for stable kernels (but maybe I'm wrong).
> > > > >
> > > > > Cheers,
> > > >
> > > > (cc'ing Al)
> > > >
> > > > Yeah, I should have revisited this a while ago. So the deal is that I posted this patch upstream around a year ago, but Al didn't really like it:
> > > >
> > > > https://patchwork.kernel.org/patch/9541829/
> > > >
> > > > He wanted to add some iov_iter_for_each_page infrastructure and base the new function on that. I had thought he had something queued up along those lines at one point, but I don't see anything in mainline so far.
> > > >
> > > > There is an iov_iter_for_each_range, but it doesn't handle userland iovecs and doesn't seem to care about alignment. I think we'll need both here.
> > > >
> > > > I'm not sure however that a per-page callback is really that helpful for callers like ceph_direct_read_write. We'll just end up having to do more work in the fs to batch up the pages before we add the op to the req.
> > > >
> > > > iov_iter_get_pages_alloc is really sort of what we want in this case, and we want that function to stitch together longer arrays when it can.
> > > >
> > > > Al, do you have any thoughts on what we should do here?
> > >
> > > I think this can be fixed entirely in ceph by walking away from page vectors (barely tested):
> > >
> > > https://github.com/ceph/ceph-client/commits/wip-splice
> > >
> > > If folks would like to see something like iov_iter_get_bvecs{,_alloc}() in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs and maxsize parameters -- I didn't need maxbvecs and maxsize is just "make a copy of the iterator and truncate it"). Otherwise we will probably merge it into ceph, as this code has remained broken for way too long...
> > >
> > > Attached is the main patch from wip-splice.
> > >
> > > Thanks,
> > >
> > > Ilya
> >
> > I haven't (yet) spent much time looking at the code, but a quick test shows me the below failure in the sanity() tests from lib/iov_iter.c.
> >
> > Cheers,
> > --
> > Luís
> >
> > [ 235.134588] idx = 7, offset = 4096
> > [ 235.134917] curbuf = 8, nrbufs = 1, buffers = 16
> > [ 235.135388] [ (null) 00000000a5376b82 0 4096]
> > [ 235.135914] [ (null) 00000000fb52416a 0 4096]
> > [ 235.136421] [ (null) 0000000009c7c0fc 0 4096]
> > [ 235.136901] [ (null) 000000008b9911f0 0 4096]
> > [ 235.137411] [ (null) 0000000042e4a3e2 0 4096]
> > [ 235.137879] [ (null) 00000000110f4fc8 0 4096]
> > [ 235.138358] [ (null) 00000000e16bdd11 0 4096]
> > [ 235.138910] [ (null) 000000007d61a8f6 0 4096]
> > [ 235.139399] [00000000af9d005b 000000003b50ae93 0 4096]
> > [ 235.139879] [ (null) 00000000c55231a4 0 4096]
> > [ 235.140378] [ (null) 00000000eba243d2 0 4096]
> > [ 235.140851] [ (null) 00000000f5617952 0 4096]
> > [ 235.141344] [ (null) 00000000fec2c691 0 4096]
> > [ 235.141836] [ (null) 00000000ca09f9a4 0 4096]
> > [ 235.142360] [ (null) 0000000087cfbb92 0 4096]
> > [ 235.142886] [ (null) 000000009432c839 0 4096]
> > [ 235.143441] WARNING: CPU: 1 PID: 261 at lib/iov_iter.c:350 sanity+0x162/0x210
> > [ 235.144208] CPU: 1 PID: 261 Comm: trinity-c5 Not tainted 4.16.0-rc7+ #52
> > [ 235.144942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> > [ 235.146057] RIP: 0010:sanity+0x162/0x210
> > [ 235.146509] RSP: 0018:ffff8800656970e8 EFLAGS: 00010246
> > [ 235.147078] RAX: 0000000000000000 RBX: ffff88006852c868 RCX: ffffffff81589cb3
> > [ 235.147841] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88006852c8a8
> > [ 235.148636] RBP: 0000000000000010 R08: fffffbfff059aa08 R09: ffff88006852c8e0
> > [ 235.149470] R10: 0000000000000003 R11: fffffbfff059aa07 R12: ffff88006406dd38
> > [ 235.150340] R13: ffff88006852c8a8 R14: 0000000000000000 R15: 0000000000001000
> > [ 235.151184] FS: 00007fd75163eb40(0000) GS:ffff88006d700000(0000) knlGS:0000000000000000
> > [ 235.152028] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 235.152562] CR2: 00007fd751501000 CR3: 000000006a940000 CR4: 00000000000006a0
> > [ 235.153232] DR0: 00007fd7514fe000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 235.153887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000f0602
> > [ 235.154644] Call Trace:
> > [ 235.154914] iov_iter_get_pages+0x2a3/0x680
> > [ 235.155377] ? iov_iter_gap_alignment+0x490/0x490
> > [ 235.155882] ? iov_iter_advance+0x360/0x780
> > [ 235.156334] ? generic_file_splice_read+0x21b/0x320
> > [ 235.156993] ? SyS_splice+0x914/0x9a0
> > [ 235.157498] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [ 235.158177] ? iov_iter_copy_from_user_atomic+0x6e0/0x6e0
> > [ 235.158957] ? kmemleak_disable+0x70/0x70
> > [ 235.159456] __iter_get_bvecs+0x117/0x290
> > [ 235.159955] ? ceph_fallocate+0xc10/0xc10
> > [ 235.160455] ? iov_iter_npages+0x309/0x540
> > [ 235.160974] ? __kmalloc+0xf9/0x1f0
> > [ 235.161410] ceph_direct_read_write+0x4e7/0x1380
> > [ 235.161981] ? ceph_aio_complete_req+0x750/0x750
> > [ 235.162503] ? ceph_write_iter+0x103b/0x1530
> > [ 235.162986] ? __rb_erase_color+0xd50/0xd50
> > [ 235.163449] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [ 235.164043] ? ceph_fscache_revalidate_cookie+0x80/0x1d0
> > [ 235.164630] ? ceph_fscache_unregister_fs+0x1f0/0x1f0
> > [ 235.165201] ? __ceph_mark_dirty_caps+0x212/0x610
> > [ 235.165713] ? ceph_get_caps+0x2bc/0x700
> > [ 235.166147] ? _raw_spin_unlock_bh+0xc0/0xc0
> > [ 235.166681] ? ceph_put_cap_refs+0x5a0/0x5a0
> > [ 235.167182] ? __delete_object+0xcd/0x100
> > [ 235.167636] ? put_object+0x40/0x40
> > [ 235.168033] ? up_read+0x20/0x20
> > [ 235.168396] ? ceph_write_iter+0x103b/0x1530
> > [ 235.168865] ? kmem_cache_free+0x80/0x1d0
> > [ 235.169252] ? ceph_write_iter+0x1047/0x1530
> > [ 235.169692] ? ceph_read_iter+0xc3e/0x10b0
> > [ 235.170097] ceph_read_iter+0xc3e/0x10b0
> > [ 235.170519] ? __tty_buffer_request_room+0xd0/0x320
> > [ 235.171138] ? ceph_write_iter+0x1530/0x1530
> > [ 235.171667] ? timerqueue_add+0xd2/0x100
> > [ 235.172239] ? enqueue_hrtimer+0xfb/0x1d0
> > [ 235.172789] ? hrtimer_reprogram+0x130/0x130
> > [ 235.173361] ? _raw_spin_unlock_bh+0xc0/0xc0
> > [ 235.173913] ? _raw_spin_unlock_bh+0xc0/0xc0
> > [ 235.174464] ? hrtimer_start_range_ns+0x32d/0x440
> > [ 235.174979] ? hrtimer_forward+0x120/0x120
> > [ 235.175394] ? hrtimer_try_to_cancel+0x83/0x250
> > [ 235.176033] ? hrtimer_active+0x240/0x240
> > [ 235.176630] ? __hrtimer_get_remaining+0xc6/0x100
> > [ 235.177285] ? hrtimer_start_range_ns+0x440/0x440
> > [ 235.177980] ? generic_file_splice_read+0x21b/0x320
> > [ 235.178694] ? ceph_write_iter+0x1530/0x1530
> > [ 235.179325] generic_file_splice_read+0x21b/0x320
> > [ 235.180011] ? splice_shrink_spd+0x40/0x40
> > [ 235.180619] ? schedule+0x50/0x280
> > [ 235.181126] ? rw_verify_area+0x1f/0x70
> > [ 235.181641] ? do_splice_to+0x6a/0xc0
> > [ 235.182075] SyS_splice+0x914/0x9a0
> > [ 235.182405] ? __task_pid_nr_ns+0x1a9/0x1c0
> > [ 235.182825] ? compat_SyS_vmsplice+0x100/0x100
> > [ 235.183266] ? do_syscall_64+0x8d/0x300
> > [ 235.183701] ? compat_SyS_vmsplice+0x100/0x100
> > [ 235.184223] do_syscall_64+0x164/0x300
> > [ 235.184662] ? syscall_return_slowpath+0x1c0/0x1c0
> > [ 235.185213] ? page_fault+0x25/0x50
> > [ 235.185616] ? syscall_return_slowpath+0x13c/0x1c0
> > [ 235.186124] ? prepare_exit_to_usermode+0xdb/0x140
> > [ 235.186692] ? syscall_trace_enter+0x320/0x320
> > [ 235.187206] ? __put_user_4+0x1c/0x30
> > [ 235.187858] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [ 235.188726] RIP: 0033:0x7fd750f61229
> > [ 235.189312] RSP: 002b:00007fff1ad745c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
> > [ 235.190568] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd750f61229
> > [ 235.191687] RDX: 000000000000003d RSI: 0000000000000000 RDI: 0000000000000098
> > [ 235.192598] RBP: 00007fd751614000 R08: 0000000000fffff7 R09: 0000000000000001
> > [ 235.193528] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000113
> > [ 235.194443] R13: 00007fd751614048 R14: 00007fd75163ead8 R15: 00007fd751614000
> > [ 235.195364] Code: 4c 89 ff e8 f1 43 cf ff 8b 73 3c 44 89 e1 48 c7 c7 80 13 11 82 89 ea e8 87 8d b8 ff 4c 89 ef e8 d5 43 cf ff 8b 43 40 85 c0 75 13 <0f> 0b 48 83 c4 10 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 8d
> > [ 235.197769] ---[ end trace 0228543143d962c4 ]---
>
> Interesting. It really looks like trinity is thrashing the pipe in such a way to create races while we're trying to iterate over it? ceph is just calling pipe_get_pages, but something is hosed down inside the pipe tracking and it fails before we even get to doing anything with the pipe.
>
> I'm not that familiar with the pipe code, so I'm not sure what prevents that info in the pipe inode from changing while we're trying to sanity check it here.
>
> Does this testcase work properly on other filesystems?

I can't reproduce it on btrfs or ext4, but it's possible that it's just more difficult to trigger.
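
FWIW, the numbers in the dump do look inconsistent: if I'm reading sanity() right, with a non-zero iov_offset (4096) it expects idx to point at the last occupied pipe buffer, i.e. (curbuf + nrbufs - 1) & (buffers - 1) = (8 + 1 - 1) & 15 = 8, but idx is 7, which is what trips the WARN_ON at lib/iov_iter.c:350.

The trinity hits seem to boil down to splice()ing from an O_DIRECT cephfs file descriptor into a pipe, i.e. the generic_file_splice_read -> ceph_read_iter -> ceph_direct_read_write path in the trace above. A minimal reproducer sketch would look roughly like the code below (untested here; the path and length are made up, and this is not the exact call sequence trinity generates):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int pfd[2];
        /* assumes /mnt/cephfs/file already exists and is at least 64k */
        int fd = open("/mnt/cephfs/file", O_RDONLY | O_DIRECT);

        if (fd < 0 || pipe(pfd) < 0) {
                perror("setup");
                return 1;
        }

        /* splice from the O_DIRECT cephfs fd into the pipe: the read goes
         * through generic_file_splice_read() with an ITER_PIPE iterator
         * and ends up in ceph_direct_read_write() */
        if (splice(fd, NULL, pfd[1], NULL, 65536, 0) < 0)
                perror("splice");

        return 0;
}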
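
On the dirtying question discussed further up the thread: the fuse commit Zheng quoted and his ceph patch share the same idea, namely that on completion of a direct read only pages that came from a userspace iovec should be dirtied, since ITER_BVEC/ITER_KVEC pages belong to a kernel caller (e.g. the loop driver) and may already be locked by it. A rough sketch of that pattern (illustrative only, the names are made up and this is not the actual ceph patch):

#include <linux/mm.h>
#include <linux/uio.h>

/* Release the pages of a completed direct read, dirtying them only if
 * they were pinned from a userspace iovec. */
static void put_read_pages(struct page **pages, int num_pages,
                           bool should_dirty)
{
        int i;

        for (i = 0; i < num_pages; i++) {
                if (should_dirty)
                        set_page_dirty_lock(pages[i]);
                put_page(pages[i]);
        }
}

/*
 * ...with the flag computed in the direct-I/O path as something like:
 *
 *      bool should_dirty = !write && iter_is_iovec(iter);
 */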
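
And on iov_iter_get_bvecs{,_alloc}(): I haven't seen Ilya's __iter_get_bvecs beyond its appearance in the trace, but I'd imagine the generalized helper being shaped roughly like the sketch below, i.e. work on a truncated copy of the iterator and turn iov_iter_get_pages() results into bio_vecs. This is only my guess at the shape (the name, the page batch size and the limits are made up), not the wip-splice code:

#include <linux/bvec.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/uio.h>

/* Map up to maxsize bytes of *iter into at most max_bvecs bio_vecs,
 * advancing the caller's iterator by the number of bytes mapped.
 * The caller is responsible for put_page() on each bvec page. */
static ssize_t iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
                              struct bio_vec *bvecs, int max_bvecs)
{
        struct iov_iter i = *iter;      /* "make a copy of the iterator ... */
        ssize_t total = 0;
        int n = 0;

        iov_iter_truncate(&i, maxsize); /* ... and truncate it" */

        while (iov_iter_count(&i) && n < max_bvecs) {
                struct page *pages[16];
                size_t start, len;
                ssize_t bytes;
                int p, npages;

                bytes = iov_iter_get_pages(&i, pages, SIZE_MAX,
                                           min(max_bvecs - n, 16), &start);
                if (bytes < 0)
                        return total ? total : bytes;

                iov_iter_advance(&i, bytes);
                total += bytes;

                /* record each returned page as one bio_vec */
                npages = DIV_ROUND_UP(start + bytes, PAGE_SIZE);
                for (p = 0; p < npages; p++) {
                        len = min_t(size_t, bytes, PAGE_SIZE - start);
                        bvecs[n].bv_page = pages[p];
                        bvecs[n].bv_offset = start;
                        bvecs[n].bv_len = len;
                        bytes -= len;
                        start = 0;
                        n++;
                }
        }

        iov_iter_advance(iter, total);
        return total;
}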

Cheers,
--
Luís