Re: [PATCH] ceph: only dirty ITER_IOVEC pages for direct read

iter_get_bvecs_alloc() copies the iov_iter, and iov_iter_advance() ends up getting called on both the old and the new iov_iter.  I think it's pipe_advance() that does the bad thing.
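
Roughly the pattern I mean (untested sketch; the maxsize/bvecs parameters are
just for illustration, __iter_get_bvecs is the helper from the wip-splice
patch):

	struct iov_iter tmp = *iter;	/* the helper works on a copy */

	/* the copy gets advanced while the bvecs are collected ... */
	bytes = __iter_get_bvecs(&tmp, maxsize, bvecs);

	/* ... and later the caller (ceph_direct_read_write()) advances
	 * the original as well */
	iov_iter_advance(iter, bytes);

For an ITER_PIPE iterator both of them point at the same pipe_inode_info, so
I suspect pipe_advance() ends up consuming the same pipe buffers twice.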

Regards
Yan, Zheng


> On Apr 3, 2018, at 20:46, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Tue, 2018-04-03 at 09:16 +0100, Luis Henriques wrote:
>> On Mon, Apr 02, 2018 at 12:27:28PM -0400, Jeff Layton wrote:
>>> On Mon, 2018-04-02 at 15:36 +0100, Luis Henriques wrote:
>>>> On Sun, Apr 01, 2018 at 06:33:31PM +0200, Ilya Dryomov wrote:
>>>>> On Fri, Mar 30, 2018 at 9:14 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>>> On Fri, 2018-03-30 at 11:24 +0100, Luis Henriques wrote:
>>>>>>> (CC'ing Jeff)
>>>>>>> 
>>>>>>> Ilya Dryomov <idryomov@xxxxxxxxx> writes:
>>>>>>> 
>>>>>>>> On Fri, Mar 30, 2018 at 9:26 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 30 Mar 2018, at 15:20, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Fri, Mar 30, 2018 at 5:21 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 30 Mar 2018, at 00:49, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Mar 16, 2018 at 4:32 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Zheng,
>>>>>>>>>>>> 
>>>>>>>>>>>> Can you explain how this fixes the invalid memory access reported in
>>>>>>>>>>>> "KASAN: use-after-free Read in set_page_dirty_lock"?  Did you come up
>>>>>>>>>>>> with a different reproducer?
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I don't know why KASAN reports a use-after-free (it looks like a
>>>>>>>>>>> false alarm) in this case. I compared the code with other filesystems
>>>>>>>>>>> and found that recent commits make them dirty only ITER_IOVEC pages.
>>>>>>>>>>> The change should also silence KASAN.
>>>>>>>>>> 
>>>>>>>>>> Did you manage to reproduce that KASAN report?
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> no
>>>>>>>>> 
>>>>>>>>>> If it is a false positive, the patch shouldn't have been marked for
>>>>>>>>>> stable.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I CCed stable because fuse did the same
>>>>>>>>> 
>>>>>>>>> commit 8fba54aebbdf1f999738121922e74bf796ad60ee
>>>>>>>>> Author: Miklos Szeredi <mszeredi@xxxxxxxxxx>
>>>>>>>>> Date:   Wed Aug 24 18:17:04 2016 +0200
>>>>>>>>> 
>>>>>>>>>    fuse: direct-io: don't dirty ITER_BVEC pages
>>>>>>>>> 
>>>>>>>>>    When reading from a loop device backed by a fuse file it deadlocks on
>>>>>>>>>    lock_page().
>>>>>>>>> 
>>>>>>>>>    This is because the page is already locked by the read() operation done on
>>>>>>>>>    the loop device.  In this case we don't want to either lock the page or
>>>>>>>>>    dirty it.
>>>>>>>>> 
>>>>>>>>>    So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.
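>>>>>>>>> 
>>>>>>>>> The pattern fuse and this ceph patch both follow is roughly (simplified,
>>>>>>>>> not the exact hunk):
>>>>>>>>> 
>>>>>>>>>     bool should_dirty = !write && iter_is_iovec(iter);
>>>>>>>>>     ...
>>>>>>>>>     if (should_dirty)
>>>>>>>>>         set_page_dirty_lock(page);
>>>>>>>>>     put_page(page);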
>>>>>>>> 
>>>>>>>> OK, so it is a potential deadlock which has nothing to do with that
>>>>>>>> KASAN report.  I have rewritten the changelog to reflect that, please
>>>>>>>> take a look:
>>>>>>>> 
>>>>>>>>  https://github.com/ceph/ceph-client/commit/85784f9395987a422fa04263e7c0fb13da11eb5c
>>>>>>> 
>>>>>>> Just as a side note, it's still trivial to get a cephfs-related KASAN
>>>>>>> report by running a fuzzer (trinity in my case) against a mainline
>>>>>>> kernel 4.16.0-rc7 with this fix backported.
>>>>>>> 
>>>>>> 
>>>>>> I suspect trinity is clobbering the array
>>>>>>> As Jeff mentioned in a different thread, splice syscall is broken on
>>>>>>> cephfs and the fix for it is still tagged as "DO NOT MERGE" in the
>>>>>>> ceph-client testing branch.  I still think there's some bug in Jeff's
>>>>>>> fix as I still see a crash occasionally, but I haven't yet finished
>>>>>>> debugging it.  Unfortunately, Jeff's fix is probably not appropriate for
>>>>>>> stable kernels (but maybe I'm wrong).
>>>>>>> 
>>>>>>> Cheers,
>>>>>> 
>>>>>> (cc'ing Al)
>>>>>> 
>>>>>> Yeah, I should have revisited this a while ago. So the deal is that I
>>>>>> posted this patch upstream around a year ago, but Al didn't really like
>>>>>> it:
>>>>>> 
>>>>>>    https://patchwork.kernel.org/patch/9541829/
>>>>>> 
>>>>>> He wanted to add some iov_iter_for_each_page infrastructure and base
>>>>>> the new function on that. I had thought he had something queued up
>>>>>> along those lines at one point, but I don't see anything in mainline so
>>>>>> far.
>>>>>> 
>>>>>> There is an iov_iter_for_each_range, but it doesn't handle userland
>>>>>> iovecs and doesn't seem to care about alignment. I think we'll need
>>>>>> both here.
>>>>>> 
>>>>>> I'm not sure however that a per-page callback is really that helpful
>>>>>> for callers like ceph_direct_read_write. We'll just end up having to do
>>>>>> more work in the fs to batch up the pages before we add the op to the
>>>>>> req.
>>>>>> 
>>>>>> iov_iter_get_pages_alloc is really sort of what we want in this case,
>>>>>> and we want that function to stitch together longer arrays when it can.
>>>>>> 
>>>>>> Al, do you have any thoughts on what we should do here?
>>>>> 
>>>>> I think this can be fixed entirely in ceph by walking away from page
>>>>> vectors (barely tested):
>>>>> 
>>>>>  https://github.com/ceph/ceph-client/commits/wip-splice
>>>>> 
>>>>> If folks would like to see something like iov_iter_get_bvecs{,_alloc}()
>>>>> in iov_iter.c, I can work on generalizing my helper (i.e. add maxbvecs
>>>>> and maxsize parameters -- I didn't need maxbvecs and maxsize is just
>>>>> "make a copy of the iterator and truncate it").  Otherwise we will
>>>>> probably merge it into ceph, as this code has remained broken for way
>>>>> too long...
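>>>>> 
>>>>> Something along these lines (untested; the signature is just a strawman):
>>>>> 
>>>>>     /* grab up to maxsize bytes from iter into an allocated bvec array */
>>>>>     ssize_t iov_iter_get_bvecs_alloc(struct iov_iter *iter, size_t maxsize,
>>>>>                                      struct bio_vec **bvecs, int *num_bvecs);
>>>>> 
>>>>> with maxsize handled by copying and truncating the iterator:
>>>>> 
>>>>>     struct iov_iter tmp = *iter;
>>>>> 
>>>>>     iov_iter_truncate(&tmp, maxsize);
>>>>>     /* walk tmp with iov_iter_get_pages() + iov_iter_advance() ... */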
>>>>> 
>>>>> Attached is the main patch from wip-splice.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>>                Ilya
>>>> 
>>>> I haven't (yet) spent much time looking at the code, but a quick test
>>>> shows me the below failure in the sanity() tests from lib/iov_iter.c.
>>>> 
>>>> Cheers,
>>>> --
>>>> Luís
>>>> 
>>>> 
>>>> [  235.134588] idx = 7, offset = 4096
>>>> [  235.134917] curbuf = 8, nrbufs = 1, buffers = 16
>>>> [  235.135388] [          (null) 00000000a5376b82 0 4096]
>>>> [  235.135914] [          (null) 00000000fb52416a 0 4096]
>>>> [  235.136421] [          (null) 0000000009c7c0fc 0 4096]
>>>> [  235.136901] [          (null) 000000008b9911f0 0 4096]
>>>> [  235.137411] [          (null) 0000000042e4a3e2 0 4096]
>>>> [  235.137879] [          (null) 00000000110f4fc8 0 4096]
>>>> [  235.138358] [          (null) 00000000e16bdd11 0 4096]
>>>> [  235.138910] [          (null) 000000007d61a8f6 0 4096]
>>>> [  235.139399] [00000000af9d005b 000000003b50ae93 0 4096]
>>>> [  235.139879] [          (null) 00000000c55231a4 0 4096]
>>>> [  235.140378] [          (null) 00000000eba243d2 0 4096]
>>>> [  235.140851] [          (null) 00000000f5617952 0 4096]
>>>> [  235.141344] [          (null) 00000000fec2c691 0 4096]
>>>> [  235.141836] [          (null) 00000000ca09f9a4 0 4096]
>>>> [  235.142360] [          (null) 0000000087cfbb92 0 4096]
>>>> [  235.142886] [          (null) 000000009432c839 0 4096]
>>>> [  235.143441] WARNING: CPU: 1 PID: 261 at lib/iov_iter.c:350 sanity+0x162/0x210
>>>> [  235.144208] CPU: 1 PID: 261 Comm: trinity-c5 Not tainted 4.16.0-rc7+ #52
>>>> [  235.144942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>>> [  235.146057] RIP: 0010:sanity+0x162/0x210
>>>> [  235.146509] RSP: 0018:ffff8800656970e8 EFLAGS: 00010246
>>>> [  235.147078] RAX: 0000000000000000 RBX: ffff88006852c868 RCX: ffffffff81589cb3
>>>> [  235.147841] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff88006852c8a8
>>>> [  235.148636] RBP: 0000000000000010 R08: fffffbfff059aa08 R09: ffff88006852c8e0
>>>> [  235.149470] R10: 0000000000000003 R11: fffffbfff059aa07 R12: ffff88006406dd38
>>>> [  235.150340] R13: ffff88006852c8a8 R14: 0000000000000000 R15: 0000000000001000
>>>> [  235.151184] FS:  00007fd75163eb40(0000) GS:ffff88006d700000(0000) knlGS:0000000000000000
>>>> [  235.152028] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  235.152562] CR2: 00007fd751501000 CR3: 000000006a940000 CR4: 00000000000006a0
>>>> [  235.153232] DR0: 00007fd7514fe000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [  235.153887] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000f0602
>>>> [  235.154644] Call Trace:
>>>> [  235.154914]  iov_iter_get_pages+0x2a3/0x680
>>>> [  235.155377]  ? iov_iter_gap_alignment+0x490/0x490
>>>> [  235.155882]  ? iov_iter_advance+0x360/0x780
>>>> [  235.156334]  ? generic_file_splice_read+0x21b/0x320
>>>> [  235.156993]  ? SyS_splice+0x914/0x9a0
>>>> [  235.157498]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>> [  235.158177]  ? iov_iter_copy_from_user_atomic+0x6e0/0x6e0
>>>> [  235.158957]  ? kmemleak_disable+0x70/0x70
>>>> [  235.159456]  __iter_get_bvecs+0x117/0x290
>>>> [  235.159955]  ? ceph_fallocate+0xc10/0xc10
>>>> [  235.160455]  ? iov_iter_npages+0x309/0x540
>>>> [  235.160974]  ? __kmalloc+0xf9/0x1f0
>>>> [  235.161410]  ceph_direct_read_write+0x4e7/0x1380
>>>> [  235.161981]  ? ceph_aio_complete_req+0x750/0x750
>>>> [  235.162503]  ? ceph_write_iter+0x103b/0x1530
>>>> [  235.162986]  ? __rb_erase_color+0xd50/0xd50
>>>> [  235.163449]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>> [  235.164043]  ? ceph_fscache_revalidate_cookie+0x80/0x1d0
>>>> [  235.164630]  ? ceph_fscache_unregister_fs+0x1f0/0x1f0
>>>> [  235.165201]  ? __ceph_mark_dirty_caps+0x212/0x610
>>>> [  235.165713]  ? ceph_get_caps+0x2bc/0x700
>>>> [  235.166147]  ? _raw_spin_unlock_bh+0xc0/0xc0
>>>> [  235.166681]  ? ceph_put_cap_refs+0x5a0/0x5a0
>>>> [  235.167182]  ? __delete_object+0xcd/0x100
>>>> [  235.167636]  ? put_object+0x40/0x40
>>>> [  235.168033]  ? up_read+0x20/0x20
>>>> [  235.168396]  ? ceph_write_iter+0x103b/0x1530
>>>> [  235.168865]  ? kmem_cache_free+0x80/0x1d0
>>>> [  235.169252]  ? ceph_write_iter+0x1047/0x1530
>>>> [  235.169692]  ? ceph_read_iter+0xc3e/0x10b0
>>>> [  235.170097]  ceph_read_iter+0xc3e/0x10b0
>>>> [  235.170519]  ? __tty_buffer_request_room+0xd0/0x320
>>>> [  235.171138]  ? ceph_write_iter+0x1530/0x1530
>>>> [  235.171667]  ? timerqueue_add+0xd2/0x100
>>>> [  235.172239]  ? enqueue_hrtimer+0xfb/0x1d0
>>>> [  235.172789]  ? hrtimer_reprogram+0x130/0x130
>>>> [  235.173361]  ? _raw_spin_unlock_bh+0xc0/0xc0
>>>> [  235.173913]  ? _raw_spin_unlock_bh+0xc0/0xc0
>>>> [  235.174464]  ? hrtimer_start_range_ns+0x32d/0x440
>>>> [  235.174979]  ? hrtimer_forward+0x120/0x120
>>>> [  235.175394]  ? hrtimer_try_to_cancel+0x83/0x250
>>>> [  235.176033]  ? hrtimer_active+0x240/0x240
>>>> [  235.176630]  ? __hrtimer_get_remaining+0xc6/0x100
>>>> [  235.177285]  ? hrtimer_start_range_ns+0x440/0x440
>>>> [  235.177980]  ? generic_file_splice_read+0x21b/0x320
>>>> [  235.178694]  ? ceph_write_iter+0x1530/0x1530
>>>> [  235.179325]  generic_file_splice_read+0x21b/0x320
>>>> [  235.180011]  ? splice_shrink_spd+0x40/0x40
>>>> [  235.180619]  ? schedule+0x50/0x280
>>>> [  235.181126]  ? rw_verify_area+0x1f/0x70
>>>> [  235.181641]  ? do_splice_to+0x6a/0xc0
>>>> [  235.182075]  SyS_splice+0x914/0x9a0
>>>> [  235.182405]  ? __task_pid_nr_ns+0x1a9/0x1c0
>>>> [  235.182825]  ? compat_SyS_vmsplice+0x100/0x100
>>>> [  235.183266]  ? do_syscall_64+0x8d/0x300
>>>> [  235.183701]  ? compat_SyS_vmsplice+0x100/0x100
>>>> [  235.184223]  do_syscall_64+0x164/0x300
>>>> [  235.184662]  ? syscall_return_slowpath+0x1c0/0x1c0
>>>> [  235.185213]  ? page_fault+0x25/0x50
>>>> [  235.185616]  ? syscall_return_slowpath+0x13c/0x1c0
>>>> [  235.186124]  ? prepare_exit_to_usermode+0xdb/0x140
>>>> [  235.186692]  ? syscall_trace_enter+0x320/0x320
>>>> [  235.187206]  ? __put_user_4+0x1c/0x30
>>>> [  235.187858]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>> [  235.188726] RIP: 0033:0x7fd750f61229
>>>> [  235.189312] RSP: 002b:00007fff1ad745c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
>>>> [  235.190568] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd750f61229
>>>> [  235.191687] RDX: 000000000000003d RSI: 0000000000000000 RDI: 0000000000000098
>>>> [  235.192598] RBP: 00007fd751614000 R08: 0000000000fffff7 R09: 0000000000000001
>>>> [  235.193528] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000113
>>>> [  235.194443] R13: 00007fd751614048 R14: 00007fd75163ead8 R15: 00007fd751614000
>>>> [  235.195364] Code: 4c 89 ff e8 f1 43 cf ff 8b 73 3c 44 89 e1 48 c7 c7 80 13 11 82 89 ea e8 87 8d b8 ff 4c 89 ef e8 d5 43 cf ff 8b 43 40 85 c0 75 13 <0f> 0b 48 83 c4 10 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 8d 
>>>> [  235.197769] ---[ end trace 0228543143d962c4 ]---
>>> 
>>> Interesting. It really looks like trinity is thrashing the pipe in such
>>> a way as to create races while we're trying to iterate over it? ceph is
>>> just calling pipe_get_pages, but something is hosed down inside the pipe
>>> tracking and it fails before we even get to doing anything with the
>>> pipe.
>>> 
>>> I'm not that familiar with the pipe code, so I'm not sure what prevents
>>> that info in the pipe inode from changing while we're trying to sanity
>>> check it here.
>>> 
> 
> Looks like the pipe_lock is what protects it, and the splice code takes
> that at a fairly high level. I think the problem is elsewhere.
> 
>>> Does this testcase work properly on other filesystems?
>> 
>> I can't reproduce it on btrfs or ext4, but it's possible that it's just
>> more difficult to trigger.
>> 
> 
> I can reproduce what you're seeing with trinity. It fails almost
> immediately for me on cephfs, but runs more or less indefinitely on xfs.
> That suggests that there is something wrong with what's currently in
> ceph-client.
> 
> What's there _does_ seem to fix xfstests generic/095, which implies that
> there's some raciness involved.
> 
> One interesting bit:
> 
>    [  235.134588] idx = 7, offset = 4096
>    [  235.134917] curbuf = 8, nrbufs = 1, buffers = 16
> 
> The offset always seems to be 4096 when I see this problem. I'm not sure
> if that's significant yet.
> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>



