Hi! On Fri 13-04-18 20:39:06, Gavin Guo wrote: > On Thu, Jan 4, 2018 at 7:33 PM, Jan Kara <jack@xxxxxxx> wrote: > > > > On Wed 03-01-18 20:56:32, Dan Williams wrote: > > > On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@xxxxxxx> wrote: > > > > Hello, > > > > > > > > Over the years I have seen so far unexplained crashed in filesystem's > > > > (ext4, xfs) writeback path due to dirty pages without buffers attached to > > > > them (see [1] and [2] for relatively recent reports). This was confusing as > > > > reclaim takes care not to strip buffers from a dirty page and both > > > > filesystems do add buffers to a page when it is first written to - in > > > > ->page_mkwrite() and ->write_begin callbacks. > > > > > > > > Recently I have come across a code path that is probably leading to this > > > > inconsistent state and I'd like to discuss how to best fix the problem > > > > because it's not obvious to me. Consider the following race: > > > > > > > > CPU1 CPU2 > > > > > > > > addr = mmap(file1, MAP_SHARED, ...); > > > > fd2 = open(file2, O_DIRECT | O_RDONLY); > > > > read(fd2, addr, len) > > > > do_direct_IO() > > > > page = dio_get_page() > > > > dio_refill_pages() > > > > iov_iter_get_pages() > > > > get_user_pages_fast() > > > > - page fault > > > > ->page_mkwrite() > > > > block_page_mkwrite() > > > > lock_page(page); > > > > - attaches buffers to page > > > > - makes sure blocks are allocated > > > > set_page_dirty(page) > > > > - install writeable PTE > > > > unlock_page(page); > > > > submit_page_section(page) > > > > - submits bio with 'page' as a buffer > > > > kswapd reclaims pages: > > > > ... > > > > shrink_page_list() > > > > trylock_page(page) - this is the > > > > page CPU1 has just faulted in > > > > try_to_unmap(page) > > > > pageout(page); > > > > clear_page_dirty_for_io(page); > > > > ->writepage() > > > > - let's assume page got written > > > > out fast enough, alternatively > > > > we could get to the same path as > > > > soon as the page IO completes > > > > if (page_has_private(page)) { > > > > try_to_release_page(page) > > > > - reclaims buffers from the > > > > page > > > > __remove_mapping(page) > > > > - fails as DIO code still > > > > holds page reference > > > > ... > > > > > > > > eventually read completes > > > > dio_bio_complete(bio) > > > > set_page_dirty_lock(page) > > > > Bummer, we've just marked the page as dirty without having buffers. > > > > Eventually writeback will find it and filesystem will complain... > > > > > > > > Am I missing something? > > > > > > > > The problem here is that filesystems fundamentally assume that a page can > > > > be written to only between ->write_begin - ->write_end (in this interval > > > > the page is locked), or between ->page_mkwrite - ->writepage and above is > > > > an example where this does not hold because when a page reference is > > > > acquired through get_user_pages(), page can get written to by the holder of > > > > the reference and dirtied even after it has been unmapped from page tables > > > > and ->writepage has been called. This is not only a cosmetic issue leading > > > > to assertion failure but it can also lead to data loss, data corruption, or > > > > other unpleasant surprises as filesystems assume page contents cannot be > > > > modified until either ->write_begin() or ->page_mkwrite gets called and > > > > those calls are serialized by proper locking with problematic operations > > > > such as hole punching etc. > > > > > > > > I'm not sure how to fix this problem. We could 'simulate' a writeable page > > > > fault in set_page_dirty_lock(). It is a bit ugly since we don't have a > > > > virtual address of the fault, don't hold mmap_sem, etc., possibly > > > > expensive, but it would make filesystems happy. Data stored by GUP user > > > > (e.g. read by DIO in the above case) could still get lost if someone e.g. > > > > punched hole under the buffer or otherwise messed with the underlying > > > > storage of the page while DIO was running but arguably users could expect > > > > such outcome. > > > > > > > > Another possible solution would be to make sure page is writeably mapped > > > > until GUP user drops its reference. That would be arguably cleaner but > > > > probably that would mean we have to track number of writeable GUP page > > > > references separately (no space space in struct page is a problem here) and > > > > block page_mkclean() until they are dropped. Also for long term GUP users > > > > like Infiniband or V4L we'd have to come up with some solution as we should > > > > not block page_mkclean() for so long. > > > > > > Do we need to block page_mkclean, or could we defer buffer reclaiming > > > to the last put of the page? > > > > As I wrote to Dave the problem is no so much with reclaiming of buffers but > > with the fact filesystems don't expect page can be dirtied after > > page_mkclean() is finished. > > > > > I think once we have the "register memory with lease" mechanism for > > > Infiniband we could expand it to the page cache case. The problem is > > > the regression this would cause with userspace that expects it can > > > maintain file backed memory registrations indefinitely. > > > > > > What are the implications of holding off page_mkclean or release > > > buffers indefinitely? > > > > Bad. You cannot write the page to disk until page_mkclean() finishes as > > page_mkclean() is part of clear_page_dirty_for_io(). And we really do need > > that functionality there e.g. to make sure tail of the last page in the > > file is properly zeroed out, storage with DIF/DIX can compute checksum of > > the data safely before submitting it to the device etc. > > > > > Is an indefinite / interruptible sleep waiting for the 'put' event of > > > a get_user_pages() page unacceptable? The current case that the file > > > contents will not be coherent with respect to in-flight RDMA, perhaps > > > waiting for that to complete is better than cleaning buffers from the > > > page prematurely. > > > > Yeah, indefinite sleep is really a no-go. > > > > > > As a side note DAX needs some solution for GUP users as well. The problems > > > > are similar there in nature, just much easier to hit. So at least a > > > > solution for long-term GUP users can (and I strongly believe should) be > > > > shared between standard and DAX paths. > > > > > > In the DAX case we rely on the fact that when the page goes idle we > > > only need to worry about the filesytem block map changing, the page > > > won't get reallocated somewhere else. We can't use page idle as an > > > event in this case, however, if the page reference count is one then > > > the DIO code can know that it has the page exclusively, so maybe DAX > > > and non-DAX can share the page count == 1 event notification. > > > > The races I describe do not need to involve truncate / hole punching. It is > > just enough to race with page writeback. So page references are of no use > > here. We would have to specifically track number of references acquired by > > GUP or something like that. So what I wanted to share with DAX is the > > long-term pin handling, the rest is unclear for now. > > > > Honza > > -- > > Jan Kara <jack@xxxxxxxx> > > SUSE Labs, CR > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@xxxxxxxxx. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> > > The bug can be reliably reproduced in our platform with the > current upstream kernel(80aa76bcd364 Merge tag 'xfs-4.17-merge-4' of > git://git.kernel.org/pub/scm/fs/xfs/xfs-linux). I'm happy to > help to test and debug. The error message is as following: Thanks for report! So what workload do you run to trigger this? Is it just a direct IO read to a buffer in a shared file mapping or something else? Honza > kernel BUG at /home/gavin/work-kernel/fs/ext4/inode.c:2126! > invalid opcode: 0000 [#1] SMP PTI > Modules linked in: veth ipt_MASQUERADE nf_nat_masquerade_ipv4 > nf_conntrack_netlink nfnetlink xfrm_user iptable_nat nf_conntrack_ipv4 > nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat > nf_conntrack br_netfilter bridge stp llc overlay xt_multiport > iptable_filter ip_tables x_tables cachefiles fscache esp6_offload esp6 > esp4_offload esp4 xfrm_algo nls_iso8859_1 intel_rapl sb_edac > x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel > nvidia_uvm(POE) mxm_wmi kvm joydev input_leds irqbypass intel_cstate > intel_rapl_perf mei_me ipmi_si shpchp lpc_ich mei acpi_power_meter > mac_hid wmi ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp > libiscsi scsi_transport_iscsi ipmi_devintf sunrpc ipmi_msghandler > autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov > async_memcpy > async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 > multipath linear i2c_algo_bit nvidia_drm(POE) ses crct10dif_pclmul > crc32_pclmul nvidia_modeset(POE) ttm ghash_clmulni_intel enclosure > hid_generic uas pcbc scsi_transport_sas drm_kms_helper usbhid > aesni_intel hid aes_x86_64 usb_storage syscopyarea crypto_simd > sysfillrect mlx5_core cryptd nvidia(POE) sysimgblt glue_helper ixgbe > mlxfw megaraid_sas fb_sys_fops devlink dca ahci ptp drm libahci > pps_core mdio > CPU: 54 PID: 8938 Comm: kworker/u161:0 Tainted: P OE > 4.16.0-999-generic #201804102200 > > Workqueue: writeback wb_workfn (flush-8:0) > RIP: 0010:ext4_writepage+0x318/0x770 > RSP: 0018:ffffb514e76cb7f8 EFLAGS: 00010246 > RAX: 00500b4e8000026d RBX: 0000000000001000 RCX: ffff8c2cafba5000 > RDX: ffff8bec4069a020 RSI: ffffb514e76cbc28 RDI: ffffe91efcc8bd80 > RBP: ffffb514e76cb870 R08: 0000000000028115 R09: 00000000000280c0 > R10: 0000000000000002 R11: ffff8c2dbffd4000 R12: ffffe91efcc8bd80 > R13: ffff8bec40699ea8 R14: ffffb514e76cbc28 R15: ffffe91efcc8bd80 > FS: 0000000000000000(0000) GS:ffff8becbfc80000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007ffcc8f6f080 CR3: 0000004b9ce0a006 CR4: 00000000003606e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > ? rmap_walk+0x41/0x60 > ? page_mkclean+0x9f/0xb0 > ? invalid_page_referenced_vma+0x80/0x80 > __writepage+0x17/0x50 > write_cache_pages+0x228/0x4a0 > ? __wb_calc_thresh+0x140/0x140 > generic_writepages+0x61/0xa0 > ? _cond_resched+0x1a/0x50 > ? write_cache_pages+0x396/0x4a0 > ext4_writepages+0x1fc/0xe00 > ? ext4_writepages+0x1fc/0xe00 > ? generic_writepages+0x6d/0xa0 > ? fprop_fraction_percpu+0x2f/0x80 > do_writepages+0x1c/0x60 > ? do_writepages+0x1c/0x60 > __writeback_single_inode+0x45/0x320 > writeback_sb_inodes+0x266/0x580 > __writeback_inodes_wb+0x92/0xc0 > wb_writeback+0x282/0x310 > wb_workfn+0x1a3/0x440 > ? wb_workfn+0x1a3/0x440 > process_one_work+0x1db/0x3c0 > worker_thread+0x4b/0x420 > kthread+0x102/0x140 > ? rescuer_thread+0x380/0x380 > ? kthread_create_worker_on_cpu+0x70/0x70 > ret_from_fork+0x35/0x40 > Code: ff f6 c4 08 0f 85 58 ff ff ff e8 68 56 00 00 ba 00 10 00 00 31 > f6 41 bd fb ff ff ff e8 a2 9e ff ff 4c 89 e7 e8 fa 39 ea ff eb 8a <0f> > 0b c6 45 a8 00 e9 f0 fd ff ff 49 83 7c 24 10 00 0f 85 c7 03 > RIP: ext4_writepage+0x318/0x770 RSP: ffffb514e76cb7f8 > ---[ end trace 59d4e1a4b221404b ]--- -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html