Re: Filesystem crashes due to pages without buffers

Jan Kara <jack@xxxxxxx> · Fri, 13 Apr 2018 14:58:15 +0200

Hi!

On Fri 13-04-18 20:39:06, Gavin Guo wrote:
> On Thu, Jan 4, 2018 at 7:33 PM, Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Wed 03-01-18 20:56:32, Dan Williams wrote:
> > > On Wed, Jan 3, 2018 at 2:04 AM, Jan Kara <jack@xxxxxxx> wrote:
> > > > Hello,
> > > >
> > > > Over the years I have seen so far unexplained crashed in filesystem's
> > > > (ext4, xfs) writeback path due to dirty pages without buffers attached to
> > > > them (see [1] and [2] for relatively recent reports). This was confusing as
> > > > reclaim takes care not to strip buffers from a dirty page and both
> > > > filesystems do add buffers to a page when it is first written to - in
> > > > ->page_mkwrite() and ->write_begin callbacks.
> > > >
> > > > Recently I have come across a code path that is probably leading to this
> > > > inconsistent state and I'd like to discuss how to best fix the problem
> > > > because it's not obvious to me. Consider the following race:
> > > >
> > > > CPU1                                    CPU2
> > > >
> > > > addr = mmap(file1, MAP_SHARED, ...);
> > > > fd2 = open(file2, O_DIRECT | O_RDONLY);
> > > > read(fd2, addr, len)
> > > >   do_direct_IO()
> > > >     page = dio_get_page()
> > > >       dio_refill_pages()
> > > >         iov_iter_get_pages()
> > > >           get_user_pages_fast()
> > > >             - page fault
> > > >               ->page_mkwrite()
> > > >                 block_page_mkwrite()
> > > >                   lock_page(page);
> > > >                   - attaches buffers to page
> > > >                   - makes sure blocks are allocated
> > > >                   set_page_dirty(page)
> > > >               - install writeable PTE
> > > >               unlock_page(page);
> > > >     submit_page_section(page)
> > > >       - submits bio with 'page' as a buffer
> > > >                                         kswapd reclaims pages:
> > > >                                         ...
> > > >                                         shrink_page_list()
> > > >                                           trylock_page(page) - this is the
> > > >                                             page CPU1 has just faulted in
> > > >                                           try_to_unmap(page)
> > > >                                           pageout(page);
> > > >                                             clear_page_dirty_for_io(page);
> > > >                                             ->writepage()
> > > >                                           - let's assume page got written
> > > >                                             out fast enough, alternatively
> > > >                                             we could get to the same path as
> > > >                                             soon as the page IO completes
> > > >                                           if (page_has_private(page)) {
> > > >                                             try_to_release_page(page)
> > > >                                               - reclaims buffers from the
> > > >                                                 page
> > > >                                            __remove_mapping(page)
> > > >                                              - fails as DIO code still
> > > >                                                holds page reference
> > > > ...
> > > >
> > > > eventually read completes
> > > >   dio_bio_complete(bio)
> > > >     set_page_dirty_lock(page)
> > > >       Bummer, we've just marked the page as dirty without having buffers.
> > > >       Eventually writeback will find it and filesystem will complain...
> > > >
> > > > Am I missing something?
> > > >
> > > > The problem here is that filesystems fundamentally assume that a page can
> > > > be written to only between ->write_begin - ->write_end (in this interval
> > > > the page is locked), or between ->page_mkwrite - ->writepage and above is
> > > > an example where this does not hold because when a page reference is
> > > > acquired through get_user_pages(), page can get written to by the holder of
> > > > the reference and dirtied even after it has been unmapped from page tables
> > > > and ->writepage has been called. This is not only a cosmetic issue leading
> > > > to assertion failure but it can also lead to data loss, data corruption, or
> > > > other unpleasant surprises as filesystems assume page contents cannot be
> > > > modified until either ->write_begin() or ->page_mkwrite gets called and
> > > > those calls are serialized by proper locking with problematic operations
> > > > such as hole punching etc.
> > > >
> > > > I'm not sure how to fix this problem. We could 'simulate' a writeable page
> > > > fault in set_page_dirty_lock(). It is a bit ugly since we don't have a
> > > > virtual address of the fault, don't hold mmap_sem, etc., possibly
> > > > expensive, but it would make filesystems happy. Data stored by GUP user
> > > > (e.g. read by DIO in the above case) could still get lost if someone e.g.
> > > > punched hole under the buffer or otherwise messed with the underlying
> > > > storage of the page while DIO was running but arguably users could expect
> > > > such outcome.
> > > >
> > > > Another possible solution would be to make sure page is writeably mapped
> > > > until GUP user drops its reference. That would be arguably cleaner but
> > > > probably that would mean we have to track number of writeable GUP page
> > > > references separately (no space space in struct page is a problem here) and
> > > > block page_mkclean() until they are dropped. Also for long term GUP users
> > > > like Infiniband or V4L we'd have to come up with some solution as we should
> > > > not block page_mkclean() for so long.
> > >
> > > Do we need to block page_mkclean, or could we defer buffer reclaiming
> > > to the last put of the page?
> >
> > As I wrote to Dave the problem is no so much with reclaiming of buffers but
> > with the fact filesystems don't expect page can be dirtied after
> > page_mkclean() is finished.
> >
> > > I think once we have the "register memory with lease" mechanism for
> > > Infiniband we could expand it to the page cache case. The problem is
> > > the regression this would cause with userspace that expects it can
> > > maintain file backed memory registrations indefinitely.
> > >
> > > What are the implications of holding off page_mkclean or release
> > > buffers indefinitely?
> >
> > Bad. You cannot write the page to disk until page_mkclean() finishes as
> > page_mkclean() is part of clear_page_dirty_for_io(). And we really do need
> > that functionality there e.g. to make sure tail of the last page in the
> > file is properly zeroed out, storage with DIF/DIX can compute checksum of
> > the data safely before submitting it to the device etc.
> >
> > > Is an indefinite / interruptible sleep waiting for the 'put' event of
> > > a get_user_pages() page unacceptable? The current case that the file
> > > contents will not be coherent with respect to in-flight RDMA, perhaps
> > > waiting for that to complete is better than cleaning buffers from the
> > > page prematurely.
> >
> > Yeah, indefinite sleep is really a no-go.
> >
> > > > As a side note DAX needs some solution for GUP users as well. The problems
> > > > are similar there in nature, just much easier to hit. So at least a
> > > > solution for long-term GUP users can (and I strongly believe should) be
> > > > shared between standard and DAX paths.
> > >
> > > In the DAX case we rely on the fact that when the page goes idle we
> > > only need to worry about the filesytem block map changing, the page
> > > won't get reallocated somewhere else. We can't use page idle as an
> > > event in this case, however, if the page reference count is one then
> > > the DIO code can know that it has the page exclusively, so maybe DAX
> > > and non-DAX can share the page count == 1 event notification.
> >
> > The races I describe do not need to involve truncate / hole punching. It is
> > just enough to race with page writeback. So page references are of no use
> > here. We would have to specifically track number of references acquired by
> > GUP or something like that. So what I wanted to share with DAX is the
> > long-term pin handling, the rest is unclear for now.
> >
> >                                                                 Honza
> > --
> > Jan Kara <jack@xxxxxxxx>
> > SUSE Labs, CR
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
> 
> The bug can be reliably reproduced in our platform with the
> current upstream kernel(80aa76bcd364 Merge tag 'xfs-4.17-merge-4' of
> git://git.kernel.org/pub/scm/fs/xfs/xfs-linux). I'm happy to
> help to test and debug. The error message is as following:

Thanks for report! So what workload do you run to trigger this? Is it just
a direct IO read to a buffer in a shared file mapping or something else?

								Honza

> kernel BUG at /home/gavin/work-kernel/fs/ext4/inode.c:2126!
> invalid opcode: 0000 [#1] SMP PTI
> Modules linked in: veth ipt_MASQUERADE nf_nat_masquerade_ipv4
> nf_conntrack_netlink nfnetlink xfrm_user iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat
> nf_conntrack br_netfilter bridge stp llc overlay xt_multiport
> iptable_filter ip_tables x_tables cachefiles fscache esp6_offload esp6
> esp4_offload esp4 xfrm_algo nls_iso8859_1 intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel
> nvidia_uvm(POE) mxm_wmi kvm joydev input_leds irqbypass intel_cstate
> intel_rapl_perf mei_me ipmi_si shpchp lpc_ich mei acpi_power_meter
> mac_hid wmi ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
> libiscsi scsi_transport_iscsi ipmi_devintf sunrpc ipmi_msghandler
> autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov
> async_memcpy
>  async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
> multipath linear i2c_algo_bit nvidia_drm(POE) ses crct10dif_pclmul
> crc32_pclmul nvidia_modeset(POE) ttm ghash_clmulni_intel enclosure
> hid_generic uas pcbc scsi_transport_sas drm_kms_helper usbhid
> aesni_intel hid aes_x86_64 usb_storage syscopyarea crypto_simd
> sysfillrect mlx5_core cryptd nvidia(POE) sysimgblt glue_helper ixgbe
> mlxfw megaraid_sas fb_sys_fops devlink dca ahci ptp drm libahci
> pps_core mdio
> CPU: 54 PID: 8938 Comm: kworker/u161:0 Tainted: P           OE
> 4.16.0-999-generic #201804102200
> 
> Workqueue: writeback wb_workfn (flush-8:0)
> RIP: 0010:ext4_writepage+0x318/0x770
> RSP: 0018:ffffb514e76cb7f8 EFLAGS: 00010246
> RAX: 00500b4e8000026d RBX: 0000000000001000 RCX: ffff8c2cafba5000
> RDX: ffff8bec4069a020 RSI: ffffb514e76cbc28 RDI: ffffe91efcc8bd80
> RBP: ffffb514e76cb870 R08: 0000000000028115 R09: 00000000000280c0
> R10: 0000000000000002 R11: ffff8c2dbffd4000 R12: ffffe91efcc8bd80
> R13: ffff8bec40699ea8 R14: ffffb514e76cbc28 R15: ffffe91efcc8bd80
> FS:  0000000000000000(0000) GS:ffff8becbfc80000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffcc8f6f080 CR3: 0000004b9ce0a006 CR4: 00000000003606e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  ? rmap_walk+0x41/0x60
>  ? page_mkclean+0x9f/0xb0
>  ? invalid_page_referenced_vma+0x80/0x80
>  __writepage+0x17/0x50
>  write_cache_pages+0x228/0x4a0
>  ? __wb_calc_thresh+0x140/0x140
>  generic_writepages+0x61/0xa0
>  ? _cond_resched+0x1a/0x50
>  ? write_cache_pages+0x396/0x4a0
>  ext4_writepages+0x1fc/0xe00
>  ? ext4_writepages+0x1fc/0xe00
>  ? generic_writepages+0x6d/0xa0
>  ? fprop_fraction_percpu+0x2f/0x80
>  do_writepages+0x1c/0x60
>  ? do_writepages+0x1c/0x60
>  __writeback_single_inode+0x45/0x320
>  writeback_sb_inodes+0x266/0x580
>  __writeback_inodes_wb+0x92/0xc0
>  wb_writeback+0x282/0x310
>  wb_workfn+0x1a3/0x440
>  ? wb_workfn+0x1a3/0x440
>  process_one_work+0x1db/0x3c0
>  worker_thread+0x4b/0x420
>  kthread+0x102/0x140
>  ? rescuer_thread+0x380/0x380
>  ? kthread_create_worker_on_cpu+0x70/0x70
>  ret_from_fork+0x35/0x40
> Code: ff f6 c4 08 0f 85 58 ff ff ff e8 68 56 00 00 ba 00 10 00 00 31
> f6 41 bd fb ff ff ff e8 a2 9e ff ff 4c 89 e7 e8 fa 39 ea ff eb 8a <0f>
> 0b c6 45 a8 00 e9 f0 fd ff ff 49 83 7c 24 10 00 0f 85 c7 03
> RIP: ext4_writepage+0x318/0x770 RSP: ffffb514e76cb7f8
> ---[ end trace 59d4e1a4b221404b ]---
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR