On Wed, Jan 03, 2018 at 11:04:30AM +0100, Jan Kara wrote:
> Hello,
>
> Over the years I have seen so far unexplained crashes in the filesystems'
> (ext4, xfs) writeback path due to dirty pages without buffers attached to
> them (see [1] and [2] for relatively recent reports). This was confusing, as
> reclaim takes care not to strip buffers from a dirty page, and both
> filesystems do add buffers to a page when it is first written to - in their
> ->page_mkwrite() and ->write_begin() callbacks.
>
> Recently I have come across a code path that probably leads to this
> inconsistent state, and I'd like to discuss how to best fix the problem
> because it's not obvious to me. Consider the following race:
>
> CPU1                                  CPU2
>
> addr = mmap(file1, MAP_SHARED, ...);
> fd2 = open(file2, O_DIRECT | O_RDONLY);
> read(fd2, addr, len)
>   do_direct_IO()
>     page = dio_get_page()
>       dio_refill_pages()
>         iov_iter_get_pages()
>           get_user_pages_fast()
>             - page fault
>             ->page_mkwrite()
>               block_page_mkwrite()
>                 lock_page(page);
>                 - attaches buffers to page
>                 - makes sure blocks are allocated
>               set_page_dirty(page)
>             - install writeable PTE
>             unlock_page(page);
>     submit_page_section(page)
>       - submits bio with 'page' as a buffer
>                                       kswapd reclaims pages:
>                                         ...
>                                         shrink_page_list()
>                                           trylock_page(page) - this is the
>                                             page CPU1 has just faulted in
>                                           try_to_unmap(page)
>                                           pageout(page);
>                                             clear_page_dirty_for_io(page);
>                                             ->writepage()
>                                             - let's assume page got written
>                                               out fast enough, alternatively
>                                               we could get to the same path
>                                               as soon as the page IO completes
>                                           if (page_has_private(page)) {
>                                             try_to_release_page(page)
>                                               - reclaims buffers from the
>                                                 page
>                                           __remove_mapping(page)
>                                             - fails as DIO code still
>                                               holds page reference
>                                           ...
>
> eventually read completes
>   dio_bio_complete(bio)
>     set_page_dirty_lock(page)
>
> Bummer, we've just marked the page as dirty without it having buffers.
> Eventually writeback will find it and the filesystem will complain...
>
> Am I missing something?
>
> The problem here is that filesystems fundamentally assume that a page can
> be written to only between ->write_begin and ->write_end (in this interval
> the page is locked), or between ->page_mkwrite and ->writepage, and the
> above is an example where this does not hold: once a page reference is
> acquired through get_user_pages(), the page can be written to and dirtied
> by the holder of the reference even after it has been unmapped from the
> page tables and ->writepage has been called. This is not only a cosmetic
> issue leading to an assertion failure; it can also lead to data loss, data
> corruption, or other unpleasant surprises, as filesystems assume page
> contents cannot be modified until either ->write_begin() or ->page_mkwrite()
> gets called, and those calls are serialized by proper locking against
> problematic operations such as hole punching etc.
>
> I'm not sure how to fix this problem. We could 'simulate' a writeable page
> fault in set_page_dirty_lock(). It is a bit ugly since we don't have the
> virtual address of the fault, don't hold mmap_sem, etc., and it is possibly
> expensive, but it would make filesystems happy. Data stored by a GUP user
> (e.g. read by DIO in the above case) could still get lost if someone e.g.
> punched a hole under the buffer or otherwise messed with the underlying
> storage of the page while DIO was running, but arguably users could expect
> such an outcome.
>
> Another possible solution would be to make sure the page stays writeably
> mapped until the GUP user drops its reference.
> That would be arguably cleaner, but it would probably mean we have to track
> the number of writeable GUP page references separately (lack of space in
> struct page is a problem here) and block page_mkclean() until they are
> dropped. Also, for long-term GUP users like Infiniband or V4L we'd have to
> come up with some other solution, as we should not block page_mkclean() for
> that long.
>
> As a side note, DAX needs some solution for GUP users as well. The problems
> there are similar in nature, just much easier to hit. So at least a
> solution for long-term GUP users can (and I strongly believe should) be
> shared between the standard and DAX paths.
>
> Does anybody have other ideas how to fix the problem, opinions on which
> solution would be better to use, or complications I have missed?

+RDMA

Hi Jan,

I don't have an actual proposal for how to fix this, but I wanted to mention
that we have a customer who hits these failures in his setup. In his case it
is reproducible in 100% of cases within approximately two minutes of running.
His application creates two memory regions with ib_umem_get(), one backed by
ext4 and the other anonymous. After roughly two minutes of data traffic, he
stops the system and releases those memory regions via
ib_umem_release() -> __ib_umem_release() -> set_page_dirty_lock(). A couple
of seconds later he hits the following BUG_ON:

[ 1411.545311] ------------[ cut here ]------------
[ 1411.545340] kernel BUG at fs/ext4/inode.c:2297!
[ 1411.545360] invalid opcode: 0000 [#1] SMP
[ 1411.545381] Modules linked in: xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay(T) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) devlink mlx_compat(OE) intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel vfat fat kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul ext4 glue_helper ablk_helper cryptd mbcache jbd2 iTCO_wdt iTCO_vendor_support mxm_wmi pcspkr sb_edac edac_core i2c_i801 sg mei_me mei shpchp lpc_ich ipmi_devintf ipmi_si ipmi_msghandler acpi_pad wmi acpi_power_meter knem(OE) nfsd auth_rpcgss
[ 1411.545744]  nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ast drm_kms_helper crct10dif_pclmul crct10dif_common syscopyarea sysfillrect crc32c_intel mpt3sas sysimgblt fb_sys_fops ttm ahci raid_class libahci scsi_transport_sas igb drm libata dca i2c_algo_bit ptp nvme i2c_core pps_core fjes dm_mirror dm_region_hash dm_log dm_mod [last unloaded: devlink]
[ 1411.545926] CPU: 6 PID: 13195 Comm: node_runner_w8 Tainted: G        W  OE ------------ T 3.10.0-514.21.1.el7.debug_bz1368895.x86_64 #1
[ 1411.545975] Hardware name: Quanta Computer Inc D51BP-1U (dual 1G LoM)/S2BP-MB (dual 1G LoM), BIOS S2BP3B04 03/03/2016
[ 1411.546017] task: ffff881e4b323ec0 ti: ffff881e49bbc000 task.ti: ffff881e49bbc000
[ 1411.546047] RIP: 0010:[<ffffffffa07083e5>]  [<ffffffffa07083e5>] mpage_prepare_extent_to_map+0x2d5/0x2e0 [ext4]
[ 1411.546103] RSP: 0018:ffff881e49bbfc10  EFLAGS: 00010246
[ 1411.546125] RAX: 001fffff0000003d RBX: ffff881e49bbfc68 RCX: 0000000000000170
[ 1411.546154] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88207ff8dde8
[ 1411.546183] RBP: ffff881e49bbfce8 R08: 0000000000000000 R09: 0000000000000001
[ 1411.546212] R10: 57fe04c2df4f6680 R11: 0000000000000008 R12: 7ffffffffffffe9e
[ 1411.546240] R13: 000000000003ffff R14: ffffea0001449a00 R15: ffff881e49bbfd90
[ 1411.546270] FS:  00007f5dd5de7d40(0000) GS:ffff881fffb80000(0000) knlGS:0000000000000000
[ 1411.546302] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1411.546325] CR2: 00007f5b0c5969c0 CR3: 0000001e5e6ae000 CR4: 00000000003407e0
[ 1411.546354] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1411.546383] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1411.546412] Stack:
[ 1411.546421]  ffff881e49bbfc50 0000000000000002 ffff881f07b00628 ffff881e49bbfcb8
[ 1411.546456]  000000000000017b 000000000000000e 0000000000000000 ffffea00794f6600
[ 1411.546490]  ffffea00794f6640 ffffea00794f6680 ffffea0001449a00 ffffea00014499c0
[ 1411.546524] Call Trace:
[ 1411.546545]  [<ffffffffa070c9ab>] ext4_writepages+0x45b/0xd60 [ext4]
[ 1411.546576]  [<ffffffff8118d93e>] do_writepages+0x1e/0x40
[ 1411.546601]  [<ffffffff811824f5>] __filemap_fdatawrite_range+0x65/0x80
[ 1411.546629]  [<ffffffff81182641>] filemap_write_and_wait_range+0x41/0x90
[ 1411.546664]  [<ffffffffa0703bba>] ext4_sync_file+0xba/0x320 [ext4]
[ 1411.546692]  [<ffffffff8123028d>] vfs_fsync_range+0x1d/0x30
[ 1411.546717]  [<ffffffff811ba89e>] SyS_msync+0x1fe/0x250
[ 1411.546741]  [<ffffffff816974c9>] system_call_fastpath+0x16/0x1b
[ 1411.546765] Code: ff ff ff e8 2e 7e a8 e0 8b 85 40 ff ff ff eb c2 48 8d bd 50 ff ff ff e8 1a 7e a8 e0 eb 8c 4c 89 f7 e8 a0 81 a7 e0 e9 d5 fe ff ff <0f> 0b 0f 0b e8 62 d6 97 e0 66 90 0f 1f 44 00 00 55 48 89 e5 41
[ 1411.548075] RIP  [<ffffffffa07083e5>] mpage_prepare_extent_to_map+0x2d5/0x2e0 [ext4]
[ 1411.549292] RSP <ffff881e49bbfc10>
---here vmcore-dmesg got cut off----

Thanks

>								Honza
>
> [1] https://www.spinics.net/lists/linux-xfs/msg10090.html
> [2] https://www.spinics.net/lists/linux-ext4/msg54377.html
>
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
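
For what it's worth, the access pattern in the race described at the top of
this thread can be generated from userspace with something as simple as the
sketch below. The file names and sizes are made up and error handling is
trimmed; actually hitting the race also requires concurrent memory pressure
(kswapd reclaiming the page between the fault and the DIO completion), so
this only illustrates the pattern rather than being a reliable reproducer:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1 << 20;   /* multiple of the block size, as O_DIRECT requires */

	/* file1: MAP_SHARED mapping on ext4/xfs, used as the DIO destination */
	int fd1 = open("file1", O_RDWR | O_CREAT, 0644);
	if (fd1 < 0 || ftruncate(fd1, len) < 0)
		return 1;
	char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd1, 0);
	if (addr == MAP_FAILED)
		return 1;

	/* file2: read with O_DIRECT straight into the file1 mapping */
	int fd2 = open("file2", O_RDONLY | O_DIRECT);
	if (fd2 < 0)
		return 1;

	/*
	 * get_user_pages_fast() in the DIO path write-faults the page
	 * (->page_mkwrite() attaches buffers); dio_bio_complete() later
	 * calls set_page_dirty_lock() on the same page, which by then may
	 * have had its buffers stripped by reclaim.
	 */
	if (read(fd2, addr, len) < 0)
		return 1;

	/*
	 * Writeback of the file1 pages is where the filesystem then trips
	 * over a dirty page without buffers.
	 */
	msync(addr, len, MS_SYNC);
	return 0;
}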
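
For completeness, here is my rough reconstruction of the customer's
registration pattern as a userspace sketch over libibverbs; the device
selection, file path, sizes, and the final msync() are assumptions on my
side, and error handling is omitted. ibv_reg_mr() pins the pages through
ib_umem_get()/get_user_pages(), ibv_dereg_mr() ends up in
ib_umem_release() -> __ib_umem_release() -> set_page_dirty_lock(), and the
subsequent writeback of the file-backed pages is where the BUG_ON in the
trace above fires:

#include <fcntl.h>
#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1 << 20;

	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	/* region 1: MAP_SHARED mapping of a file on ext4 */
	int fd = open("/mnt/ext4/data", O_RDWR | O_CREAT, 0644);
	ftruncate(fd, len);
	void *file_buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			      MAP_SHARED, fd, 0);

	/* region 2: anonymous memory */
	void *anon_buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* ibv_reg_mr() pins both regions via ib_umem_get()/get_user_pages() */
	struct ibv_mr *mr_file = ibv_reg_mr(pd, file_buf, len,
					    IBV_ACCESS_LOCAL_WRITE |
					    IBV_ACCESS_REMOTE_WRITE);
	struct ibv_mr *mr_anon = ibv_reg_mr(pd, anon_buf, len,
					    IBV_ACCESS_LOCAL_WRITE |
					    IBV_ACCESS_REMOTE_WRITE);

	/* ... a couple of minutes of RDMA traffic into both regions ... */

	/* ib_umem_release() -> __ib_umem_release() -> set_page_dirty_lock() */
	ibv_dereg_mr(mr_file);
	ibv_dereg_mr(mr_anon);

	/* force writeback of the file-backed pages (SyS_msync -> ext4_writepages
	 * in the trace above) */
	msync(file_buf, len, MS_SYNC);
	return 0;
}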