On 09/11/2019 23:38, Dave Chinner wrote:
> On Fri, Nov 08, 2019 at 07:01:15AM +0100, Arkadiusz Miśkiewicz wrote:
>>
>> Hello.
>>
>> I have two servers:
>>
>> backup4 - one with an Adaptec ASR8885Q (that's the one which breaks so
>> often, but this time the Adaptec works)
>>
>> backup3 - the other with software RAID only
>>
>> Both are now running 5.3.8 kernels and both end up like the log below.
>> It takes up to ~1 day to reproduce.
>>
>> Here are more complete logs and kernel configs:
>>
>> https://ixion.pld-linux.org/~arekm/p2/xfs/
>>
>> Any ideas what is happening and what can I do to help debug the problem?
>>
>>> Nov 8 00:55:19 backup4 kernel: WARNING: CPU: 5 PID: 25802 at fs/xfs/libxfs/xfs_bmap.c:4530 xfs_bmapi_convert_delalloc+0x434/0x4a0 [xfs]
>>> Nov 8 00:55:19 backup4 kernel: Modules linked in: nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_sfq nfnetlink_log nfnetlink xt_NFLOG xt_comment xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bpfilter xfs mlx4_ib ib_uverbs ib_core mlx4_en ses enclosure scsi_transport_sas joydev input_leds hid_generic usbhid hid coretemp intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm iTCO_wdt iTCO_vendor_support mxm_wmi ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper xhci_pci mlx4_core intel_cstate xhci_hcd intel_uncore ehci_pci ehci_hcd igb intel_rapl_perf pcspkr ipmi_si usbcore i2c_i801 mei_me ioatdma acpi_power_meter aacraid i2c_algo_bit ipmi_devintf mei dca lpc_ich i2c_core evdev ipmi_msghandler wmi hwmon acpi_pad button sch_fq_codel ext4 libcrc32c crc32c_generic crc32c_intel crc16 mbcache jbd2 sd_mod raid1 md_mod ahci libahci libata
>>> Nov 8 00:55:19 backup4 kernel: scsi_mod
>>> Nov 8 00:55:19 backup4 kernel: CPU: 5 PID: 25802 Comm: kworker/u65:3 Tainted: G T 5.3.8-1 #1
>>> Nov 8 00:55:19 backup4 kernel: Hardware name: Supermicro X10DRi/X10DRi, BIOS 3.0a 02/06/2018
>>> Nov 8 00:55:19 backup4 kernel: Workqueue: writeback wb_workfn (flush-8:48)
>>> Nov 8 00:55:19 backup4 kernel: RIP: 0010:xfs_bmapi_convert_delalloc+0x434/0x4a0 [xfs]
>>> Nov 8 00:55:19 backup4 kernel: Code: b7 c0 83 c0 01 e9 9c fd ff ff 41 80 bc 24 e9 00 00 00 03 49 8d 44 24 48 74 d9 c7 84 24 c0 00 00 00 01 00 00 00 e9 94 fd ff ff <0f> 0b 41 be e4 ff ff ff 48 8d 7c 24 30 44 89 f2 44 89 ee e8 84 6f
>>> Nov 8 00:55:19 backup4 kernel: RSP: 0018:ffffac6346a5b838 EFLAGS: 00010246
>>> Nov 8 00:55:19 backup4 kernel: RAX: 0000000000000000 RBX: ffff95c8f972a000 RCX: 0000000000000022
>>> Nov 8 00:55:19 backup4 kernel: RDX: 0000000000001fda RSI: ffffffffffffffff RDI: ffff95c8f93b0600
>>> Nov 8 00:55:19 backup4 kernel: RBP: ffffac6346a5b938 R08: 0000000000000000 R09: 0000000000000001
>>> Nov 8 00:55:19 backup4 kernel: R10: ffff95c33e6f41d8 R11: 0000000000000026 R12: ffff95be37512a80
>>> Nov 8 00:55:19 backup4 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff95be37512ac8
>>> Nov 8 00:55:19 backup4 kernel: FS: 0000000000000000(0000) GS:ffff95c8ff940000(0000) knlGS:0000000000000000
>>> Nov 8 00:55:19 backup4 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Nov 8 00:55:19 backup4 kernel: CR2: 00007ff938389000 CR3: 000000048720a001 CR4: 00000000003606e0
>>> Nov 8 00:55:19 backup4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Nov 8 00:55:19 backup4 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Nov 8 00:55:19 backup4 kernel: Call Trace:
>>> Nov 8 00:55:19 backup4 kernel: xfs_map_blocks+0x18e/0x420 [xfs]
>>> Nov 8 00:55:19 backup4 kernel: xfs_do_writepage+0x11c/0x440 [xfs]
>>> Nov 8 00:55:19 backup4 kernel: write_cache_pages+0x185/0x430
>>> Nov 8 00:55:19 backup4 kernel: ? xfs_vm_writepages+0x90/0x90 [xfs]
>>> Nov 8 00:55:19 backup4 kernel: xfs_vm_writepages+0x5e/0x90 [xfs]
>>> Nov 8 00:55:19 backup4 kernel: do_writepages+0x1c/0x60
>>> Nov 8 00:55:19 backup4 kernel: __writeback_single_inode+0x41/0x360
>>> Nov 8 00:55:19 backup4 kernel: writeback_sb_inodes+0x20c/0x490
>>> Nov 8 00:55:19 backup4 kernel: wb_writeback+0x12a/0x320
>>> Nov 8 00:55:19 backup4 kernel: wb_workfn+0xdd/0x4a0
>>> Nov 8 00:55:19 backup4 kernel: process_one_work+0x1d5/0x370
>>> Nov 8 00:55:19 backup4 kernel: worker_thread+0x4d/0x3d0
>>> Nov 8 00:55:19 backup4 kernel: kthread+0xfb/0x140
>>> Nov 8 00:55:19 backup4 kernel: ? process_one_work+0x370/0x370
>>> Nov 8 00:55:19 backup4 kernel: ? kthread_park+0x80/0x80
>>> Nov 8 00:55:19 backup4 kernel: ret_from_fork+0x35/0x40
>>> Nov 8 00:55:19 backup4 kernel: ---[ end trace a08de0c8c2851936 ]---
>>> Nov 8 00:55:19 backup4 kernel: XFS (sdd1): page discard on page 00000000e56e0621, inode 0x7a17f4206, offset 10354688.
>
> This goes with what you reported on #xfs:
>
> [9/11/19 08:04] <arekm> [ 54.777715] Filesystem "sdd1": reserve blocks depleted! Consider increasing reserve pool size.
>
> And:
>
> [9/11/19 08:16] <arekm> ah, yes [ 54.777720] XFS (sdd1): Per-AG reservation for AG 38 failed. Filesystem may run out of space

A note: I got these messages only once, on backup4. I didn't find such
messages in older logs (on either machine).

> which implies that we are regularly under-estimating delayed
> allocation metadata reservations, or a large delalloc extent is
> being split repeatedly on writeback. e.g. severe file fragmentation
> due to operating at/near ENOSPC resulting in a 100MB delalloc extent
> being split into individual single block allocations.
>
> Eventually this will deplete the reserve pool and trigger ENOSPC,
> which will eventually result in a dirty allocation transaction being
> cancelled and shutdown occurring.

Did anything important change in that area between 4.20 (where I had no
problems) and 5.1 (where both machines first hit the assert)? I see
xfs_bmapi_convert_delalloc-related changes in the git log from back then.

I've now been back on 4.19.82 on both machines for 4 days (on 5.3.x,
1 day was enough to end up with an fs shutdown), and there are no assert
hits.

[resend due to bad vger postmasters]

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )
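
PS. In case it helps with watching this: a minimal sketch below for
monitoring the reserve pool Dave mentions, assuming the xfsprogs
development headers are installed and using /mnt as a stand-in for the
real mount point. It queries the pool via the XFS_IOC_GET_RESBLKS ioctl
(the pool the "reserve blocks depleted!" message refers to); the
companion XFS_IOC_SET_RESBLKS ioctl can grow it.

/* resblks-check.c - query the XFS reserve block pool.
 * Build (an assumption, adjust to taste):
 *   gcc -o resblks-check resblks-check.c
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* xfsprogs header providing XFS_IOC_GET_RESBLKS */

int main(int argc, char **argv)
{
	/* Any fd on the filesystem will do; /mnt is an example path. */
	const char *path = argc > 1 ? argv[1] : "/mnt";
	struct xfs_fsop_resblks res;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, XFS_IOC_GET_RESBLKS, &res) < 0) {
		perror("XFS_IOC_GET_RESBLKS");
		close(fd);
		return 1;
	}
	/* resblks is the configured pool size in filesystem blocks;
	 * resblks_avail is what remains. resblks_avail draining to
	 * zero matches the "reserve blocks depleted!" warning. */
	printf("resblks: %llu, resblks_avail: %llu\n",
	       (unsigned long long)res.resblks,
	       (unsigned long long)res.resblks_avail);
	close(fd);
	return 0;
}

(xfs_io -x -c "resblks" on the mount point should report the same
numbers, if I remember the expert commands right.)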