(Copying in Christoph and Jens) * Dr. David Alan Gilbert (dave@xxxxxxxxxxx) wrote: > * Dr. David Alan Gilbert (dave@xxxxxxxxxxx) wrote: > > * Theodore Ts'o (tytso@xxxxxxx) wrote: > > > On Mon, Aug 14, 2023 at 09:02:53PM +0000, Dr. David Alan Gilbert wrote: > > > > dg 29594 29592 0 18:40 pts/0 00:00:00 /usr/bin/ar --plugin /usr/libexec/gcc/x86_64-redhat-linux/13/liblto_plugin.so -csrDT src/intel/perf/libintel_perf.a src/intel/perf/libintel_perf.a.p/meson-generated_.._intel_perf_metrics.c.o src/intel/perf/libintel_perf.a.p/intel_perf.c.o src/intel/perf/libintel_perf.a.p/intel_perf_query.c.o src/intel/perf/libintel_perf.a.p/intel_perf_mdapi.c.o > > > > > > > > [root@dalek dg]# cat /proc/29594/stack > > > > [<0>] md_super_wait+0xa2/0xe0 > > > > [<0>] md_bitmap_unplug+0xd2/0x120 > > > > [<0>] flush_bio_list+0xf3/0x100 [raid1] > > > > [<0>] raid1_unplug+0x3b/0xb0 [raid1] > > > > [<0>] __blk_flush_plug+0xd7/0x150 > > > > [<0>] blk_finish_plug+0x29/0x40 > > > > [<0>] ext4_do_writepages+0x401/0xc90 > > > > [<0>] ext4_writepages+0xad/0x180 > > > > > > If you want a few seconds and try grabbing cat /proc/29594/stack > > > again, what does the stack trace stay consistent as above? > > > > I'll get back to that and retry it. > > Yeh, the stack is consistent; this time around it's an 'ar' in a kernel > build: > > [root@dalek dg]# cat /proc/17970/stack > [<0>] md_super_wait+0xa2/0xe0 > [<0>] md_bitmap_unplug+0xad/0x120 > [<0>] flush_bio_list+0xf3/0x100 [raid1] > [<0>] raid1_unplug+0x3b/0xb0 [raid1] > [<0>] __blk_flush_plug+0xd7/0x150 > [<0>] blk_finish_plug+0x29/0x40 > [<0>] ext4_do_writepages+0x401/0xc90 > [<0>] ext4_writepages+0xad/0x180 > [<0>] do_writepages+0xd2/0x1e0 > [<0>] filemap_fdatawrite_wbc+0x63/0x90 > [<0>] __filemap_fdatawrite_range+0x5c/0x80 > [<0>] ext4_release_file+0x74/0xb0 > [<0>] __fput+0xf5/0x2a0 > [<0>] task_work_run+0x5d/0x90 > [<0>] exit_to_user_mode_prepare+0x1e6/0x1f0 > [<0>] syscall_exit_to_user_mode+0x1b/0x40 > [<0>] do_syscall_64+0x6c/0x90 > [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 > [root@dalek dg]# cat /proc/17970/stack > [<0>] md_super_wait+0xa2/0xe0 > [<0>] md_bitmap_unplug+0xad/0x120 > [<0>] flush_bio_list+0xf3/0x100 [raid1] > [<0>] raid1_unplug+0x3b/0xb0 [raid1] > [<0>] __blk_flush_plug+0xd7/0x150 > [<0>] blk_finish_plug+0x29/0x40 > [<0>] ext4_do_writepages+0x401/0xc90 > [<0>] ext4_writepages+0xad/0x180 > [<0>] do_writepages+0xd2/0x1e0 > [<0>] filemap_fdatawrite_wbc+0x63/0x90 > [<0>] __filemap_fdatawrite_range+0x5c/0x80 > [<0>] ext4_release_file+0x74/0xb0 > [<0>] __fput+0xf5/0x2a0 > [<0>] task_work_run+0x5d/0x90 > [<0>] exit_to_user_mode_prepare+0x1e6/0x1f0 > [<0>] syscall_exit_to_user_mode+0x1b/0x40 > [<0>] do_syscall_64+0x6c/0x90 > [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 > > > > Also, if you have iostat installed (usually part of the sysstat > > > package), does "iostat 1" show any I/O activity on the md device? > > iostat is showing something odd, most devices are at 0, > except for 3 of the dm's that are stuck at 100% utilisation with > apparently nothing going on: > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.06 0.00 0.03 53.06 0.00 46.84 > > Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util > ... > dm-16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 > dm-17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 > dm-18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > dm-19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > dm-20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 > .... > > dm-20 is the /dev/mapper/main-more which is the RAID on which the > fs runs, 16 and 17 are main-more_rmeta_0 and main-more_rimage_0 > so something screwy is going on there. I've just finished a bisect of this hang, and got to: 615939a2ae734e3e68c816d6749d1f5f79c62ab7 is the first bad commit commit 615939a2ae734e3e68c816d6749d1f5f79c62ab7 Author: Christoph Hellwig <hch@xxxxxx> Date: Fri May 19 06:40:48 2023 +0200 blk-mq: defer to the normal submission path for post-flush requests Requests with the FUA bit on hardware without FUA support need a post flush before returning to the caller, but they can still be sent using the normal I/O path after initializing the flush-related fields and end I/O handler. Signed-off-by: Christoph Hellwig <hch@xxxxxx> Reviewed-by: Bart Van Assche <bvanassche@xxxxxxx> Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@xxxxxx Signed-off-by: Jens Axboe <axboe@xxxxxxxxx> the full set of runs is: 2 good: 6995e2de6891c724bfeb2db33d7b87775f913ad1 v6.4 7 good 0aa69d53ac7c30f6184f88f2e310d808b32b35a5 6.4.0dg-00428-g0aa69d53ac7c 12 good c1075e548ce6e6b5c7b71f2b05d344164ebc52bb 6.4.0-rc2dg-00020-gc1075e548ce6 15 good be4c427809b0a746aff54dbb8ef663f0184291d0 6.4.0-rc2dg-00022-gbe4c427809b0 17 bad 615939a2ae734e3e68c816d6749d1f5f79c62ab7 6.4.0-rc2dg-00023-g615939a2ae73 16 bad 1e82fadfc6b96ca79f69d0bcf938d31032bb43d2 6.4.0-rc2dg-00024-g1e82fadfc6b9 14 bad 9a67aa52a42b31ad44220cc218df3b75a5cd5d05 6.4.0-rc2dg-00025-g9a67aa52a42b 13 bad 38f2dd34410f9070b60969a07ff7d8743b4fd56c 6.4.0-rc2dg-00030-g38f2dd34410f 11 bad a450f49708ea2ccabd1c5d2fe8a702ca5ef77941 6.4.0-rc2dg-00071-ga450f49708ea 10 bad a4f75764d16bed317276b05a9fe2c179ef61680d 6.4.0-rc2dg-00112-ga4f75764d16b 9 bad 658afed19ceed54a52b9e9e69c0791c8868ff55d 6.4.0-rc2dg-00193-g658afed19cee 8 bad 7cffdbe3607a6cc2dc02d135e13732ec36bc4e28 6.4.0dg-00752-g7cffdbe3607a 6 bad 2605e80d3438c77190f55b821c6575048c68268e 6.4.0dg-01094-g2605e80d3438 5 bad 6e17c6de3ddf3073741d9c91a796ee696914d8a0 6.4.0dg-02143-g6e17c6de3ddf 4 bad 3a8a670eeeaa40d87bd38a587438952741980c18 6.4.0dg-04247-g3a8a670eeeaa 3 bad 1b722407a13b7f8658d2e26917791f32805980a2 6.4.0dg-07066-g1b722407a13b 1 bad: 52a93d39b17dc v6.5-rc5 a sysrq-w from build 16 ( 1e82fadfc6b96ca79f69d0bcf938d31032bb43d2 ) s below. This is very repeatable for me; a) boot into desktop b) Start rhythmbox playing c) Open 2 terminals d) In one start make clean on kernel tree e) in other start iostat -x 1 f) Once the make clean is done, make -j 32 g) If that succeeds try ag somejunk in kernel tree (i.e. parallel grep) It's not clear to me if starting rhythmbox is actually necessary; but with that recipe it often fails in a few seconds. Let me know if there's anything you'd like me to try. Dave [ 90.614548] sysrq: Show Blocked State [ 90.615300] task:kworker/u64:3 state:D stack:0 pid:232 ppid:2 flags:0x00004000 [ 90.615307] Workqueue: writeback wb_workfn (flush-253:20) [ 90.615315] Call Trace: [ 90.615317] <TASK> [ 90.615321] __schedule+0x3eb/0x14c0 [ 90.615327] ? update_sd_lb_stats.constprop.0+0x135/0x8d0 [ 90.615334] schedule+0x5e/0xd0 [ 90.615336] io_schedule+0x46/0x70 [ 90.615338] bit_wait_io+0x11/0x70 [ 90.615341] __wait_on_bit+0x46/0x140 [ 90.615343] ? __pfx_bit_wait_io+0x10/0x10 [ 90.615346] out_of_line_wait_on_bit+0x95/0xc0 [ 90.615349] ? __pfx_wake_bit_function+0x10/0x10 [ 90.615353] do_get_write_access+0x266/0x410 [ 90.615358] jbd2_journal_get_write_access+0x5f/0x80 [ 90.615362] __ext4_journal_get_write_access+0x85/0x180 [ 90.615366] ? ext4_get_group_desc+0x5c/0xe0 [ 90.615369] ext4_mb_mark_diskspace_used+0x10d/0x460 [ 90.615374] ext4_mb_new_blocks+0x18b/0x1000 [ 90.615377] ? __kmalloc+0x4d/0x150 [ 90.615381] ? ext4_find_extent+0x3c6/0x420 [ 90.615385] ext4_ext_map_blocks+0x60d/0x1980 [ 90.615389] ? allocate_slab+0x251/0x4c0 [ 90.615393] ? ext4_alloc_io_end_vec+0x23/0x80 [ 90.615397] ? mpage_prepare_extent_to_map+0x454/0x490 [ 90.615401] ext4_map_blocks+0x251/0x620 [ 90.615404] ext4_do_writepages+0x76a/0xc90 [ 90.615408] ext4_writepages+0xad/0x180 [ 90.615414] do_writepages+0xd2/0x1e0 [ 90.615418] ? __wb_calc_thresh+0x4b/0x130 [ 90.615421] __writeback_single_inode+0x3d/0x360 [ 90.615424] ? wbc_detach_inode+0x101/0x220 [ 90.615427] writeback_sb_inodes+0x1ed/0x4b0 [ 90.615432] __writeback_inodes_wb+0x4c/0xf0 [ 90.615435] wb_writeback+0x172/0x2f0 [ 90.615439] wb_workfn+0x2b5/0x510 [ 90.615442] ? __schedule+0x3f3/0x14c0 [ 90.615444] ? __mod_timer+0x11f/0x370 [ 90.615449] process_one_work+0x1c7/0x3d0 [ 90.615453] worker_thread+0x51/0x390 [ 90.615456] ? __pfx_worker_thread+0x10/0x10 [ 90.615458] kthread+0xe8/0x120 [ 90.615462] ? __pfx_kthread+0x10/0x10 [ 90.615466] ret_from_fork+0x2c/0x50 [ 90.615472] </TASK> [ 90.615613] task:kworker/u64:4 state:D stack:0 pid:290 ppid:2 flags:0x00004000 [ 90.615617] Workqueue: writeback wb_workfn (flush-253:20) [ 90.615621] Call Trace: [ 90.615622] <TASK> [ 90.615623] __schedule+0x3eb/0x14c0 [ 90.615626] ? __submit_bio+0x88/0x170 [ 90.615629] ? submit_bio_noacct_nocheck+0x159/0x370 [ 90.615632] schedule+0x5e/0xd0 [ 90.615635] md_super_wait+0xa2/0xe0 [ 90.615640] ? __pfx_autoremove_wake_function+0x10/0x10 [ 90.615644] md_bitmap_unplug+0xf3/0x120 [ 90.615650] flush_bio_list+0x1e/0xe0 [raid1] [ 90.615659] raid1_unplug+0xaf/0xd0 [raid1] [ 90.615665] __blk_flush_plug+0xdb/0x160 [ 90.615669] blk_finish_plug+0x29/0x40 [ 90.615671] wb_writeback+0x2bc/0x2f0 [ 90.615674] wb_workfn+0x2b5/0x510 [ 90.615677] ? ttwu_queue_wakelist+0xef/0x110 [ 90.615681] process_one_work+0x1c7/0x3d0 [ 90.615684] worker_thread+0x51/0x390 [ 90.615687] ? __pfx_worker_thread+0x10/0x10 [ 90.615689] kthread+0xe8/0x120 [ 90.615692] ? __pfx_kthread+0x10/0x10 [ 90.615696] ret_from_fork+0x2c/0x50 [ 90.615701] </TASK> [ 90.616150] task:mdX_raid1 state:D stack:0 pid:1425 ppid:2 flags:0x00004000 [ 90.616153] Call Trace: [ 90.616154] <TASK> [ 90.616155] __schedule+0x3eb/0x14c0 [ 90.616157] ? __schedule+0x3f3/0x14c0 [ 90.616161] schedule+0x5e/0xd0 [ 90.616163] md_super_wait+0xa2/0xe0 [ 90.616167] ? __pfx_autoremove_wake_function+0x10/0x10 [ 90.616171] md_bitmap_daemon_work+0x211/0x3a0 [ 90.616175] ? __pfx_md_thread+0x10/0x10 [ 90.616179] md_check_recovery+0x42/0x5a0 [ 90.616182] raid1d+0x87/0x16d0 [raid1] [ 90.616188] ? __pfx_md_thread+0x10/0x10 [ 90.616191] ? __schedule+0x3f3/0x14c0 [ 90.616193] ? psi_task_switch+0x84/0x230 [ 90.616198] ? lock_timer_base+0x61/0x80 [ 90.616202] ? prepare_to_wait_event+0x60/0x180 [ 90.616206] ? __pfx_md_thread+0x10/0x10 [ 90.616209] md_thread+0xae/0x190 [ 90.616212] ? __pfx_autoremove_wake_function+0x10/0x10 [ 90.616217] kthread+0xe8/0x120 [ 90.616220] ? __pfx_kthread+0x10/0x10 [ 90.616223] ret_from_fork+0x2c/0x50 [ 90.616228] </TASK> [ 90.616310] task:jbd2/dm-20-8 state:D stack:0 pid:1501 ppid:2 flags:0x00004000 [ 90.616313] Call Trace: [ 90.616314] <TASK> [ 90.616315] __schedule+0x3eb/0x14c0 [ 90.616317] ? mempool_alloc+0x89/0x1b0 [ 90.616322] schedule+0x5e/0xd0 [ 90.616324] md_super_wait+0xa2/0xe0 [ 90.616328] ? __pfx_autoremove_wake_function+0x10/0x10 [ 90.616332] md_bitmap_unplug+0xba/0x120 [ 90.616336] flush_bio_list+0x1e/0xe0 [raid1] [ 90.616342] raid1_unplug+0xaf/0xd0 [raid1] [ 90.616349] __blk_flush_plug+0xdb/0x160 [ 90.616352] blk_finish_plug+0x29/0x40 [ 90.616354] jbd2_journal_commit_transaction+0xfdb/0x1a00 [ 90.616362] kjournald2+0xad/0x280 [ 90.616367] ? __pfx_autoremove_wake_function+0x10/0x10 [ 90.616372] ? __pfx_kjournald2+0x10/0x10 [ 90.616375] kthread+0xe8/0x120 [ 90.616378] ? __pfx_kthread+0x10/0x10 [ 90.616382] ret_from_fork+0x2c/0x50 [ 90.616387] </TASK> [ 90.618244] task:oggdemux0:sink state:D stack:0 pid:3216 ppid:2103 flags:0x00000002 [ 90.618248] Call Trace: [ 90.618249] <TASK> [ 90.618251] __schedule+0x3eb/0x14c0 [ 90.618254] schedule+0x5e/0xd0 [ 90.618256] io_schedule+0x46/0x70 [ 90.618259] folio_wait_bit_common+0x13d/0x350 [ 90.618262] ? __pfx_wake_page_function+0x10/0x10 [ 90.618265] filemap_get_pages+0x5fe/0x630 [ 90.618268] ? psi_task_switch+0xd6/0x230 [ 90.618272] ? finish_task_switch.isra.0+0x94/0x2f0 [ 90.618276] filemap_read+0xdf/0x350 [ 90.618279] ? dput+0x3a/0x310 [ 90.618284] vfs_read+0x1f6/0x320 [ 90.618289] ksys_read+0x6f/0xf0 [ 90.618291] do_syscall_64+0x60/0x90 [ 90.618295] ? exit_to_user_mode_prepare+0x142/0x1f0 [ 90.618298] ? syscall_exit_to_user_mode+0x1b/0x40 [ 90.618302] ? do_syscall_64+0x6c/0x90 [ 90.618305] ? exit_to_user_mode_prepare+0x142/0x1f0 [ 90.618307] ? syscall_exit_to_user_mode+0x1b/0x40 [ 90.618310] ? do_syscall_64+0x6c/0x90 [ 90.618312] ? do_syscall_64+0x6c/0x90 [ 90.618314] ? do_syscall_64+0x6c/0x90 [ 90.618316] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 90.618321] RIP: 0033:0x7f2b3f9083ba [ 90.618346] RSP: 002b:00007f2b179fe180 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 90.618349] RAX: ffffffffffffffda RBX: 0000559e098dfd20 RCX: 00007f2b3f9083ba [ 90.618351] RDX: 00000000000025ae RSI: 00007f2b04038190 RDI: 0000000000000015 [ 90.618352] RBP: 00007f2b179fe1a0 R08: 0000000000000000 R09: 00007f2b3f1e16a0 [ 90.618354] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000025ae [ 90.618355] R13: 0000000000000000 R14: 00007f2b179ff648 R15: 00000000000025ae [ 90.618358] </TASK> [ 90.618426] task:gcc state:D stack:0 pid:6486 ppid:6485 flags:0x00000006 [ 90.618429] Call Trace: [ 90.618430] <TASK> [ 90.618431] __schedule+0x3eb/0x14c0 [ 90.618435] schedule+0x5e/0xd0 [ 90.618437] io_schedule+0x46/0x70 [ 90.618440] folio_wait_bit_common+0x13d/0x350 [ 90.618443] ? __pfx_wake_page_function+0x10/0x10 [ 90.618445] filemap_get_pages+0x5fe/0x630 [ 90.618449] ? path_openat+0xa2c/0x1140 [ 90.618454] filemap_read+0xdf/0x350 [ 90.618457] ? do_filp_open+0xb3/0x160 [ 90.618461] vfs_read+0x1f6/0x320 [ 90.618465] ksys_read+0x6f/0xf0 [ 90.618467] do_syscall_64+0x60/0x90 [ 90.618469] ? exc_page_fault+0x7f/0x180 [ 90.618472] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 90.618476] RIP: 0033:0x7efee8b08381 [ 90.618480] RSP: 002b:00007ffc7e238648 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 90.618482] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efee8b08381 [ 90.618483] RDX: 0000000000000400 RSI: 000055b8799377e0 RDI: 0000000000000005 [ 90.618484] RBP: 00007ffc7e238770 R08: 0000000000000400 R09: 00007efee88442c0 [ 90.618486] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400 [ 90.618487] R13: 00007ffc7e238810 R14: 00007ffc7e238690 R15: 00007ffc7e238680 [ 90.618490] </TASK> > Dave > > > > > What about the underying block dvices used by the md device? If the > > > md device is attached to HDD's where you can see the activity light, > > > can you see (or hear) any disk activity? > > > > It's spinning rust, and I hear them go quiet when the hang happens. > > > > Dave > > > > > This sure seems like either the I/O driver isn't processing requests, > > > or some kind of hang in the md layer.... > > > > > > - Ted > > -- > > -----Open up your eyes, open up your mind, open up your code ------- > > / Dr. David Alan Gilbert | Running GNU/Linux | Happy \ > > \ dave @ treblig.org | | In Hex / > > \ _________________________|_____ http://www.treblig.org |_______/ > -- > -----Open up your eyes, open up your mind, open up your code ------- > / Dr. David Alan Gilbert | Running GNU/Linux | Happy \ > \ dave @ treblig.org | | In Hex / > \ _________________________|_____ http://www.treblig.org |_______/ -- -----Open up your eyes, open up your mind, open up your code ------- / Dr. David Alan Gilbert | Running GNU/Linux | Happy \ \ dave @ treblig.org | | In Hex / \ _________________________|_____ http://www.treblig.org |_______/