* Dr. David Alan Gilbert (dave@xxxxxxxxxxx) wrote:
> * Theodore Ts'o (tytso@xxxxxxx) wrote:
> > On Mon, Aug 14, 2023 at 09:02:53PM +0000, Dr. David Alan Gilbert wrote:
> > > dg 29594 29592  0 18:40 pts/0 00:00:00 /usr/bin/ar --plugin /usr/libexec/gcc/x86_64-redhat-linux/13/liblto_plugin.so -csrDT src/intel/perf/libintel_perf.a src/intel/perf/libintel_perf.a.p/meson-generated_.._intel_perf_metrics.c.o src/intel/perf/libintel_perf.a.p/intel_perf.c.o src/intel/perf/libintel_perf.a.p/intel_perf_query.c.o src/intel/perf/libintel_perf.a.p/intel_perf_mdapi.c.o
> > >
> > > [root@dalek dg]# cat /proc/29594/stack
> > > [<0>] md_super_wait+0xa2/0xe0
> > > [<0>] md_bitmap_unplug+0xd2/0x120
> > > [<0>] flush_bio_list+0xf3/0x100 [raid1]
> > > [<0>] raid1_unplug+0x3b/0xb0 [raid1]
> > > [<0>] __blk_flush_plug+0xd7/0x150
> > > [<0>] blk_finish_plug+0x29/0x40
> > > [<0>] ext4_do_writepages+0x401/0xc90
> > > [<0>] ext4_writepages+0xad/0x180
> >
> > If you wait a few seconds and try grabbing cat /proc/29594/stack
> > again, does the stack trace stay consistent as above?
>
> I'll get back to that and retry it.

Yeh, the stack is consistent; this time around it's an 'ar' in a kernel
build:

[root@dalek dg]# cat /proc/17970/stack
[<0>] md_super_wait+0xa2/0xe0
[<0>] md_bitmap_unplug+0xad/0x120
[<0>] flush_bio_list+0xf3/0x100 [raid1]
[<0>] raid1_unplug+0x3b/0xb0 [raid1]
[<0>] __blk_flush_plug+0xd7/0x150
[<0>] blk_finish_plug+0x29/0x40
[<0>] ext4_do_writepages+0x401/0xc90
[<0>] ext4_writepages+0xad/0x180
[<0>] do_writepages+0xd2/0x1e0
[<0>] filemap_fdatawrite_wbc+0x63/0x90
[<0>] __filemap_fdatawrite_range+0x5c/0x80
[<0>] ext4_release_file+0x74/0xb0
[<0>] __fput+0xf5/0x2a0
[<0>] task_work_run+0x5d/0x90
[<0>] exit_to_user_mode_prepare+0x1e6/0x1f0
[<0>] syscall_exit_to_user_mode+0x1b/0x40
[<0>] do_syscall_64+0x6c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[root@dalek dg]# cat /proc/17970/stack
[<0>] md_super_wait+0xa2/0xe0
[<0>] md_bitmap_unplug+0xad/0x120
[<0>] flush_bio_list+0xf3/0x100 [raid1]
[<0>] raid1_unplug+0x3b/0xb0 [raid1]
[<0>] __blk_flush_plug+0xd7/0x150
[<0>] blk_finish_plug+0x29/0x40
[<0>] ext4_do_writepages+0x401/0xc90
[<0>] ext4_writepages+0xad/0x180
[<0>] do_writepages+0xd2/0x1e0
[<0>] filemap_fdatawrite_wbc+0x63/0x90
[<0>] __filemap_fdatawrite_range+0x5c/0x80
[<0>] ext4_release_file+0x74/0xb0
[<0>] __fput+0xf5/0x2a0
[<0>] task_work_run+0x5d/0x90
[<0>] exit_to_user_mode_prepare+0x1e6/0x1f0
[<0>] syscall_exit_to_user_mode+0x1b/0x40
[<0>] do_syscall_64+0x6c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

> > Also, if you have iostat installed (usually part of the sysstat
> > package), does "iostat 1" show any I/O activity on the md device?

iostat is showing something odd: most devices are at 0, except for 3 of
the dm's that are stuck at 100% utilisation with apparently nothing
going on:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.03   53.06    0.00   46.84

Device  r/s  rkB/s rrqm/s %rrqm r_await rareq-sz  w/s  wkB/s wrqm/s %wrqm w_await wareq-sz  d/s  dkB/s drqm/s %drqm d_await dareq-sz  f/s f_await aqu-sz  %util
...
dm-16  0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00 100.00
dm-17  0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00 100.00
dm-18  0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
dm-19  0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
dm-2   0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
dm-20  0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00   0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00 100.00
....

dm-20 is /dev/mapper/main-more, which is the RAID the fs runs on; 16
and 17 are main-more_rmeta_0 and main-more_rimage_0, so something
screwy is going on there.
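(In case anyone wants to double-check that dm-N-to-name mapping: the
names above come from this box's LVM layout, but on most setups
something like

  ls -l /dev/mapper/

shows each mapper name as a symlink to its ../dm-N node, and `lsblk'
shows the rmeta/rimage sub-volumes nested under main-more.)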
Dave

> > What about the underlying block devices used by the md device?  If
> > the md device is attached to HDD's where you can see the activity
> > light, can you see (or hear) any disk activity?
>
> It's spinning rust, and I hear them go quiet when the hang happens.
>
> Dave
>
> > This sure seems like either the I/O driver isn't processing
> > requests, or some kind of hang in the md layer....
> >
> >     - Ted
> --
>  -----Open up your eyes, open up your mind, open up your code -------
> / Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \
> \        dave @ treblig.org |                               | In Hex /
>  \ _________________________|_____ http://www.treblig.org   |_______/
--
 -----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/