cc linux-raid

On Wed, Dec 12, 2018 at 01:29:49PM +0100, Sinisa wrote:
> Hello group,
>
> I have noticed something strange going on lately, but recently I have come
> to the conclusion that there is some unwanted interaction between XFS and
> Linux RAID10 with "offset" layout.
>
> So here is the problem: I create a Linux RAID10 mirror with 2 disks (HDD or
> SSD) and "o2" layout (best choice for read and write speed):
> # mdadm -C -n2 -l10 -po2 /dev/mdX /dev/sdaX /dev/sdbX
> # mkfs.xfs /dev/mdX
> # mount /dev/mdX /mnt
> # rsync -avxDPHS / /mnt
>
> So we have RAID10 initializing:
>
> # cat /proc/mdstat
> Personalities : [raid1] [raid10]
> md2 : active raid10 sdb3[1] sda3[0]
>       314433536 blocks super 1.2 4096K chunks 2 offset-copies [2/2] [UU]
>       [==>..................]  resync = 11.7% (36917568/314433536)
>       finish=8678.2min speed=532K/sec
>       bitmap: 3/3 pages [12KB], 65536KB chunk
>
> but after a few minutes everything stops, as you can see above. Rsync (or
> any other process writing to that md device) also freezes. If I try to read
> the already copied files - another freeze, usually with less than 2GB copied.
>

Does the same thing happen without the RAID initialization? E.g., if you
wait for it to complete or (IIRC) if you create with --assume-clean? I
assume the init-in-progress state is common with your tests on other
filesystems?
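Just to be explicit about what I mean, something along these lines should take
the initial resync out of the picture entirely (untested, device names as in
your example):

# mdadm -C -n2 -l10 -po2 --assume-clean /dev/mdX /dev/sdaX /dev/sdbX

Or, for the "wait for it to complete" case, let the resync finish before
starting the copy:

# mdadm --wait /dev/mdX
# mkfs.xfs /dev/mdX
# mount /dev/mdX /mnt
# rsync -avxDPHS / /mnt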
A few more notes inline below on the log.

> Sometimes in dmesg I get some kernel messages about "task kworker/2:1:55
> blocked for more than 480 seconds." (please see attached dmesg.txt and my
> reports here: https://bugzilla.opensuse.org/show_bug.cgi?id=1111073), and
> sometimes nothing at all. When this happens, I can only reboot with SysRq-b
> or "physically" with the reset/power button.
>
> The same thing can happen with "far" layout, but it seems to me that it does
> not happen every time (or that often). I might be wrong, because I never use
> "far" layout in real life, only for testing.
> I was unable to reproduce the failure with "near" layout.
>
> Also, with EXT4 or BTRFS and any layout everything works just as it should,
> that is, the sync runs until it finishes, and rsync, cp, or any other writes
> work just fine at the same time.
>
> Let me just add that I first saw this behavior in openSUSE LEAP 15.0 (kernel
> 4.12). In previous versions (up to kernel 4.4) I never had this problem. In
> the meantime I have tested with kernels up to 4.20rc and it is the same.
> Unfortunately I cannot go back to test kernels 4.5 - 4.11 to pinpoint the
> moment the problem first appeared.
>
> --
> Best regards,
> Siniša Bandin
> (excuse my English)
>
> [ 180.981499] SGI XFS with ACLs, security attributes, no debug enabled
> [ 181.005019] XFS (md1): Mounting V5 Filesystem
> [ 181.132076] XFS (md1): Starting recovery (logdev: internal)
> [ 181.295606] XFS (md1): Ending recovery (logdev: internal)
> [ 181.804011] XFS (md1): Unmounting Filesystem
> [ 182.201794] XFS (md127): Mounting V4 Filesystem
> [ 182.736958] md: recovery of RAID array md127
> [ 182.915479] XFS (md127): Ending clean mount
> [ 183.819702] XFS (md127): Unmounting Filesystem
> [ 184.943831] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
> [ 529.784557] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
> [ 601.789958] md1: detected capacity change from 33284947968 to 0
> [ 601.789973] md: md1 stopped.
> [ 602.314112] md0: detected capacity change from 550436864 to 0
> [ 602.314128] md: md0 stopped.
> [ 602.745030] md: md127: recovery interrupted.
> [ 603.131684] md127: detected capacity change from 966229229568 to 0
> [ 603.132237] md: md127 stopped.
> [ 603.435808] sda: sda1 sda2
> [ 603.594074] udevd[5011]: inotify_add_watch(11, /dev/sda2, 10) failed: No such file or directory
> [ 603.643959] sda:
> [ 603.844724] sdb: sdb1 sdb2
> [ 604.255407] sdb: sdb1
> [ 604.490214] udevd[5050]: inotify_add_watch(11, /dev/sdb1, 10) failed: No such file or directory
> [ 605.140952] sdb: sdb1
> [ 605.628686] sdb: sdb1 sdb2
> [ 606.271192] sdb: sdb1 sdb2 sdb3
> [ 607.079626] sdb: sdb1 sdb2 sdb3
> [ 607.611092] sda:
> [ 608.273201] sda: sda1
> [ 608.611952] sda: sda1 sda2
> [ 609.031326] sda: sda1 sda2 sda3
> [ 609.753140] md/raid10:md1: not clean -- starting background reconstruction
> [ 609.753145] md/raid10:md1: active with 2 out of 2 devices
> [ 609.768804] md1: detected capacity change from 0 to 32210157568
> [ 609.772677] md: resync of RAID array md1
> [ 614.590107] XFS (md1): Mounting V5 Filesystem
> [ 615.449035] XFS (md1): Ending clean mount
> [ 617.678462] md/raid1:md0: not clean -- starting background reconstruction
> [ 617.678469] md/raid1:md0: active with 2 out of 2 mirrors
> [ 617.740729] md0: detected capacity change from 0 to 524222464
> [ 617.747107] md: delaying resync of md0 until md1 has finished (they share one or more physical units)

What are md0 and md1? Note that I don't see md2 anywhere in this log.

> [ 620.037818] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
> [ 1463.754785] INFO: task kworker/0:3:227 blocked for more than 480 seconds.
> [ 1463.754793] Not tainted 4.19.5-1-default #1
> [ 1463.754795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1463.754799] kworker/0:3 D 0 227 2 0x80000000
> [ 1463.755000] Workqueue: xfs-eofblocks/md1 xfs_eofblocks_worker [xfs]
> [ 1463.755005] Call Trace:
> [ 1463.755025] ? __schedule+0x29a/0x880
> [ 1463.755032] ? rwsem_down_write_failed+0x197/0x350
> [ 1463.755038] schedule+0x78/0x110
> [ 1463.755044] rwsem_down_write_failed+0x197/0x350
> [ 1463.755055] call_rwsem_down_write_failed+0x13/0x20
> [ 1463.755061] down_write+0x20/0x30

So we have a background task blocked on an inode lock.

> [ 1463.755196] xfs_free_eofblocks+0x114/0x1a0 [xfs]
> [ 1463.755330] xfs_inode_free_eofblocks+0xd3/0x1e0 [xfs]
> [ 1463.755459] ? xfs_inode_ag_walk_grab+0x5b/0x90 [xfs]
> [ 1463.755586] xfs_inode_ag_walk.isra.15+0x1aa/0x420 [xfs]
> [ 1463.755714] ? __xfs_inode_clear_blocks_tag+0x120/0x120 [xfs]
> [ 1463.755727] ? trace_hardirqs_on_thunk+0x1a/0x1c
> [ 1463.755734] ? __switch_to_asm+0x40/0x70
> [ 1463.755738] ? __switch_to_asm+0x34/0x70
> [ 1463.755743] ? __switch_to_asm+0x40/0x70
> [ 1463.755748] ? __switch_to_asm+0x34/0x70
> [ 1463.755752] ? __switch_to_asm+0x40/0x70
> [ 1463.755757] ? __switch_to_asm+0x34/0x70
> [ 1463.755762] ? __switch_to_asm+0x40/0x70
> [ 1463.755893] ? __xfs_inode_clear_blocks_tag+0x120/0x120 [xfs]
> [ 1463.755900] ? radix_tree_gang_lookup_tag+0xc2/0x140
> [ 1463.756032] ? __xfs_inode_clear_blocks_tag+0x120/0x120 [xfs]
> [ 1463.756158] xfs_inode_ag_iterator_tag+0x73/0xb0 [xfs]
> [ 1463.756288] xfs_eofblocks_worker+0x29/0x40 [xfs]
> [ 1463.756298] process_one_work+0x1fd/0x420
> [ 1463.756305] worker_thread+0x2d/0x3d0
> [ 1463.756311] ? rescuer_thread+0x340/0x340
> [ 1463.756316] kthread+0x112/0x130
> [ 1463.756322] ? kthread_create_worker_on_cpu+0x40/0x40
> [ 1463.756329] ret_from_fork+0x3a/0x50
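If you can still get a shell when it's in this state, a one-shot dump of all
blocked tasks (plus the mdstat at that moment) might be more useful than the
periodic hung task splats. Assuming sysrq is enabled, something like:

# echo w > /proc/sysrq-trigger
# cat /proc/mdstat
# dmesg > hang-dump.txt

...and attach the result.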
> [ 1463.756375] INFO: task kworker/u4:0:4615 blocked for more than 480 seconds.
> [ 1463.756379] Not tainted 4.19.5-1-default #1
> [ 1463.756380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1463.756383] kworker/u4:0 D 0 4615 2 0x80000000
> [ 1463.756395] Workqueue: writeback wb_workfn (flush-9:1)
> [ 1463.756400] Call Trace:
> [ 1463.756409] ? __schedule+0x29a/0x880
> [ 1463.756420] ? wait_barrier+0xdd/0x170 [raid10]
> [ 1463.756426] schedule+0x78/0x110
> [ 1463.756433] wait_barrier+0xdd/0x170 [raid10]
> [ 1463.756440] ? wait_woken+0x80/0x80
> [ 1463.756448] raid10_write_request+0xf2/0x900 [raid10]
> [ 1463.756454] ? wait_woken+0x80/0x80
> [ 1463.756459] ? mempool_alloc+0x55/0x160
> [ 1463.756483] ? md_write_start+0xa9/0x270 [md_mod]
> [ 1463.756492] raid10_make_request+0xc1/0x120 [raid10]
> [ 1463.756498] ? wait_woken+0x80/0x80
> [ 1463.756514] md_handle_request+0x121/0x190 [md_mod]
> [ 1463.756535] md_make_request+0x78/0x190 [md_mod]
> [ 1463.756544] generic_make_request+0x1c6/0x470
> [ 1463.756553] submit_bio+0x45/0x140

Writeback is blocked submitting I/O down in the MD driver.

> [ 1463.756714] xfs_submit_ioend+0x9c/0x1e0 [xfs]
> [ 1463.756844] xfs_vm_writepages+0x68/0x80 [xfs]
> [ 1463.756856] do_writepages+0x31/0xb0
> [ 1463.756865] ? read_hpet+0x126/0x130
> [ 1463.756873] ? ktime_get+0x36/0xa0
> [ 1463.756881] __writeback_single_inode+0x3d/0x3e0
> [ 1463.756889] writeback_sb_inodes+0x1c4/0x430
> [ 1463.756902] __writeback_inodes_wb+0x5d/0xb0
> [ 1463.756910] wb_writeback+0x26b/0x310
> [ 1463.756920] wb_workfn+0x33a/0x410
> [ 1463.756932] process_one_work+0x1fd/0x420
> [ 1463.756940] worker_thread+0x2d/0x3d0
> [ 1463.756946] ? rescuer_thread+0x340/0x340
> [ 1463.756951] kthread+0x112/0x130
> [ 1463.756957] ? kthread_create_worker_on_cpu+0x40/0x40
> [ 1463.756965] ret_from_fork+0x3a/0x50
> [ 1463.756979] INFO: task kworker/0:2:4994 blocked for more than 480 seconds.
> [ 1463.756982] Not tainted 4.19.5-1-default #1
> [ 1463.756984] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1463.756987] kworker/0:2 D 0 4994 2 0x80000000
> [ 1463.757013] Workqueue: md submit_flushes [md_mod]
> [ 1463.757016] Call Trace:
> [ 1463.757024] ? __schedule+0x29a/0x880
> [ 1463.757034] ? wait_barrier+0xdd/0x170 [raid10]
> [ 1463.757039] schedule+0x78/0x110
> [ 1463.757047] wait_barrier+0xdd/0x170 [raid10]
> [ 1463.757054] ? wait_woken+0x80/0x80
> [ 1463.757062] raid10_write_request+0xf2/0x900 [raid10]
> [ 1463.757067] ? wait_woken+0x80/0x80
> [ 1463.757072] ? mempool_alloc+0x55/0x160
> [ 1463.757088] ? md_write_start+0xa9/0x270 [md_mod]
> [ 1463.757095] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 1463.757104] raid10_make_request+0xc1/0x120 [raid10]
> [ 1463.757110] ? wait_woken+0x80/0x80
> [ 1463.757126] md_handle_request+0x121/0x190 [md_mod]
> [ 1463.757132] ? _raw_spin_unlock_irq+0x22/0x40
> [ 1463.757137] ? finish_task_switch+0x74/0x260
> [ 1463.757156] submit_flushes+0x21/0x40 [md_mod]

Some other MD task (?) also blocked submitting a request.

> [ 1463.757163] process_one_work+0x1fd/0x420
> [ 1463.757170] worker_thread+0x2d/0x3d0
> [ 1463.757177] ? rescuer_thread+0x340/0x340
> [ 1463.757181] kthread+0x112/0x130
> [ 1463.757186] ? kthread_create_worker_on_cpu+0x40/0x40
> [ 1463.757193] ret_from_fork+0x3a/0x50
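Since both of those are stuck in wait_barrier(), the MD/resync state at the
time of the hang would be interesting as well. For example (md1 here, adjust
the device name as needed):

# cat /proc/mdstat
# cat /sys/block/md1/md/sync_action
# cat /sys/block/md1/md/sync_completed
# cat /sys/block/md1/md/array_state

...just to see whether the resync still thinks it's making progress.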
> [ 1463.757205] INFO: task md1_resync:5215 blocked for more than 480 seconds.
> [ 1463.757207] Not tainted 4.19.5-1-default #1
> [ 1463.757209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1463.757212] md1_resync D 0 5215 2 0x80000000
> [ 1463.757216] Call Trace:
> [ 1463.757223] ? __schedule+0x29a/0x880
> [ 1463.757231] ? raise_barrier+0x8d/0x140 [raid10]
> [ 1463.757236] schedule+0x78/0x110
> [ 1463.757243] raise_barrier+0x8d/0x140 [raid10]
> [ 1463.757248] ? wait_woken+0x80/0x80
> [ 1463.757257] raid10_sync_request+0x1f6/0x1e30 [raid10]
> [ 1463.757265] ? _raw_spin_unlock_irq+0x22/0x40
> [ 1463.757284] ? is_mddev_idle+0x125/0x137 [md_mod]
> [ 1463.757302] md_do_sync.cold.78+0x404/0x969 [md_mod]

The md1 sync task is blocked as well; I'm not sure on what.

> [ 1463.757311] ? wait_woken+0x80/0x80
> [ 1463.757336] ? md_rdev_init+0xb0/0xb0 [md_mod]
> [ 1463.757351] md_thread+0xe9/0x140 [md_mod]
> [ 1463.757358] ? _raw_spin_unlock_irqrestore+0x2e/0x60
> [ 1463.757364] ? __kthread_parkme+0x4c/0x70
> [ 1463.757369] kthread+0x112/0x130
> [ 1463.757374] ? kthread_create_worker_on_cpu+0x40/0x40
> [ 1463.757380] ret_from_fork+0x3a/0x50
> [ 1463.757395] INFO: task xfsaild/md1:5233 blocked for more than 480 seconds.
> [ 1463.757398] Not tainted 4.19.5-1-default #1
> [ 1463.757400] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1463.757402] xfsaild/md1 D 0 5233 2 0x80000000
> [ 1463.757406] Call Trace:
> [ 1463.757413] ? __schedule+0x29a/0x880
> [ 1463.757421] ? wait_barrier+0xdd/0x170 [raid10]
> [ 1463.757426] schedule+0x78/0x110
> [ 1463.757433] wait_barrier+0xdd/0x170 [raid10]
> [ 1463.757438] ? wait_woken+0x80/0x80
> [ 1463.757446] raid10_write_request+0xf2/0x900 [raid10]
> [ 1463.757451] ? wait_woken+0x80/0x80
> [ 1463.757455] ? mempool_alloc+0x55/0x160
> [ 1463.757471] ? md_write_start+0xa9/0x270 [md_mod]
> [ 1463.757477] ? trace_hardirqs_on_thunk+0x1a/0x1c
> [ 1463.757485] raid10_make_request+0xc1/0x120 [raid10]
> [ 1463.757491] ? wait_woken+0x80/0x80
> [ 1463.757507] md_handle_request+0x121/0x190 [md_mod]
> [ 1463.757527] md_make_request+0x78/0x190 [md_mod]
> [ 1463.757536] generic_make_request+0x1c6/0x470
> [ 1463.757544] submit_bio+0x45/0x140

xfsaild (metadata writeback) is also blocked submitting I/O down in the MD driver.

> [ 1463.757552] ? bio_add_page+0x48/0x60
> [ 1463.757716] _xfs_buf_ioapply+0x2c1/0x450 [xfs]
> [ 1463.757849] ? xfs_buf_delwri_submit_buffers+0xec/0x280 [xfs]
> [ 1463.757974] __xfs_buf_submit+0x67/0x270 [xfs]
> [ 1463.758102] xfs_buf_delwri_submit_buffers+0xec/0x280 [xfs]
> [ 1463.758232] ? xfsaild+0x294/0x7e0 [xfs]
> [ 1463.758364] xfsaild+0x294/0x7e0 [xfs]
> [ 1463.758377] ? _raw_spin_unlock_irqrestore+0x2e/0x60
> [ 1463.758508] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
> [ 1463.758514] kthread+0x112/0x130
> [ 1463.758520] ? kthread_create_worker_on_cpu+0x40/0x40
> [ 1463.758527] ret_from_fork+0x3a/0x50
> [ 1463.758543] INFO: task rpm:5364 blocked for more than 480 seconds.
> [ 1463.758546] Not tainted 4.19.5-1-default #1
> [ 1463.758547] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1463.758550] rpm D 0 5364 3757 0x00000000
> [ 1463.758554] Call Trace:
> [ 1463.758563] ? __schedule+0x29a/0x880
> [ 1463.758701] ? xlog_wait+0x5c/0x70 [xfs]
> [ 1463.759821] schedule+0x78/0x110
> [ 1463.760022] xlog_wait+0x5c/0x70 [xfs]
> [ 1463.760036] ? wake_up_q+0x70/0x70
> [ 1463.760167] __xfs_log_force_lsn+0x223/0x230 [xfs]
> [ 1463.760297] ? xfs_file_fsync+0x196/0x1d0 [xfs]
> [ 1463.760424] xfs_log_force_lsn+0x93/0x140 [xfs]
> [ 1463.760552] xfs_file_fsync+0x196/0x1d0 [xfs]

An fsync is blocked, presumably on XFS log I/O completion.

> [ 1463.760562] ? __sb_end_write+0x36/0x60
> [ 1463.760571] do_fsync+0x38/0x70
> [ 1463.760578] __x64_sys_fdatasync+0x13/0x20
> [ 1463.760585] do_syscall_64+0x60/0x110
> [ 1463.760594] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 1463.760603] RIP: 0033:0x7f9757fae8a4
> [ 1463.760616] Code: Bad RIP value.
> [ 1463.760619] RSP: 002b:00007fff74fdb428 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
> [ 1463.760654] RAX: ffffffffffffffda RBX: 0000000000000064 RCX: 00007f9757fae8a4
> [ 1463.760657] RDX: 00000000012c4c60 RSI: 00000000012cc130 RDI: 0000000000000004
> [ 1463.760660] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f9758708c00
> [ 1463.760662] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000012cc130
> [ 1463.760665] R13: 000000000123a3a0 R14: 0000000000010830 R15: 0000000000000062
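One more thing that might be worth checking when it stalls: whether any I/O is
actually outstanding at the member devices, or whether everything is just
queued up above them. For example (sda/sdb being the RAID members, iostat from
sysstat; adjust names as needed):

# iostat -x sda sdb md1 1 5
# cat /sys/block/sda/inflight /sys/block/sdb/inflight   (if your kernel exposes it)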
> [ 1463.760679] INFO: task kworker/0:8:5367 blocked for more than 480 seconds.
> [ 1463.760683] Not tainted 4.19.5-1-default #1
> [ 1463.760684] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1463.760687] kworker/0:8 D 0 5367 2 0x80000000
> [ 1463.760718] Workqueue: md submit_flushes [md_mod]

And that MD submit_flushes thing again.

Not to say there isn't some issue between XFS and MD going on here, but I
think we might want an MD person to take a look at this and possibly provide
some insight. From an XFS perspective, this all just looks like we're blocked
on I/O (via writeback, AIL and log) to a slow device.

Brian

> [ 1463.760721] Call Trace:
> [ 1463.760731] ? __schedule+0x29a/0x880
> [ 1463.760741] ? wait_barrier+0xdd/0x170 [raid10]
> [ 1463.760746] schedule+0x78/0x110
> [ 1463.760753] wait_barrier+0xdd/0x170 [raid10]
> [ 1463.760761] ? wait_woken+0x80/0x80
> [ 1463.760768] raid10_write_request+0xf2/0x900 [raid10]
> [ 1463.760774] ? wait_woken+0x80/0x80
> [ 1463.760778] ? mempool_alloc+0x55/0x160
> [ 1463.760795] ? md_write_start+0xa9/0x270 [md_mod]
> [ 1463.760801] ? try_to_wake_up+0x44/0x470
> [ 1463.760810] raid10_make_request+0xc1/0x120 [raid10]
> [ 1463.760816] ? wait_woken+0x80/0x80
> [ 1463.760831] md_handle_request+0x121/0x190 [md_mod]
> [ 1463.760851] md_make_request+0x78/0x190 [md_mod]
> [ 1463.760860] generic_make_request+0x1c6/0x470
> [ 1463.760870] raid10_write_request+0x77a/0x900 [raid10]
> [ 1463.760875] ? wait_woken+0x80/0x80
> [ 1463.760879] ? mempool_alloc+0x55/0x160
> [ 1463.760895] ? md_write_start+0xa9/0x270 [md_mod]
> [ 1463.760904] raid10_make_request+0xc1/0x120 [raid10]
> [ 1463.760910] ? wait_woken+0x80/0x80
> [ 1463.760926] md_handle_request+0x121/0x190 [md_mod]
> [ 1463.760931] ? _raw_spin_unlock_irq+0x22/0x40
> [ 1463.760936] ? finish_task_switch+0x74/0x260
> [ 1463.760954] submit_flushes+0x21/0x40 [md_mod]
> [ 1463.760962] process_one_work+0x1fd/0x420
> [ 1463.760970] worker_thread+0x2d/0x3d0
> [ 1463.760976] ? rescuer_thread+0x340/0x340
> [ 1463.760981] kthread+0x112/0x130
> [ 1463.760986] ? kthread_create_worker_on_cpu+0x40/0x40
> [ 1463.760992] ret_from_fork+0x3a/0x50