Re: XFS and RAID10 with o2 layout

On 12/13/18 1:28 PM, Brian Foster wrote:
On Thu, Dec 13, 2018 at 09:21:18AM +0100, Sinisa wrote:
Thanks for a quick reply. Replies are inline...

On 12.12.2018 15:30, Brian Foster wrote:
cc linux-raid

On Wed, Dec 12, 2018 at 01:29:49PM +0100, Sinisa wrote:
Hello group,

I have noticed something strange going on lately, and recently I have come
to the conclusion that there is some unwanted interaction between XFS and
Linux RAID10 with the "offset" layout.

So here is the problem: I create a Linux RAID10 mirror with 2 disks (HDD or
SSD) and the "o2" (offset) layout, the best choice for read and write speed:
# mdadm -C -n2 -l10 -po2 /dev/mdX /dev/sdaX /dev/sdbX
# mkfs.xfs /dev/mdX
# mount /dev/mdX /mnt
# rsync -avxDPHS / /mnt

So we have RAID10 initializing:

# cat /proc/mdstat
Personalities : [raid1] [raid10]
md2 : active raid10 sdb3[1] sda3[0]
       314433536 blocks super 1.2 4096K chunks 2 offset-copies [2/2] [UU]
       [==>..................]  resync = 11.7% (36917568/314433536) finish=8678.2min speed=532K/sec
       bitmap: 3/3 pages [12KB], 65536KB chunk

but after a few minutes everything stops, as you can see above. Rsync (or
any other process writing to that md device) also freezes. If I try to read
already-copied files, that freezes too, usually with less than 2GB copied.

Does the same thing happen without the RAID initialization? E.g., if you
wait for it to complete or (IIRC) if you create with --assume-clean? I
assume the init-in-progress state is common with your tests on other
filesystems?

No, if I wait for RAID to finish initializing, or create it with
--assume-clean, everything works just fine.
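
For the record, the variant that avoids the problem entirely is just the
same create command with --assume-clean added, which skips the initial
resync (device names hypothetical, as in my first mail):

# mdadm -C -n2 -l10 -po2 --assume-clean /dev/mdX /dev/sdaX /dev/sdbX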

Actually, ever since the openSUSE LEAP 15.0 release I have been doing just that:
pausing the installation process until initialization is done, then letting it go on.

But recently I had to replace one of the disks in a "live" system (a small
file server), and on multiple tries during work hours I was unable to do it
because of this problem. Only when I waited until the afternoon, when nobody
was working/writing, was the resync able to finish...
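
(The replacement itself was the usual mdadm procedure; the device names here
are just illustrative:

# mdadm /dev/mdX --fail /dev/sdbX
# mdadm /dev/mdX --remove /dev/sdbX
# mdadm /dev/mdX --add /dev/sdcX

It is the recovery that starts after the --add that kept hanging under load.)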

So apparently there is some kind of poor interaction here with the
internal MD resync code. It's not clear to me whether it's a lockup or
an extreme slowdown, but unless anybody else has ideas, I'd suggest
soliciting feedback from the MD devs (note that you dropped the linux-raid
cc) as to why this set of I/O might be blocked in the raid device, and go
from there.

Brian

(Sorry, I'm not very much into mailing lists; I have added linux-raid back to cc.)

Today I tried lowering the RAID10 chunk size to 512KB (the default), and the only difference was that the freeze appeared much faster. I also tried the newest RC kernel, 4.20.0-rc6-2.g91eea17-default (from openSUSE kernel/HEAD), with the same results.
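
To be precise, the only change from my original create command was the chunk
size argument, something like:

# mdadm -C -n2 -l10 -po2 -c512 /dev/mdX /dev/sdaX /dev/sdbX

(mdadm interprets a bare -c512 as 512KB).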

And it is definitely a lockup, because I have left it overnight: once rsync/copy/write stops, it never moves on, and once the RAID sync stops, it never moves on either...

I have tried many times to get that dmesg report again, but without success (I waited up to 30 minutes). Any help is welcome...
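
If it would help, next time I can try to dump the blocked tasks on demand
instead of waiting for the 480-second hung-task timer; as far as I know,
something like this should capture the same traces:

# echo 1 > /proc/sys/kernel/sysrq
# echo w > /proc/sysrq-trigger

(the "w" dump of uninterruptible/blocked tasks then shows up in dmesg).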


What I have not tried is another distribution; I have only used openSUSE LEAP and Tumbleweed.



A few more notes inline below on the log...

Sometimes in dmesg I get kernel messages like "task kworker/2:1:55
blocked for more than 480 seconds." (please see the attached dmesg.txt and my
reports here: https://bugzilla.opensuse.org/show_bug.cgi?id=1111073), and
sometimes nothing at all. When this happens, I can only reboot with SysRq-b
or "physically" with the reset/power button.

The same thing can happen with the "far" layout, but it seems to me that it
does not happen every time (or as often). I might be wrong, because I never
use the "far" layout in real life, only for testing.
I was unable to reproduce the failure with the "near" layout.
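
To summarize my reproduction attempts, only the layout argument (-p) differs
between these tests (device names hypothetical, as before):

# mdadm -C -n2 -l10 -pn2 /dev/mdX /dev/sdaX /dev/sdbX    <- "near": no freeze
# mdadm -C -n2 -l10 -pf2 /dev/mdX /dev/sdaX /dev/sdbX    <- "far": occasional freeze
# mdadm -C -n2 -l10 -po2 /dev/mdX /dev/sdaX /dev/sdbX    <- "offset": freezes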

Also, with EXT4 or BTRFS and any layout, everything works just as it should:
the sync runs until finished, and rsync, cp, or any other writes work
just fine at the same time.

Let me just add that I first saw this behavior in openSUSE LEAP 15.0 (kernel
4.12). In previous versions (up to kernel 4.4) I never had this problem. In
the meantime I have tested with kernels up to 4.20-rc and the behavior is the
same. Unfortunately I cannot go back and test kernels 4.5-4.11 to pinpoint
the moment the problem first appeared.
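
(If anyone has an environment where old kernels can be built and booted, I
suppose the standard way to narrow this down would be a bisection over that
range, roughly:

# git bisect start
# git bisect bad v4.12
# git bisect good v4.4

but, as I said, I cannot run those kernels on these machines.)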



--
Best regards,
Siniša Bandin
(excuse my English)

[ 180.981499] SGI XFS with ACLs, security attributes, no debug enabled
[ 181.005019] XFS (md1): Mounting V5 Filesystem
[ 181.132076] XFS (md1): Starting recovery (logdev: internal)
[ 181.295606] XFS (md1): Ending recovery (logdev: internal)
[ 181.804011] XFS (md1): Unmounting Filesystem
[ 182.201794] XFS (md127): Mounting V4 Filesystem
[ 182.736958] md: recovery of RAID array md127
[ 182.915479] XFS (md127): Ending clean mount
[ 183.819702] XFS (md127): Unmounting Filesystem
[ 184.943831] EXT4-fs (md0): mounted filesystem with ordered data
mode. Opts: (null)
[ 529.784557] EXT4-fs (md0): mounted filesystem with ordered data
mode. Opts: (null)
[ 601.789958] md1: detected capacity change from 33284947968 to 0
[ 601.789973] md: md1 stopped.
[ 602.314112] md0: detected capacity change from 550436864 to 0
[ 602.314128] md: md0 stopped.
[ 602.745030] md: md127: recovery interrupted.
[ 603.131684] md127: detected capacity change from 966229229568 to 0
[ 603.132237] md: md127 stopped.
[ 603.435808] sda: sda1 sda2
[ 603.594074] udevd[5011]: inotify_add_watch(11, /dev/sda2, 10)
failed: No such file or directory
[ 603.643959] sda:
[ 603.844724] sdb: sdb1 sdb2
[ 604.255407] sdb: sdb1
[ 604.490214] udevd[5050]: inotify_add_watch(11, /dev/sdb1, 10)
failed: No such file or directory
[ 605.140952] sdb: sdb1
[ 605.628686] sdb: sdb1 sdb2
[ 606.271192] sdb: sdb1 sdb2 sdb3
[ 607.079626] sdb: sdb1 sdb2 sdb3
[ 607.611092] sda:
[ 608.273201] sda: sda1
[ 608.611952] sda: sda1 sda2
[ 609.031326] sda: sda1 sda2 sda3
[ 609.753140] md/raid10:md1: not clean -- starting background reconstruction
[ 609.753145] md/raid10:md1: active with 2 out of 2 devices
[ 609.768804] md1: detected capacity change from 0 to 32210157568
[ 609.772677] md: resync of RAID array md1
[ 614.590107] XFS (md1): Mounting V5 Filesystem
[ 615.449035] XFS (md1): Ending clean mount
[ 617.678462] md/raid1:md0: not clean -- starting background reconstruction
[ 617.678469] md/raid1:md0: active with 2 out of 2 mirrors
[ 617.740729] md0: detected capacity change from 0 to 524222464
[ 617.747107] md: delaying resync of md0 until md1 has finished
(they share one or more physical units)
What are md0 and md1? Note that I don't see md2 anywhere in this log.

Sorry that I did not clarify that immediately; this log was taken earlier,
during installation, when I happened to catch it in dmesg.
md0 was /boot (with EXT4); md1 was / with XFS.

The example "cat /proc/mdstat" output was taken later, after I brought up the
system (by switching md1 to the "near" layout at install time). So wherever
you see md1 or md2, you can assume they are the same thing: a new RAID10/o2
array being written to during initialization. But the second time there was
nothing in dmesg, so I could not attach that.
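
(In case it matters, the layout of the running array can always be
double-checked with something like:

# mdadm --detail /dev/md1 | grep -i layout

which reports "offset=2" for the o2 case.)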


[ 620.037818] EXT4-fs (md0): mounted filesystem with ordered data
mode. Opts: (null)
[ 1463.754785] INFO: task kworker/0:3:227 blocked for more than 480 seconds.
[ 1463.754793] Not tainted 4.19.5-1-default #1
[ 1463.754795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1463.754799] kworker/0:3 D 0 227 2 0x80000000
[ 1463.755000] Workqueue: xfs-eofblocks/md1 xfs_eofblocks_worker [xfs]
[ 1463.755005] Call Trace:
[ 1463.755025] ? __schedule+0x29a/0x880
[ 1463.755032] ? rwsem_down_write_failed+0x197/0x350
[ 1463.755038] schedule+0x78/0x110
[ 1463.755044] rwsem_down_write_failed+0x197/0x350
[ 1463.755055] call_rwsem_down_write_failed+0x13/0x20
[ 1463.755061] down_write+0x20/0x30
So we have a background task blocked on an inode lock.

[ 1463.755196] xfs_free_eofblocks+0x114/0x1a0 [xfs]
[ 1463.755330] xfs_inode_free_eofblocks+0xd3/0x1e0 [xfs]
[ 1463.755459] ? xfs_inode_ag_walk_grab+0x5b/0x90 [xfs]
[ 1463.755586] xfs_inode_ag_walk.isra.15+0x1aa/0x420 [xfs]
[ 1463.755714] ? __xfs_inode_clear_blocks_tag+0x120/0x120 [xfs]
[ 1463.755727] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1463.755734] ? __switch_to_asm+0x40/0x70
[ 1463.755738] ? __switch_to_asm+0x34/0x70
[ 1463.755743] ? __switch_to_asm+0x40/0x70
[ 1463.755748] ? __switch_to_asm+0x34/0x70
[ 1463.755752] ? __switch_to_asm+0x40/0x70
[ 1463.755757] ? __switch_to_asm+0x34/0x70
[ 1463.755762] ? __switch_to_asm+0x40/0x70
[ 1463.755893] ? __xfs_inode_clear_blocks_tag+0x120/0x120 [xfs]
[ 1463.755900] ? radix_tree_gang_lookup_tag+0xc2/0x140
[ 1463.756032] ? __xfs_inode_clear_blocks_tag+0x120/0x120 [xfs]
[ 1463.756158] xfs_inode_ag_iterator_tag+0x73/0xb0 [xfs]
[ 1463.756288] xfs_eofblocks_worker+0x29/0x40 [xfs]
[ 1463.756298] process_one_work+0x1fd/0x420
[ 1463.756305] worker_thread+0x2d/0x3d0
[ 1463.756311] ? rescuer_thread+0x340/0x340
[ 1463.756316] kthread+0x112/0x130
[ 1463.756322] ? kthread_create_worker_on_cpu+0x40/0x40
[ 1463.756329] ret_from_fork+0x3a/0x50
[ 1463.756375] INFO: task kworker/u4:0:4615 blocked for more than 480 seconds.
[ 1463.756379] Not tainted 4.19.5-1-default #1
[ 1463.756380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1463.756383] kworker/u4:0 D 0 4615 2 0x80000000
[ 1463.756395] Workqueue: writeback wb_workfn (flush-9:1)
[ 1463.756400] Call Trace:
[ 1463.756409] ? __schedule+0x29a/0x880
[ 1463.756420] ? wait_barrier+0xdd/0x170 [raid10]
[ 1463.756426] schedule+0x78/0x110
[ 1463.756433] wait_barrier+0xdd/0x170 [raid10]
[ 1463.756440] ? wait_woken+0x80/0x80
[ 1463.756448] raid10_write_request+0xf2/0x900 [raid10]
[ 1463.756454] ? wait_woken+0x80/0x80
[ 1463.756459] ? mempool_alloc+0x55/0x160
[ 1463.756483] ? md_write_start+0xa9/0x270 [md_mod]
[ 1463.756492] raid10_make_request+0xc1/0x120 [raid10]
[ 1463.756498] ? wait_woken+0x80/0x80
[ 1463.756514] md_handle_request+0x121/0x190 [md_mod]
[ 1463.756535] md_make_request+0x78/0x190 [md_mod]
[ 1463.756544] generic_make_request+0x1c6/0x470
[ 1463.756553] submit_bio+0x45/0x140
Writeback is blocked submitting I/O down in the MD driver.

[ 1463.756714] xfs_submit_ioend+0x9c/0x1e0 [xfs]
[ 1463.756844] xfs_vm_writepages+0x68/0x80 [xfs]
[ 1463.756856] do_writepages+0x31/0xb0
[ 1463.756865] ? read_hpet+0x126/0x130
[ 1463.756873] ? ktime_get+0x36/0xa0
[ 1463.756881] __writeback_single_inode+0x3d/0x3e0
[ 1463.756889] writeback_sb_inodes+0x1c4/0x430
[ 1463.756902] __writeback_inodes_wb+0x5d/0xb0
[ 1463.756910] wb_writeback+0x26b/0x310
[ 1463.756920] wb_workfn+0x33a/0x410
[ 1463.756932] process_one_work+0x1fd/0x420
[ 1463.756940] worker_thread+0x2d/0x3d0
[ 1463.756946] ? rescuer_thread+0x340/0x340
[ 1463.756951] kthread+0x112/0x130
[ 1463.756957] ? kthread_create_worker_on_cpu+0x40/0x40
[ 1463.756965] ret_from_fork+0x3a/0x50
[ 1463.756979] INFO: task kworker/0:2:4994 blocked for more than 480 seconds.
[ 1463.756982] Not tainted 4.19.5-1-default #1
[ 1463.756984] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1463.756987] kworker/0:2 D 0 4994 2 0x80000000
[ 1463.757013] Workqueue: md submit_flushes [md_mod]
[ 1463.757016] Call Trace:
[ 1463.757024] ? __schedule+0x29a/0x880
[ 1463.757034] ? wait_barrier+0xdd/0x170 [raid10]
[ 1463.757039] schedule+0x78/0x110
[ 1463.757047] wait_barrier+0xdd/0x170 [raid10]
[ 1463.757054] ? wait_woken+0x80/0x80
[ 1463.757062] raid10_write_request+0xf2/0x900 [raid10]
[ 1463.757067] ? wait_woken+0x80/0x80
[ 1463.757072] ? mempool_alloc+0x55/0x160
[ 1463.757088] ? md_write_start+0xa9/0x270 [md_mod]
[ 1463.757095] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1463.757104] raid10_make_request+0xc1/0x120 [raid10]
[ 1463.757110] ? wait_woken+0x80/0x80
[ 1463.757126] md_handle_request+0x121/0x190 [md_mod]
[ 1463.757132] ? _raw_spin_unlock_irq+0x22/0x40
[ 1463.757137] ? finish_task_switch+0x74/0x260
[ 1463.757156] submit_flushes+0x21/0x40 [md_mod]
Some other MD task (?) also blocked submitting a request.

[ 1463.757163] process_one_work+0x1fd/0x420
[ 1463.757170] worker_thread+0x2d/0x3d0
[ 1463.757177] ? rescuer_thread+0x340/0x340
[ 1463.757181] kthread+0x112/0x130
[ 1463.757186] ? kthread_create_worker_on_cpu+0x40/0x40
[ 1463.757193] ret_from_fork+0x3a/0x50
[ 1463.757205] INFO: task md1_resync:5215 blocked for more than 480 seconds.
[ 1463.757207] Not tainted 4.19.5-1-default #1
[ 1463.757209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1463.757212] md1_resync D 0 5215 2 0x80000000
[ 1463.757216] Call Trace:
[ 1463.757223] ? __schedule+0x29a/0x880
[ 1463.757231] ? raise_barrier+0x8d/0x140 [raid10]
[ 1463.757236] schedule+0x78/0x110
[ 1463.757243] raise_barrier+0x8d/0x140 [raid10]
[ 1463.757248] ? wait_woken+0x80/0x80
[ 1463.757257] raid10_sync_request+0x1f6/0x1e30 [raid10]
[ 1463.757265] ? _raw_spin_unlock_irq+0x22/0x40
[ 1463.757284] ? is_mddev_idle+0x125/0x137 [md_mod]
[ 1463.757302] md_do_sync.cold.78+0x404/0x969 [md_mod]
The md1 sync task is blocked; I'm not sure on what.

[ 1463.757311] ? wait_woken+0x80/0x80
[ 1463.757336] ? md_rdev_init+0xb0/0xb0 [md_mod]
[ 1463.757351] md_thread+0xe9/0x140 [md_mod]
[ 1463.757358] ? _raw_spin_unlock_irqrestore+0x2e/0x60
[ 1463.757364] ? __kthread_parkme+0x4c/0x70
[ 1463.757369] kthread+0x112/0x130
[ 1463.757374] ? kthread_create_worker_on_cpu+0x40/0x40
[ 1463.757380] ret_from_fork+0x3a/0x50
[ 1463.757395] INFO: task xfsaild/md1:5233 blocked for more than 480 seconds.
[ 1463.757398] Not tainted 4.19.5-1-default #1
[ 1463.757400] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1463.757402] xfsaild/md1 D 0 5233 2 0x80000000
[ 1463.757406] Call Trace:
[ 1463.757413] ? __schedule+0x29a/0x880
[ 1463.757421] ? wait_barrier+0xdd/0x170 [raid10]
[ 1463.757426] schedule+0x78/0x110
[ 1463.757433] wait_barrier+0xdd/0x170 [raid10]
[ 1463.757438] ? wait_woken+0x80/0x80
[ 1463.757446] raid10_write_request+0xf2/0x900 [raid10]
[ 1463.757451] ? wait_woken+0x80/0x80
[ 1463.757455] ? mempool_alloc+0x55/0x160
[ 1463.757471] ? md_write_start+0xa9/0x270 [md_mod]
[ 1463.757477] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1463.757485] raid10_make_request+0xc1/0x120 [raid10]
[ 1463.757491] ? wait_woken+0x80/0x80
[ 1463.757507] md_handle_request+0x121/0x190 [md_mod]
[ 1463.757527] md_make_request+0x78/0x190 [md_mod]
[ 1463.757536] generic_make_request+0x1c6/0x470
[ 1463.757544] submit_bio+0x45/0x140
xfsaild (metadata writeback) is also blocked submitting I/O down in the
MD driver.

[ 1463.757552] ? bio_add_page+0x48/0x60
[ 1463.757716] _xfs_buf_ioapply+0x2c1/0x450 [xfs]
[ 1463.757849] ? xfs_buf_delwri_submit_buffers+0xec/0x280 [xfs]
[ 1463.757974] __xfs_buf_submit+0x67/0x270 [xfs]
[ 1463.758102] xfs_buf_delwri_submit_buffers+0xec/0x280 [xfs]
[ 1463.758232] ? xfsaild+0x294/0x7e0 [xfs]
[ 1463.758364] xfsaild+0x294/0x7e0 [xfs]
[ 1463.758377] ? _raw_spin_unlock_irqrestore+0x2e/0x60
[ 1463.758508] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[ 1463.758514] kthread+0x112/0x130
[ 1463.758520] ? kthread_create_worker_on_cpu+0x40/0x40
[ 1463.758527] ret_from_fork+0x3a/0x50
[ 1463.758543] INFO: task rpm:5364 blocked for more than 480 seconds.
[ 1463.758546] Not tainted 4.19.5-1-default #1
[ 1463.758547] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1463.758550] rpm D 0 5364 3757 0x00000000
[ 1463.758554] Call Trace:
[ 1463.758563] ? __schedule+0x29a/0x880
[ 1463.758701] ? xlog_wait+0x5c/0x70 [xfs]
[ 1463.759821] schedule+0x78/0x110
[ 1463.760022] xlog_wait+0x5c/0x70 [xfs]
[ 1463.760036] ? wake_up_q+0x70/0x70
[ 1463.760167] __xfs_log_force_lsn+0x223/0x230 [xfs]
[ 1463.760297] ? xfs_file_fsync+0x196/0x1d0 [xfs]
[ 1463.760424] xfs_log_force_lsn+0x93/0x140 [xfs]
[ 1463.760552] xfs_file_fsync+0x196/0x1d0 [xfs]
An fsync is blocked, presumably on XFS log I/O completion.

[ 1463.760562] ? __sb_end_write+0x36/0x60
[ 1463.760571] do_fsync+0x38/0x70
[ 1463.760578] __x64_sys_fdatasync+0x13/0x20
[ 1463.760585] do_syscall_64+0x60/0x110
[ 1463.760594] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1463.760603] RIP: 0033:0x7f9757fae8a4
[ 1463.760616] Code: Bad RIP value.
[ 1463.760619] RSP: 002b:00007fff74fdb428 EFLAGS: 00000246 ORIG_RAX:
000000000000004b
[ 1463.760654] RAX: ffffffffffffffda RBX: 0000000000000064 RCX: 00007f9757fae8a4
[ 1463.760657] RDX: 00000000012c4c60 RSI: 00000000012cc130 RDI: 0000000000000004
[ 1463.760660] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f9758708c00
[ 1463.760662] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000012cc130
[ 1463.760665] R13: 000000000123a3a0 R14: 0000000000010830 R15: 0000000000000062
[ 1463.760679] INFO: task kworker/0:8:5367 blocked for more than 480 seconds.
[ 1463.760683] Not tainted 4.19.5-1-default #1
[ 1463.760684] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1463.760687] kworker/0:8 D 0 5367 2 0x80000000
[ 1463.760718] Workqueue: md submit_flushes [md_mod]
And that MD submit_flushes thing again.

Not to say there isn't some issue between XFS and MD going on here, but
I think we might want an MD person to take a look at this and possibly
provide some insight. From an XFS perspective, this all just looks like
we're blocked on I/O (via writeback, AIL and log) to a slow device.

Brian

[ 1463.760721] Call Trace:
[ 1463.760731] ? __schedule+0x29a/0x880
[ 1463.760741] ? wait_barrier+0xdd/0x170 [raid10]
[ 1463.760746] schedule+0x78/0x110
[ 1463.760753] wait_barrier+0xdd/0x170 [raid10]
[ 1463.760761] ? wait_woken+0x80/0x80
[ 1463.760768] raid10_write_request+0xf2/0x900 [raid10]
[ 1463.760774] ? wait_woken+0x80/0x80
[ 1463.760778] ? mempool_alloc+0x55/0x160
[ 1463.760795] ? md_write_start+0xa9/0x270 [md_mod]
[ 1463.760801] ? try_to_wake_up+0x44/0x470
[ 1463.760810] raid10_make_request+0xc1/0x120 [raid10]
[ 1463.760816] ? wait_woken+0x80/0x80
[ 1463.760831] md_handle_request+0x121/0x190 [md_mod]
[ 1463.760851] md_make_request+0x78/0x190 [md_mod]
[ 1463.760860] generic_make_request+0x1c6/0x470
[ 1463.760870] raid10_write_request+0x77a/0x900 [raid10]
[ 1463.760875] ? wait_woken+0x80/0x80
[ 1463.760879] ? mempool_alloc+0x55/0x160
[ 1463.760895] ? md_write_start+0xa9/0x270 [md_mod]
[ 1463.760904] raid10_make_request+0xc1/0x120 [raid10]
[ 1463.760910] ? wait_woken+0x80/0x80
[ 1463.760926] md_handle_request+0x121/0x190 [md_mod]
[ 1463.760931] ? _raw_spin_unlock_irq+0x22/0x40
[ 1463.760936] ? finish_task_switch+0x74/0x260
[ 1463.760954] submit_flushes+0x21/0x40 [md_mod]
[ 1463.760962] process_one_work+0x1fd/0x420
[ 1463.760970] worker_thread+0x2d/0x3d0
[ 1463.760976] ? rescuer_thread+0x340/0x340
[ 1463.760981] kthread+0x112/0x130
[ 1463.760986] ? kthread_create_worker_on_cpu+0x40/0x40
[ 1463.760992] ret_from_fork+0x3a/0x50

--
Best regards,
Siniša Bandin




