On 1/31/20 11:22 AM, Guoqing Jiang wrote:
On 1/31/20 3:43 PM, Nigel Croxon wrote:
Hello all,
I have a customer issue, low memory deadlock with md devices and
external (imsm) metadata handling
https://bugzilla.redhat.com/show_bug.cgi?id=1703180
Has anyone else seen this problem?
I think most people outside of RH are not authorized to access the website.
Guoqing
A customer has been experiencing periodic system hangs. A vmcore was
captured from one stuck system, and tasks were found to be stuck in
md_write_start. One example:
crash> bt 20908
PID: 20908 TASK: ffff96ce1e16c100 CPU: 46 COMMAND: "kworker/u114:1"
#0 [ffff96c68627b728] __schedule at ffffffff8a567747
#1 [ffff96c68627b7b0] schedule at ffffffff8a567c49
#2 [ffff96c68627b7c0] md_write_start at ffffffff8a38ed7c
#3 [ffff96c68627b820] raid1_write_request at ffffffffc053500c [raid1]
#4 [ffff96c68627b910] raid1_make_request at ffffffffc0535ac2 [raid1]
#5 [ffff96c68627b9d0] md_handle_request at ffffffff8a38d740
#6 [ffff96c68627ba40] md_make_request at ffffffff8a38d8a9
#7 [ffff96c68627ba68] generic_make_request at ffffffff8a145267
#8 [ffff96c68627bac0] submit_bio at ffffffff8a145510
#9 [ffff96c68627bb18] xfs_submit_ioend at ffffffffc080f251 [xfs]
#10 [ffff96c68627bb40] xfs_vm_writepages at ffffffffc080f3bf [xfs]
#11 [ffff96c68627bbb0] do_writepages at ffffffff89fc3431
#12 [ffff96c68627bbc0] __writeback_single_inode at ffffffff8a06fee0
#13 [ffff96c68627bc08] writeback_sb_inodes at ffffffff8a070974
#14 [ffff96c68627bcb0] __writeback_inodes_wb at ffffffff8a070cdf
#15 [ffff96c68627bcf8] wb_writeback at ffffffff8a071513
#16 [ffff96c68627bd70] bdi_writeback_workfn at ffffffff8a071d9c
#17 [ffff96c68627be20] process_one_work at ffffffff89eb9d8f
#18 [ffff96c68627be68] worker_thread at ffffffff89ebae26
#19 [ffff96c68627bec8] kthread at ffffffff89ec1c71
The tasks were waiting for MD_SB_CHANGE_PENDING to clear.
crash> px ((struct mddev *)0xffff96c652c37800)->sb_flags
$5 = 0x4
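In the upstream mddev_sb_flags enum, MD_SB_CHANGE_PENDING is bit 2, so
sb_flags == 0x4 means exactly that flag is set. The wait the tasks are
stuck in is at the end of md_write_start(); roughly, simplified from
upstream drivers/md/md.c (the RHEL kernel is a backport, so details may
differ):

	/* simplified excerpt of upstream md_write_start() */
	if (did_change)
		sysfs_notify_dirent_safe(mddev->sysfs_state);	/* poke sysfs */
	wait_event(mddev->sb_wait,
		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
		   mddev->suspended);

So md_write_start() itself fires the sysfs notification and then sleeps
on sb_wait until something else clears MD_SB_CHANGE_PENDING.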
But the system isn't using md-native metadata. Instead, it's using
"imsm" metadata, which is handled by mdmon rather than by the md
device's kernel thread. The kernel records the use of a userspace
metadata handler by flagging the mddev as external:
crash> px ((struct mddev *)0xffff96c652c37800)->external
$14 = 0x1
And the sosreport shows the metadata version as "imsm" rather than a native md version.
$ cat sos_commands/md/mdadm_-D_.dev.md | grep Version
Version : imsm
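Since the metadata is external, mdmon is the process that has to react
to these state changes: it watches md's sysfs files for change
notifications and writes the results back to let the array proceed. As
a minimal illustration of the usual sysfs change-notification pattern
from userspace (this is not mdmon's actual code, and the path below is
only an example):

#include <fcntl.h>
#include <unistd.h>
#include <sys/select.h>

int main(void)
{
	/* example attribute; mdmon watches several files under md/ */
	int fd = open("/sys/block/md127/md/array_state", O_RDONLY);
	char buf[64];
	fd_set fds;

	if (fd < 0)
		return 1;

	for (;;) {
		/* read current contents to arm the notification, then wait */
		lseek(fd, 0, SEEK_SET);
		if (read(fd, buf, sizeof(buf)) < 0)
			break;

		/* sysfs reports a change as an exceptional condition */
		FD_ZERO(&fds);
		FD_SET(fd, &fds);
		if (select(fd + 1, NULL, NULL, &fds, NULL) < 0)
			break;

		/* at this point a monitor like mdmon would update the
		   external metadata and write the new state back */
	}
	close(fd);
	return 0;
}

The important part is that select() only returns if kernfs actually
delivers the wakeup for the file.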
The mdmon task had been idle, waiting in select(), for slightly longer
than the oldest tasks stuck in uninterruptible sleep:
crash> ps -m mdmon
[ 0 00:14:12.722] [IN] PID: 12346 TASK: ffff96ce7ff3d140 CPU: 8  COMMAND: "mdmon"
[ 2 16:33:16.686] [IN] PID: 12344 TASK: ffff96ce5d3430c0 CPU: 27  COMMAND: "mdmon"
crash> ps -m | grep UN | tail
[ 0 00:14:08.338] [UN] PID: 310 TASK: ffff96ce7f716180 CPU: 16  COMMAND: "kswapd1"
[ 0 00:14:08.404] [UN] PID: 12057 TASK: ffff96bf61f78000 CPU: 0  COMMAND: "splunkd"
[ 0 00:14:08.150] [UN] PID: 10929 TASK: ffff96c6571e5140 CPU: 20  COMMAND: "splunkd"
[ 0 00:14:08.180] [UN] PID: 1 TASK: ffff96bf74b60000 CPU: 33  COMMAND: "systemd"
[ 0 00:14:08.231] [UN] PID: 11179 TASK: ffff96c6b2bc6180 CPU: 55  COMMAND: "splunkd"
[ 0 00:14:08.138] [UN] PID: 11713 TASK: ffff96ce7f711040 CPU: 15  COMMAND: "splunkd"
[ 0 00:14:08.155] [UN] PID: 12017 TASK: ffff96c6ef518000 CPU: 17  COMMAND: "splunkd"
[ 0 00:14:08.224] [UN] PID: 11963 TASK: ffff96bf745f8000 CPU: 52  COMMAND: "splunkd"
[ 0 00:14:09.561] [UN] PID: 20684 TASK: ffff96c65342c100 CPU: 8  COMMAND: "kworker/8:2"
[ 0 00:14:09.757] [UN] PID: 22690 TASK: ffff96ce502ad140 CPU: 42  COMMAND: "kworker/42:1"
mdmon never handled the change event and never called back into md to
clear the flag, because the sysfs file it was polling never received
the event that md initiates by calling sysfs_notify_dirent(). The
processing that pushes the event to the file and wakes up any polling
tasks was still sitting on the kernfs notify list:
crash> px ((struct mddev *)0xffff96c652c37800)->sysfs_state
$30 = (struct kernfs_node *) 0xffff96c658f90d20
crash> p kernfs_notify_list
kernfs_notify_list = $31 = (struct kernfs_node *) 0xffff96c658f90d20
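This matches how kernfs delivers these events in kernels of this
vintage: sysfs_notify_dirent() lands in kernfs_notify(), which only
chains the node onto the global kernfs_notify_list and schedules a
work item; the actual poll wakeup happens later, in
kernfs_notify_workfn(). Roughly, simplified from fs/kernfs/file.c of
that era (newer kernels have changed some of this):

void kernfs_notify(struct kernfs_node *kn)
{
	static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
	unsigned long flags;

	if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
		return;

	/* chain the node onto the global notify list ... */
	spin_lock_irqsave(&kernfs_notify_lock, flags);
	if (!kn->attr.notify_next) {
		kernfs_get(kn);
		kn->attr.notify_next = kernfs_notify_list;
		kernfs_notify_list = kn;
		/* ... and kick the work item, which runs on system_wq */
		schedule_work(&kernfs_notify_work);
	}
	spin_unlock_irqrestore(&kernfs_notify_lock, flags);
}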
The notification was still pending because the work item that handles
the notify events on this list could not get a kworker to run it, due
to the low-memory condition. kernfs uses the generic system_wq
workqueue for this work item. The work item was sitting in the
worker_pool for CPU 8, which had no idle workers left (nr_idle = 0 below):
crash> struct worker_pool 0xffff96c65f81a400
struct worker_pool {
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  cpu = 8,
  node = 0,
  id = 16,
  flags = 0,
  watchdog_ts = 8210378176,
  worklist = {
    next = 0xffffffff8aab4a88,
    prev = 0xffff96c65f817d08
  },
  nr_workers = 2,
  nr_idle = 0,
  ...
Walking the list of pending items:
crash> struct worker_pool.worklist 0xffff96c65f81a400 -o
struct worker_pool {
  [ffff96c65f81a420] struct list_head worklist;
}
crash> list -H ffff96c65f81a420 -o work_struct.entry -s work_struct.func
ffffffff8aab4a80
  func = 0xffffffff8a0cb660
ffff96c65d561428
  func = 0xffffffffc0839d30
ffff96c65a62b108
  func = 0xffffffffc03bec80 <dm_wq_work>
ffff96c652c34108
  func = 0xffffffffc03bec80 <dm_wq_work>
ffff96c65a629108
  func = 0xffffffffc03bec80 <dm_wq_work>
ffff96ce5a4af9f0
  func = 0xffffffff8a2560d0
ffff96bf7fd5b800
  func = 0xffffffff8a25f4b0
ffff96ce4dadbd10
  func = 0xffffffff89eb7300
ffff96c65f817d00
  func = 0xffffffff89fd7b40
crash> sym 0xffffffff8a0cb660
ffffffff8a0cb660 (t) kernfs_notify_workfn /usr/src/debug/kernel-3.10.0-957.5.1.el7/linux-3.10.0-957.5.1.el7.x86_64/fs/kernfs/file.c: 800
So the first item on the list was the very work item that would process the pending notifications.
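For reference, that work function is where the poll wakeup actually
happens in this kernel; until it runs, nothing wakes tasks sleeping in
select()/poll() on the file. A rough sketch of the relevant part,
simplified from fs/kernfs/file.c of that era (I believe later kernels
move the poll wakeup out of the work item):

static void kernfs_notify_workfn(struct work_struct *work)
{
	struct kernfs_node *kn;
	struct kernfs_open_node *on;
repeat:
	/* pop one node off kernfs_notify_list */
	spin_lock_irq(&kernfs_notify_lock);
	kn = kernfs_notify_list;
	if (kn == KERNFS_NOTIFY_EOL) {
		spin_unlock_irq(&kernfs_notify_lock);
		return;
	}
	kernfs_notify_list = kn->attr.notify_next;
	kn->attr.notify_next = NULL;
	spin_unlock_irq(&kernfs_notify_lock);

	/* kick poll: this wake_up is what mdmon's select() is waiting for */
	spin_lock_irq(&kernfs_open_node_lock);
	on = kn->attr.open;
	if (on) {
		atomic_inc(&on->event);
		wake_up_interruptible(&on->poll);
	}
	spin_unlock_irq(&kernfs_open_node_lock);

	/* (fsnotify delivery and kernfs_put() elided) */
	goto repeat;
}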
md's use of notifications through sysfs (kernfs) files with external
metadata formats like imsm can therefore deadlock under memory
pressure. For writes to make progress and free up memory, md needs the
notification that wakes mdmon to be delivered. But the notification
cannot be delivered, because under memory pressure there is no kworker
available to run kernfs's notify work item. The system deadlocked.
This issue is triggered by md's use of notifications through sysfs, but
I'm not sure it really belongs as an md bug. kernfs's use of the shared
system_wq makes the file events md relies on unreliable under memory
pressure.
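For comparison, the usual way a subsystem guarantees that its work
items keep making progress during memory reclaim is a dedicated
WQ_MEM_RECLAIM workqueue, which comes with a rescuer thread; system_wq,
which schedule_work() uses, has no such guarantee. A minimal sketch of
that idiom, with hypothetical names and not meant as a proposed patch:

#include <linux/module.h>
#include <linux/workqueue.h>

/* hypothetical example module, not kernfs code */
static struct workqueue_struct *example_notify_wq;

static void example_notify_workfn(struct work_struct *work)
{
	/* process pending notifications here */
}
static DECLARE_WORK(example_notify_work, example_notify_workfn);

static int __init example_init(void)
{
	/*
	 * WQ_MEM_RECLAIM gives the queue a rescuer thread, so queued work
	 * can still run even when no new kworker can be created.
	 */
	example_notify_wq = alloc_workqueue("example_notify", WQ_MEM_RECLAIM, 0);
	if (!example_notify_wq)
		return -ENOMEM;

	queue_work(example_notify_wq, &example_notify_work);	/* instead of schedule_work() */
	return 0;
}

static void __exit example_exit(void)
{
	destroy_workqueue(example_notify_wq);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");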