On 1/31/20 11:22 AM, Guoqing Jiang wrote:
On 1/31/20 3:43 PM, Nigel Croxon wrote:
Hello all,
I have a customer issue, low memory deadlock with md devices and
external (imsm) metadata handling
https://bugzilla.redhat.com/show_bug.cgi?id=1703180
Has anyone else seen this problem?
I think most people outside of RH are not authorized to access the website.
Guoqing
A customer has been experiencing periodic system hangs. A vmcore was
captured from one stuck system, and tasks were found to be stuck in
md_write_start. One example:
crash> bt 20908
PID: 20908 TASK: ffff96ce1e16c100 CPU: 46 COMMAND: "kworker/u114:1"
#0 [ffff96c68627b728] __schedule at ffffffff8a567747
#1 [ffff96c68627b7b0] schedule at ffffffff8a567c49
#2 [ffff96c68627b7c0] md_write_start at ffffffff8a38ed7c
#3 [ffff96c68627b820] raid1_write_request at ffffffffc053500c [raid1]
#4 [ffff96c68627b910] raid1_make_request at ffffffffc0535ac2 [raid1]
#5 [ffff96c68627b9d0] md_handle_request at ffffffff8a38d740
#6 [ffff96c68627ba40] md_make_request at ffffffff8a38d8a9
#7 [ffff96c68627ba68] generic_make_request at ffffffff8a145267
#8 [ffff96c68627bac0] submit_bio at ffffffff8a145510
#9 [ffff96c68627bb18] xfs_submit_ioend at ffffffffc080f251 [xfs]
#10 [ffff96c68627bb40] xfs_vm_writepages at ffffffffc080f3bf [xfs]
#11 [ffff96c68627bbb0] do_writepages at ffffffff89fc3431
#12 [ffff96c68627bbc0] __writeback_single_inode at ffffffff8a06fee0
#13 [ffff96c68627bc08] writeback_sb_inodes at ffffffff8a070974
#14 [ffff96c68627bcb0] __writeback_inodes_wb at ffffffff8a070cdf
#15 [ffff96c68627bcf8] wb_writeback at ffffffff8a071513
#16 [ffff96c68627bd70] bdi_writeback_workfn at ffffffff8a071d9c
#17 [ffff96c68627be20] process_one_work at ffffffff89eb9d8f
#18 [ffff96c68627be68] worker_thread at ffffffff89ebae26
#19 [ffff96c68627bec8] kthread at ffffffff89ec1c71
The tasks were waiting for MD_SB_CHANGE_PENDING to clear.
crash> px ((struct mddev *)0xffff96c652c37800)->sb_flags
$5 = 0x4
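In the upstream mddev_sb_flags enum, MD_SB_CHANGE_PENDING is bit 2, so
sb_flags == 0x4 means exactly that flag is set. The wait the tasks are
stuck in is at the end of md_write_start(); roughly, simplified from
upstream drivers/md/md.c (the RHEL kernel is a backport, so details may
differ):

	/* simplified excerpt of upstream md_write_start() */
	if (did_change)
		sysfs_notify_dirent_safe(mddev->sysfs_state);	/* poke sysfs */
	wait_event(mddev->sb_wait,
		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
		   mddev->suspended);

So md_write_start() itself fires the sysfs notification and then sleeps
on sb_wait until something else clears MD_SB_CHANGE_PENDING.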
But the system isn't using md-native metadata. Instead, it's using
"imsm" metadata, which is handled by mdmon rather than by the md
device's kernel thread. The kernel records the use of a userspace
metadata handler by flagging the mddev as external:
crash> px ((struct mddev *)0xffff96c652c37800)->external
$14 = 0x1
And the sosreport shows the metadata version as "imsm" rather than a native md version.
$ cat sos_commands/md/mdadm_-D_.dev.md | grep Version
Version : imsm
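Since the metadata is external, mdmon is the process that has to react
to these state changes: it watches md's sysfs files for change
notifications and writes the results back to let the array proceed. As
a minimal illustration of the usual sysfs change-notification pattern
from userspace (this is not mdmon's actual code, and the path below is
only an example):

#include <fcntl.h>
#include <unistd.h>
#include <sys/select.h>

int main(void)
{
	/* example attribute; mdmon watches several files under md/ */
	int fd = open("/sys/block/md127/md/array_state", O_RDONLY);
	char buf[64];
	fd_set fds;

	if (fd < 0)
		return 1;

	for (;;) {
		/* read current contents to arm the notification, then wait */
		lseek(fd, 0, SEEK_SET);
		if (read(fd, buf, sizeof(buf)) < 0)
			break;

		/* sysfs reports a change as an exceptional condition */
		FD_ZERO(&fds);
		FD_SET(fd, &fds);
		if (select(fd + 1, NULL, NULL, &fds, NULL) < 0)
			break;

		/* at this point a monitor like mdmon would update the
		   external metadata and write the new state back */
	}
	close(fd);
	return 0;
}

The important part is that select() only returns if kernfs actually
delivers the wakeup for the file.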
The mdmon task had been idle, waiting in select(), for slightly longer
than the oldest tasks stuck in uninterruptible sleep:
crash> ps -m mdmon
[ 0 00:14:12.722] [IN] PID: 12346 TASK: ffff96ce7ff3d140 CPU: 8  COMMAND: "mdmon"
[ 2 16:33:16.686] [IN] PID: 12344 TASK: ffff96ce5d3430c0 CPU: 27  COMMAND: "mdmon"
crash> ps -m | grep UN | tail
[ 0 00:14:08.338] [UN] PID: 310 TASK: ffff96ce7f716180 CPU: 16  COMMAND: "kswapd1"
[ 0 00:14:08.404] [UN] PID: 12057 TASK: ffff96bf61f78000 CPU: 0  COMMAND: "splunkd"
[ 0 00:14:08.150] [UN] PID: 10929 TASK: ffff96c6571e5140 CPU: 20  COMMAND: "splunkd"
[ 0 00:14:08.180] [UN] PID: 1 TASK: ffff96bf74b60000 CPU: 33  COMMAND: "systemd"
[ 0 00:14:08.231] [UN] PID: 11179 TASK: ffff96c6b2bc6180 CPU: 55  COMMAND: "splunkd"
[ 0 00:14:08.138] [UN] PID: 11713 TASK: ffff96ce7f711040 CPU: 15  COMMAND: "splunkd"
[ 0 00:14:08.155] [UN] PID: 12017 TASK: ffff96c6ef518000 CPU: 17  COMMAND: "splunkd"
[ 0 00:14:08.224] [UN] PID: 11963 TASK: ffff96bf745f8000 CPU: 52  COMMAND: "splunkd"
[ 0 00:14:09.561] [UN] PID: 20684 TASK: ffff96c65342c100 CPU: 8  COMMAND: "kworker/8:2"
[ 0 00:14:09.757] [UN] PID: 22690 TASK: ffff96ce502ad140 CPU: 42  COMMAND: "kworker/42:1"
mdmon never handled the change event and never called back into md to
clear the flag, because the sysfs file it was polling never received
the event that md initiates by calling sysfs_notify_dirent(). The
processing that pushes the event to the file and wakes up any polling
tasks was still sitting on the kernfs notify list:
crash> px ((struct mddev *)0xffff96c652c37800)->sysfs_state
$30 = (struct kernfs_node *) 0xffff96c658f90d20
crash> p kernfs_notify_list
kernfs_notify_list = $31 = (struct kernfs_node *) 0xffff96c658f90d20
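This matches how kernfs delivers these events in kernels of this
vintage: sysfs_notify_dirent() lands in kernfs_notify(), which only
chains the node onto the global kernfs_notify_list and schedules a
work item; the actual poll wakeup happens later, in
kernfs_notify_workfn(). Roughly, simplified from fs/kernfs/file.c of
that era (newer kernels have changed some of this):

void kernfs_notify(struct kernfs_node *kn)
{
	static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
	unsigned long flags;

	if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
		return;

	/* chain the node onto the global notify list ... */
	spin_lock_irqsave(&kernfs_notify_lock, flags);
	if (!kn->attr.notify_next) {
		kernfs_get(kn);
		kn->attr.notify_next = kernfs_notify_list;
		kernfs_notify_list = kn;
		/* ... and kick the work item, which runs on system_wq */
		schedule_work(&kernfs_notify_work);
	}
	spin_unlock_irqrestore(&kernfs_notify_lock, flags);
}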
The notification was still pending because the work item that handles
the notify events on this list could not get a kworker to run it, due
to the low-memory condition. kernfs uses the generic system_wq
workqueue for this work item. The work item was sitting in the
worker_pool for CPU 8, which had no idle workers left (nr_idle = 0 below):
crash> struct worker_pool 0xffff96c65f81a400
struct worker_pool {
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  cpu = 8,
  node = 0,
  id = 16,
  flags = 0,
  watchdog_ts = 8210378176,
  worklist = {
    next = 0xffffffff8aab4a88,
    prev = 0xffff96c65f817d08
  },
  nr_workers = 2,
  nr_idle = 0,
  ...
Walking the list of pending items:
crash> struct worker_pool.worklist 0xffff96c65f81a400 -o
struct worker_pool {
  [ffff96c65f81a420] struct list_head worklist;
}
crash> list -H ffff96c65f81a420 -o work_struct.entry -s work_struct.func
ffffffff8aab4a80
  func = 0xffffffff8a0cb660
ffff96c65d561428
  func = 0xffffffffc0839d30
ffff96c65a62b108
  func = 0xffffffffc03bec80 <dm_wq_work>
ffff96c652c34108
  func = 0xffffffffc03bec80 <dm_wq_work>
ffff96c65a629108
  func = 0xffffffffc03bec80 <dm_wq_work>
ffff96ce5a4af9f0
  func = 0xffffffff8a2560d0
ffff96bf7fd5b800
  func = 0xffffffff8a25f4b0
ffff96ce4dadbd10
  func = 0xffffffff89eb7300
ffff96c65f817d00
  func = 0xffffffff89fd7b40
crash> sym 0xffffffff8a0cb660
ffffffff8a0cb660 (t) kernfs_notify_workfn /usr/src/debug/kernel-3.10.0-957.5.1.el7/linux-3.10.0-957.5.1.el7.x86_64/fs/kernfs/file.c: 800
So the first item on the list was the very work item that would process the pending notifications.
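For reference, that work function is where the poll wakeup actually
happens in this kernel; until it runs, nothing wakes tasks sleeping in
select()/poll() on the file. A rough sketch of the relevant part,
simplified from fs/kernfs/file.c of that era (I believe later kernels
move the poll wakeup out of the work item):

static void kernfs_notify_workfn(struct work_struct *work)
{
	struct kernfs_node *kn;
	struct kernfs_open_node *on;
repeat:
	/* pop one node off kernfs_notify_list */
	spin_lock_irq(&kernfs_notify_lock);
	kn = kernfs_notify_list;
	if (kn == KERNFS_NOTIFY_EOL) {
		spin_unlock_irq(&kernfs_notify_lock);
		return;
	}
	kernfs_notify_list = kn->attr.notify_next;
	kn->attr.notify_next = NULL;
	spin_unlock_irq(&kernfs_notify_lock);

	/* kick poll: this wake_up is what mdmon's select() is waiting for */
	spin_lock_irq(&kernfs_open_node_lock);
	on = kn->attr.open;
	if (on) {
		atomic_inc(&on->event);
		wake_up_interruptible(&on->poll);
	}
	spin_unlock_irq(&kernfs_open_node_lock);

	/* (fsnotify delivery and kernfs_put() elided) */
	goto repeat;
}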
md's use of notifications through sysfs (kernfs) files with external
metadata formats like imsm can therefore deadlock under memory
pressure. For writes to make progress and free up memory, md needs the
notification that wakes mdmon to be delivered. But the notification
cannot be delivered, because under memory pressure there is no kworker
available to run kernfs's notify work item. The system deadlocked.
This issue is triggered by md's use of notifications through sysfs, but
I'm not sure it really belongs as an md bug. kernfs's use of the shared
system_wq makes the file events md relies on unreliable under memory
pressure.
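For comparison, the usual way a subsystem guarantees that its work
items keep making progress during memory reclaim is a dedicated
WQ_MEM_RECLAIM workqueue, which comes with a rescuer thread; system_wq,
which schedule_work() uses, has no such guarantee. A minimal sketch of
that idiom, with hypothetical names and not meant as a proposed patch:

#include <linux/module.h>
#include <linux/workqueue.h>

/* hypothetical example module, not kernfs code */
static struct workqueue_struct *example_notify_wq;

static void example_notify_workfn(struct work_struct *work)
{
	/* process pending notifications here */
}
static DECLARE_WORK(example_notify_work, example_notify_workfn);

static int __init example_init(void)
{
	/*
	 * WQ_MEM_RECLAIM gives the queue a rescuer thread, so queued work
	 * can still run even when no new kworker can be created.
	 */
	example_notify_wq = alloc_workqueue("example_notify", WQ_MEM_RECLAIM, 0);
	if (!example_notify_wq)
		return -ENOMEM;

	queue_work(example_notify_wq, &example_notify_work);	/* instead of schedule_work() */
	return 0;
}

static void __exit example_exit(void)
{
	destroy_workqueue(example_notify_wq);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");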