> Hello Sudhakar,
>
> First, let's discuss the master branch kernel.
>
> What command or action stands for "tear down"?
> From your description, it is very likely ioctl STOP_ARRAY.
> Your crash is related to super_written, which is the callback for
> updating the array sb, not the bitmap sb. In md_update_sb() there is a sync
> point, md_super_wait(), which guarantees that all sb bios have finished
> successfully.
>
> For your patch, did you check md_bitmap_free, which already does your
> patch's job?
>
> The call flow:
> ```
> do_md_stop //by STOP_ARRAY
> + __md_stop_writes()
> | md_bitmap_flush
> | md_update_sb
> | + md_super_write
> | | bio->bi_end_io = super_written
> | + md_super_wait(mddev) //wait for all bios done
> + __md_stop(mddev)
> | md_bitmap_destroy(mddev);
> | + md_bitmap_free //see below
> + ...
>
> md_bitmap_free
> {
> ...
> //does your patch's job.
> /* Shouldn't be needed - but just in case.... */
> wait_event(bitmap->write_wait,
>            atomic_read(&bitmap->pending_writes) == 0);
> ...
> }
> ```
>
> Would you share more analysis or test results for your patch?

Hello Heming,

Please find below more details about our system configuration and the
detailed sequence that led to the crash.

We have a RAID1 configured on an external imsm container. The /proc/mdstat
output looks like this:

***
md24 : active raid1 sdb[1] sda[0]
      141557760 blocks super external:/md127/0 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md127 : inactive sdb[1](S) sda[0](S)
      10402 blocks super external:imsm
***

All of our system partitions are laid out on top of the above RAID1 device,
as shown below.

***
/dev/md24p5 on / type xfs (rw,relatime,attr2,inode64,noquota)
/dev/md24p1 on /boot type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p15 on /home type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p12 on /var type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p16 on /tmp type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p11 on /var/log type xfs (rw,nodev,relatime,attr2,inode64,noquota)
/dev/md24p14 on /var/log/audit type xfs (rw,nodev,relatime,attr2,inode64,noquota)
***

In this configuration, the kernel panic described in this patch occurs during
"shutdown" of the server. Since the system partitions are laid out on the
RAID1, there can be write I/Os going to the RAID device from the system log
and filesystem activity that typically occurs during shutdown.

From the core dump, I see that, after the filesystems are unmounted and all
other userspace processes are killed, the "mdmon" thread initiates the "stop"
on the md device.
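Before walking through that stop sequence, here is roughly how the superblock
page writes are accounted in drivers/md/md.c, since the analysis below hinges
on it. The snippet is abridged and paraphrased rather than quoted verbatim
(allocation/flag details vary between kernel versions, and per-rdev accounting
and error handling are elided), but the pending_writes/sb_wait pairing is the
relevant part:

***
/* Abridged sketch of the md superblock-page write path, not verbatim code. */
static void super_written(struct bio *bio)
{
	struct md_rdev *rdev = bio->bi_private;
	struct mddev *mddev = rdev->mddev;   /* cleared once the rdev is unbound */

	/* error handling elided */
	if (atomic_dec_and_test(&mddev->pending_writes))
		wake_up(&mddev->sb_wait);    /* md_super_wait() sleeps on sb_wait */
	bio_put(bio);
}

void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
		    sector_t sector, int size, struct page *page)
{
	struct bio *bio = bio_alloc_mddev(GFP_NOIO, 1, mddev);

	bio_set_dev(bio, rdev->bdev);
	bio->bi_iter.bi_sector = sector + rdev->sb_start;
	bio_add_page(bio, page, size, 0);
	bio->bi_private = rdev;
	bio->bi_end_io = super_written;      /* completion callback */
	bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA;

	atomic_inc(&mddev->pending_writes);  /* accounted on the mddev */
	submit_bio(bio);                     /* submitted asynchronously */
}
***

The backtrace of the mdmon thread from the core dump: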
***
COMMAND: "mdmon"
   TASK: ffff8ff2b1663c80  [THREAD_INFO: ffff8ff2b1663c80]
    CPU: 30
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 7390  TASK: ffff8ff2b1663c80  CPU: 30  COMMAND: "mdmon"
 #0 [ffffb99d4eacba30] __schedule at ffffffff91884acc
 #1 [ffffb99d4eacbad0] schedule at ffffffff918850e6
 #2 [ffffb99d4eacbae8] schedule_timeout at ffffffff91889616
 #3 [ffffb99d4eacbb78] wait_for_completion at ffffffff91885d1b
 #4 [ffffb99d4eacbbe0] __wait_rcu_gp at ffffffff9110c123
 #5 [ffffb99d4eacbc28] synchronize_sched at ffffffff9111008e
 #6 [ffffb99d4eacbc78] unbind_rdev_from_array at ffffffff9169640f
 #7 [ffffb99d4eacbcc0] do_md_stop at ffffffff9169f163
 #8 [ffffb99d4eacbd58] array_state_store at ffffffff9169f644
 #9 [ffffb99d4eacbd90] md_attr_store at ffffffff9169b71e
#10 [ffffb99d4eacbdc8] sysfs_kf_write at ffffffff91320edf
#11 [ffffb99d4eacbdd8] kernfs_fop_write at ffffffff913203c4
#12 [ffffb99d4eacbe18] __vfs_write at ffffffff9128fcfa
#13 [ffffb99d4eacbea0] vfs_write at ffffffff9128ffd2
#14 [ffffb99d4eacbee0] sys_write at ffffffff9129025c
#15 [ffffb99d4eacbf28] do_syscall_64 at ffffffff91003a39
#16 [ffffb99d4eacbf50] entry_SYSCALL_64_after_hwframe at ffffffff91a001b1
    RIP: 00007ff3bfc846fd  RSP: 00007ff3bf8a6cf0  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 000055e257ef0941  RCX: 00007ff3bfc846fd
    RDX: 0000000000000005  RSI: 000055e257ef0941  RDI: 0000000000000010
    RBP: 0000000000000010   R8: 0000000000000001   R9: 0000000000000000
    R10: 00000000ffffff01  R11: 0000000000000293  R12: 0000000000000001
    R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
crash>
***

While handling the "md" stop in the kernel, the code sequence for the above md
configuration is:

do_md_stop
+ __md_stop_writes
| + md_bitmap_flush
| | + md_bitmap_daemon_work
| | | + md_bitmap_wait_writes()
| | | |   (This wait is for bitmap writes that happened up until now, to avoid
| | | |    overlapping with the new bitmap writes.)
| | | |   (The wait flag is set to zero for async writes to the bitmap.)
| | | + write_page()
| | | |   (For an external bitmap, bitmap->storage.file is NULL, so this ends
| | | |    up calling write_sb_page().)
| | | | + write_sb_page()
| | | | | + md_super_write()
| | | | | |   (Since the wait flag is false, the bitmap write is submitted
| | | | | |    without waiting for it to complete.)
| | | | | | + mddev->pending_writes is incremented, then the I/O is submitted
| | | | | |   and the call returns without waiting.
| | + md_bitmap_update_sb()
| | |   (This call simply returns because "bitmap->mddev->bitmap_info.external"
| | |    is true for external bitmaps.)
| + md_update_sb()
| |   (This is not called because the call is conditional and the condition
| |    evaluates to false in my setup; see the crash info below.)
+ __md_stop
| + md_bitmap_destroy
| | + md_bitmap_free
| | | + wait_event(bitmap->write_wait,
| | | |            atomic_read(&bitmap->pending_writes) == 0);
| | | |   (bitmap->pending_writes is zero, but mddev->pending_writes is 1,
| | | |    so this wait returns immediately.)
| md detachment continues while there is a pending mddev I/O, causing a NULL
| pointer dereference when the I/O callback, super_written, is called.

***
crash> struct mddev.bitmap_info.external 0xffff8ffb62f13800
  bitmap_info.external = 1,
crash> struct mddev.in_sync 0xffff8ffb62f13800
  in_sync = 1
crash> struct mddev.sb_flags 0xffff8ffb62f13800
  sb_flags = 0
crash> struct mddev.ro 0xffff8ffb62f13800
  ro = 0
crash> struct mddev.cluster_info 0xffff8ffb62f13800
  cluster_info = 0x0
crash>
***

Please review and let me know your thoughts.
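To make the gap concrete, the extra wait that appears to be missing on this
path would look roughly like the sketch below. This is only an illustration
of the idea, not the actual patch: the helper name and its placement are
hypothetical, and it simply reuses the existing md_super_wait() helper, which
sleeps on mddev->sb_wait until mddev->pending_writes drops to zero.

***
/* Hypothetical helper for illustration only, not the actual patch. */
static void md_bitmap_quiesce_writes(struct mddev *mddev)
{
	struct bitmap *bitmap = mddev->bitmap;

	if (bitmap) {
		/* existing wait in md_bitmap_free(): bitmap-owned pages only */
		wait_event(bitmap->write_wait,
			   atomic_read(&bitmap->pending_writes) == 0);
	}

	/*
	 * Additional wait: for an external bitmap, write_sb_page() ->
	 * md_super_write() bumps mddev->pending_writes, which the wait
	 * above does not cover. md_super_wait() waits for those bios,
	 * so super_written() cannot run after the rdevs are unbound.
	 */
	md_super_wait(mddev);
}
***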
Thanks,
Sudhakar

>
> Thanks,
> Heming