On Sun, Apr 11, 2021 at 2:04 AM heming.zhao@xxxxxxxx <heming.zhao@xxxxxxxx> wrote:
>
> On 4/11/21 5:35 AM, Sudhakar Panneerselvam wrote:
> >> Hello Sudhakar,
> >>
> >> First, let's discuss the master branch kernel.
> >>
> >> What command or action stands for "tear down"?
> >> From your description, it looks very much like ioctl STOP_ARRAY.
> >> Your crash was related to super_written, which is the callback for
> >> updating the array sb, not the bitmap sb. In md_update_sb() there is a sync
> >> point, md_super_wait(), which guarantees all sb bios have finished successfully.
> >>
> >> For your patch, did you check md_bitmap_free, which already does your patch's job?
> >>
> >> The call flow:
> >> ```
> >> do_md_stop //by STOP_ARRAY
> >>  + __md_stop_writes()
> >>  |  md_bitmap_flush
> >>  |  md_update_sb
> >>  |  + md_super_write
> >>  |  |  bio->bi_end_io = super_written
> >>  |  + md_super_wait(mddev) //wait for all bios done
> >>  + __md_stop(mddev)
> >>  |  md_bitmap_destroy(mddev);
> >>  |  + md_bitmap_free //see below
> >>  + ...
> >>
> >> md_bitmap_free
> >> {
> >>     ...
> >>     //do your patch job.
> >>     /* Shouldn't be needed - but just in case.... */
> >>     wait_event(bitmap->write_wait,
> >>                atomic_read(&bitmap->pending_writes) == 0);
> >>     ...
> >> }
> >> ```
> >>
> >> Would you share more analysis or test results for your patch?
> >
> > Hello Heming,
> >
> > Please find more details about our system configuration and the detailed
> > sequence that led to the crash.
> >
> > We have a RAID1 configured on an external imsm container. /proc/mdstat output
> > looks like this:
> >
> > ***
> > md24 : active raid1 sdb[1] sda[0]
> >       141557760 blocks super external:/md127/0 [2/2] [UU]
> >       bitmap: 1/2 pages [4KB], 65536KB chunk
> >
> > md127 : inactive sdb[1](S) sda[0](S)
> >       10402 blocks super external:imsm
> > ***
> >
> > All our system partitions are laid out on top of the above RAID1 device as
> > shown below.
> >
> > ***
> > /dev/md24p5 on / type xfs (rw,relatime,attr2,inode64,noquota)
> > /dev/md24p1 on /boot type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p15 on /home type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p12 on /var type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p16 on /tmp type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p11 on /var/log type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p14 on /var/log/audit type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > ***
> >
> > In such a configuration, the kernel panic described in this patch occurs
> > during the "shutdown" of the server. Since the system partitions are laid out on
> > the RAID1, there can be write I/Os going to the RAID device from the system
> > log and filesystem activity that typically occurs during shutdown.
> >
> > From the core dump, I see that, after the filesystems are unmounted and all
> > other userspace processes are killed, the "mdmon" thread initiates the "stop"
> > on the md device.
> >
> > ***
> > COMMAND: "mdmon"
> >    TASK: ffff8ff2b1663c80  [THREAD_INFO: ffff8ff2b1663c80]
> >     CPU: 30
> >   STATE: TASK_UNINTERRUPTIBLE
> > crash> bt
> > PID: 7390   TASK: ffff8ff2b1663c80  CPU: 30  COMMAND: "mdmon"
> >  #0 [ffffb99d4eacba30] __schedule at ffffffff91884acc
> >  #1 [ffffb99d4eacbad0] schedule at ffffffff918850e6
> >  #2 [ffffb99d4eacbae8] schedule_timeout at ffffffff91889616
> >  #3 [ffffb99d4eacbb78] wait_for_completion at ffffffff91885d1b
> >  #4 [ffffb99d4eacbbe0] __wait_rcu_gp at ffffffff9110c123
> >  #5 [ffffb99d4eacbc28] synchronize_sched at ffffffff9111008e
> >  #6 [ffffb99d4eacbc78] unbind_rdev_from_array at ffffffff9169640f
> >  #7 [ffffb99d4eacbcc0] do_md_stop at ffffffff9169f163
> >  #8 [ffffb99d4eacbd58] array_state_store at ffffffff9169f644
> >  #9 [ffffb99d4eacbd90] md_attr_store at ffffffff9169b71e
> > #10 [ffffb99d4eacbdc8] sysfs_kf_write at ffffffff91320edf
> > #11 [ffffb99d4eacbdd8] kernfs_fop_write at ffffffff913203c4
> > #12 [ffffb99d4eacbe18] __vfs_write at ffffffff9128fcfa
> > #13 [ffffb99d4eacbea0] vfs_write at ffffffff9128ffd2
> > #14 [ffffb99d4eacbee0] sys_write at ffffffff9129025c
> > #15 [ffffb99d4eacbf28] do_syscall_64 at ffffffff91003a39
> > #16 [ffffb99d4eacbf50] entry_SYSCALL_64_after_hwframe at ffffffff91a001b1
> >     RIP: 00007ff3bfc846fd  RSP: 00007ff3bf8a6cf0  RFLAGS: 00000293
> >     RAX: ffffffffffffffda  RBX: 000055e257ef0941  RCX: 00007ff3bfc846fd
> >     RDX: 0000000000000005  RSI: 000055e257ef0941  RDI: 0000000000000010
> >     RBP: 0000000000000010   R8: 0000000000000001   R9: 0000000000000000
> >     R10: 00000000ffffff01  R11: 0000000000000293  R12: 0000000000000001
> >     R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
> >     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
> > crash>
> > ***
> >
> > While handling the "md" stop in the kernel, the code sequence based on the
> > above md configuration is:
> >
> > do_md_stop
> >  + __md_stop_writes
> >  |  + md_bitmap_flush
> >  |  |  + md_bitmap_daemon_work
> >  |  |  |  + md_bitmap_wait_writes()
> >  |  |  |  |  (This wait is for bitmap writes that happened up until now and to avoid
> >  |  |  |  |   overlapping with the new bitmap writes.)
> >  |  |  |  |  (The wait flag is set to zero for async writes to the bitmap.)
> >  |  |  |  + write_page()
> >  |  |  |  |  |  (For an external bitmap, bitmap->storage.file is NULL, hence this ends up
> >  |  |  |  |  |   calling write_sb_page().)
> >  |  |  |  |  + write_sb_page()
> >  |  |  |  |  |  + md_super_write()
> >  |  |  |  |  |  |  (Since the wait flag is false, the bitmap write is submitted
> >  |  |  |  |  |  |   without waiting for it to complete.)
> >  |  |  |  |  |  |  + mddev->pending_writes is incremented, then the IO is submitted
> >  |  |  |  |  |  |    and the call returns without waiting.
> >  |  |  + md_bitmap_update_sb() - (This call simply returns because
> >  |  |  |  |  "bitmap->mddev->bitmap_info.external" is true for external bitmaps.)
> >  |  + md_update_sb() - (This won't be called, as the call is conditional and the
> >  |  |  |  |  |  |  condition evaluates to false in my setup (see below for crash info).)
> >  + __md_stop
> >  |  + md_bitmap_destroy
> >  |  |  + md_bitmap_free
> >  |  |  |  + wait_event(bitmap->write_wait,
> >  |  |  |  |             atomic_read(&bitmap->pending_writes) == 0);
> >  |  |  |  |  bitmap->pending_writes is zero, but mddev->pending_writes is 1,
> >  |  |  |  |  so this call returns immediately.
> >  |  md detachment continues while there is pending mddev I/O, causing a NULL pointer
> >  |  dereference when the I/O callback, super_written, is called.
> >
> > ***
> > crash> struct mddev.bitmap_info.external 0xffff8ffb62f13800
> >   bitmap_info.external = 1,
> > crash> struct mddev.in_sync 0xffff8ffb62f13800
> >   in_sync = 1
> > crash> struct mddev.sb_flags 0xffff8ffb62f13800
> >   sb_flags = 0
> > crash> struct mddev.ro 0xffff8ffb62f13800
> >   ro = 0
> > crash> struct mddev.cluster_info 0xffff8ffb62f13800
> >   cluster_info = 0x0
> > crash>
> > ***
> >
> > Please review and let me know your thoughts.
> >
>
> Hello
>
> In my previous mail, I was wrong to assume that the array's mddev->bitmap_info.external
> is 0, so my analysis did not suit your case.
>
> Now we understand your env, which involves IMSM & an external bitmap, and I have only a
> little knowledge of this field. We may need to add more people to the cc list.
>
> Just from the code logic, your analysis is reasonable.
> The key to the crash is why the md layer didn't do the waiting job for the bitmap bios.
> There are wait points in md_bitmap_daemon_work, write_sb_page, md_bitmap_update_sb &
> md_update_sb, but in your scenario none of them took effect. In the end md_bitmap_free
> does an additional wait on bitmap->pending_writes but not mddev->pending_writes, and the
> crash is triggered.
>
> In my opinion, using a dedicated wait is clearer than calling the general
> md_bitmap_wait_writes(); md_bitmap_wait_writes() makes people feel the bitmap module is
> doing a repetitive cleanup job.
>
> My idea is something like:
> ```
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 200c5d0f08bf..ea6fa5a2cb6b 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -1723,6 +1723,8 @@ void md_bitmap_flush(struct mddev *mddev)
>         bitmap->daemon_lastrun -= sleep;
>         md_bitmap_daemon_work(mddev);
>         md_bitmap_update_sb(bitmap);
> +       if (mddev->bitmap_info.external)
> +               md_super_wait(mddev);
>  }
>
>  /*
> @@ -1746,6 +1748,7 @@ void md_bitmap_free(struct bitmap *bitmap)
>         /* Shouldn't be needed - but just in case.... */
>         wait_event(bitmap->write_wait,
>                    atomic_read(&bitmap->pending_writes) == 0);
> +       wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
>
>         /* release the bitmap file */
>         md_bitmap_file_unmap(&bitmap->storage);
> ```

I like Heming's version better.

Hi Sudhakar,

Are you able to reproduce the issue? If so, could you please verify that
Heming's change fixes the issue?

Thanks,
Song
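
For reference, a minimal sketch of the submit/complete accounting that both the crash
analysis and the proposed fix rely on. It is abbreviated from the md.c helpers named in
the thread (bio setup, error handling and rdev reference counting are omitted), so treat
it as illustrative rather than verbatim kernel code:

```
/*
 * md_super_write() only increments mddev->pending_writes before submitting
 * the bio; the matching decrement happens asynchronously in the completion
 * callback, so the counter can only be drained by waiting on mddev->sb_wait,
 * which is what md_super_wait() does.
 */
static void super_written(struct bio *bio)
{
        struct md_rdev *rdev = bio->bi_private;
        struct mddev *mddev = rdev->mddev;      /* per-array state; may already be torn down */

        if (atomic_dec_and_test(&mddev->pending_writes))
                wake_up(&mddev->sb_wait);
        bio_put(bio);
}

void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
                    sector_t sector, int size, struct page *page)
{
        struct bio *bio = bio_alloc(GFP_NOIO, 1);       /* allocation/setup simplified */

        bio->bi_private = rdev;
        bio->bi_end_io = super_written;
        /* ... point the bio at rdev's superblock/bitmap area and add the page ... */

        atomic_inc(&mddev->pending_writes);
        submit_bio(bio);        /* returns immediately; completion is asynchronous */
}

void md_super_wait(struct mddev *mddev)
{
        /* drain every write submitted through md_super_write() */
        wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes) == 0);
}
```

In the external-bitmap shutdown sequence above, the last bitmap page goes through
write_sb_page()/md_super_write() with the wait flag cleared, and none of the later steps
call md_super_wait(), so do_md_stop() can proceed with detaching the array while that bio
is still in flight; when super_written() finally runs, it touches state the stop path has
already torn down, hence the NULL pointer dereference. Heming's proposal closes exactly
that window, either by calling md_super_wait() from md_bitmap_flush() for external
bitmaps or by waiting in md_bitmap_free() on the same mddev->pending_writes == 0
condition that md_super_wait() checks.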