On Sun, Apr 11, 2021 at 2:04 AM heming.zhao@xxxxxxxx <heming.zhao@xxxxxxxx> wrote:
>
> On 4/11/21 5:35 AM, Sudhakar Panneerselvam wrote:
> >> Hello Sudhakar,
> >>
> >> First, let's discuss the master branch kernel.
> >>
> >> What command or action stands for "tear down"?
> >> From your description, it looks very much like ioctl STOP_ARRAY.
> >> Your crash was related to super_written, which is the callback for
> >> updating the array sb, not the bitmap sb. In md_update_sb() there is a sync
> >> point, md_super_wait(), which guarantees all sb bios have finished successfully.
> >>
> >> For your patch, did you check md_bitmap_free, which already does your patch's job?
> >>
> >> The call flow:
> >> ```
> >> do_md_stop //by STOP_ARRAY
> >>  + __md_stop_writes()
> >>  |  md_bitmap_flush
> >>  |  md_update_sb
> >>  |  + md_super_write
> >>  |  |  bio->bi_end_io = super_written
> >>  |  + md_super_wait(mddev) //wait for all bios done
> >>  + __md_stop(mddev)
> >>  |  md_bitmap_destroy(mddev);
> >>  |  + md_bitmap_free //see below
> >>  + ...
> >>
> >> md_bitmap_free
> >> {
> >>     ...
> >>     //do your patch job.
> >>     /* Shouldn't be needed - but just in case.... */
> >>     wait_event(bitmap->write_wait,
> >>                atomic_read(&bitmap->pending_writes) == 0);
> >>     ...
> >> }
> >> ```
> >>
> >> Would you share more analysis or test results for your patch?
> >
> > Hello Heming,
> >
> > Please find more details about our system configuration and the detailed
> > sequence that led to the crash.
> >
> > We have a RAID1 configured on an external imsm container. /proc/mdstat output
> > looks like this:
> >
> > ***
> > md24 : active raid1 sdb[1] sda[0]
> >       141557760 blocks super external:/md127/0 [2/2] [UU]
> >       bitmap: 1/2 pages [4KB], 65536KB chunk
> >
> > md127 : inactive sdb[1](S) sda[0](S)
> >       10402 blocks super external:imsm
> > ***
> >
> > All our system partitions are laid out on top of the above RAID1 device as
> > shown below.
> >
> > ***
> > /dev/md24p5 on / type xfs (rw,relatime,attr2,inode64,noquota)
> > /dev/md24p1 on /boot type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p15 on /home type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p12 on /var type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p16 on /tmp type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p11 on /var/log type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > /dev/md24p14 on /var/log/audit type xfs (rw,nodev,relatime,attr2,inode64,noquota)
> > ***
> >
> > In such a configuration, the kernel panic described in this patch occurs
> > during the "shutdown" of the server. Since the system partitions are laid out on
> > the RAID1, there can be write I/Os going to the RAID device from the system
> > log and filesystem activity that typically occurs during shutdown.
> >
> > From the core dump, I see that, after the filesystems are unmounted and all
> > other userspace processes are killed, the "mdmon" thread initiates the "stop"
> > on the md device.
> >
> > ***
> > COMMAND: "mdmon"
> >    TASK: ffff8ff2b1663c80  [THREAD_INFO: ffff8ff2b1663c80]
> >     CPU: 30
> >   STATE: TASK_UNINTERRUPTIBLE
> > crash> bt
> > PID: 7390   TASK: ffff8ff2b1663c80  CPU: 30  COMMAND: "mdmon"
> >  #0 [ffffb99d4eacba30] __schedule at ffffffff91884acc
> >  #1 [ffffb99d4eacbad0] schedule at ffffffff918850e6
> >  #2 [ffffb99d4eacbae8] schedule_timeout at ffffffff91889616
> >  #3 [ffffb99d4eacbb78] wait_for_completion at ffffffff91885d1b
> >  #4 [ffffb99d4eacbbe0] __wait_rcu_gp at ffffffff9110c123
> >  #5 [ffffb99d4eacbc28] synchronize_sched at ffffffff9111008e
> >  #6 [ffffb99d4eacbc78] unbind_rdev_from_array at ffffffff9169640f
> >  #7 [ffffb99d4eacbcc0] do_md_stop at ffffffff9169f163
> >  #8 [ffffb99d4eacbd58] array_state_store at ffffffff9169f644
> >  #9 [ffffb99d4eacbd90] md_attr_store at ffffffff9169b71e
> > #10 [ffffb99d4eacbdc8] sysfs_kf_write at ffffffff91320edf
> > #11 [ffffb99d4eacbdd8] kernfs_fop_write at ffffffff913203c4
> > #12 [ffffb99d4eacbe18] __vfs_write at ffffffff9128fcfa
> > #13 [ffffb99d4eacbea0] vfs_write at ffffffff9128ffd2
> > #14 [ffffb99d4eacbee0] sys_write at ffffffff9129025c
> > #15 [ffffb99d4eacbf28] do_syscall_64 at ffffffff91003a39
> > #16 [ffffb99d4eacbf50] entry_SYSCALL_64_after_hwframe at ffffffff91a001b1
> >     RIP: 00007ff3bfc846fd  RSP: 00007ff3bf8a6cf0  RFLAGS: 00000293
> >     RAX: ffffffffffffffda  RBX: 000055e257ef0941  RCX: 00007ff3bfc846fd
> >     RDX: 0000000000000005  RSI: 000055e257ef0941  RDI: 0000000000000010
> >     RBP: 0000000000000010   R8: 0000000000000001   R9: 0000000000000000
> >     R10: 00000000ffffff01  R11: 0000000000000293  R12: 0000000000000001
> >     R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000000
> >     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
> > crash>
> > ***
> >
> > While handling the "md" stop in the kernel, the code sequence based on the
> > above md configuration is:
> >
> > do_md_stop
> >  + __md_stop_writes
> >  |  + md_bitmap_flush
> >  |  |  + md_bitmap_daemon_work
> >  |  |  |  + md_bitmap_wait_writes()
> >  |  |  |  |  (This wait is for bitmap writes that happened up until now and to avoid
> >  |  |  |  |   overlapping with the new bitmap writes.)
> >  |  |  |  |  (The wait flag is set to zero for async writes to the bitmap.)
> >  |  |  |  + write_page()
> >  |  |  |  |  |  (For an external bitmap, bitmap->storage.file is NULL, hence this ends up
> >  |  |  |  |  |   calling write_sb_page().)
> >  |  |  |  |  + write_sb_page()
> >  |  |  |  |  |  + md_super_write()
> >  |  |  |  |  |  |  (Since the wait flag is false, the bitmap write is submitted
> >  |  |  |  |  |  |   without waiting for it to complete.)
> >  |  |  |  |  |  |  + mddev->pending_writes is incremented, then the IO is submitted
> >  |  |  |  |  |  |    and the call returns without waiting.
> >  |  |  + md_bitmap_update_sb() - (This call simply returns because
> >  |  |  |  |  "bitmap->mddev->bitmap_info.external" is true for external bitmaps.)
> >  |  + md_update_sb() - (This won't be called, as the call is conditional and the
> >  |  |  |  |  |  |  condition evaluates to false in my setup (see below for crash info).)
> >  + __md_stop
> >  |  + md_bitmap_destroy
> >  |  |  + md_bitmap_free
> >  |  |  |  + wait_event(bitmap->write_wait,
> >  |  |  |  |             atomic_read(&bitmap->pending_writes) == 0);
> >  |  |  |  |  bitmap->pending_writes is zero, but mddev->pending_writes is 1,
> >  |  |  |  |  so this call returns immediately.
> >  |  md detachment continues while there is pending mddev I/O, causing a NULL pointer
> >  |  dereference when the I/O callback, super_written, is called.
> >
> > ***
> > crash> struct mddev.bitmap_info.external 0xffff8ffb62f13800
> >   bitmap_info.external = 1,
> > crash> struct mddev.in_sync 0xffff8ffb62f13800
> >   in_sync = 1
> > crash> struct mddev.sb_flags 0xffff8ffb62f13800
> >   sb_flags = 0
> > crash> struct mddev.ro 0xffff8ffb62f13800
> >   ro = 0
> > crash> struct mddev.cluster_info 0xffff8ffb62f13800
> >   cluster_info = 0x0
> > crash>
> > ***
> >
> > Please review and let me know your thoughts.
> >
>
> Hello
>
> In my previous mail, I was wrong to assume that the array's mddev->bitmap_info.external
> is 0, so my analysis did not suit your case.
>
> Now we understand your env, which involves IMSM & an external bitmap, and I have only a
> little knowledge of this field. We may need to add more people to the cc list.
>
> Just from the code logic, your analysis is reasonable.
> The key to the crash is why the md layer didn't do the waiting job for the bitmap bios.
> There are wait points in md_bitmap_daemon_work, write_sb_page, md_bitmap_update_sb &
> md_update_sb, but in your scenario none of them took effect. In the end md_bitmap_free
> does an additional wait on bitmap->pending_writes but not mddev->pending_writes, and the
> crash is triggered.
>
> In my opinion, using a dedicated wait is clearer than calling the general
> md_bitmap_wait_writes(); md_bitmap_wait_writes() makes people feel the bitmap module is
> doing a repetitive cleanup job.
>
> My idea is something like:
> ```
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 200c5d0f08bf..ea6fa5a2cb6b 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -1723,6 +1723,8 @@ void md_bitmap_flush(struct mddev *mddev)
>         bitmap->daemon_lastrun -= sleep;
>         md_bitmap_daemon_work(mddev);
>         md_bitmap_update_sb(bitmap);
> +       if (mddev->bitmap_info.external)
> +               md_super_wait(mddev);
>  }
>
>  /*
> @@ -1746,6 +1748,7 @@ void md_bitmap_free(struct bitmap *bitmap)
>         /* Shouldn't be needed - but just in case.... */
>         wait_event(bitmap->write_wait,
>                    atomic_read(&bitmap->pending_writes) == 0);
> +       wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
>
>         /* release the bitmap file */
>         md_bitmap_file_unmap(&bitmap->storage);
> ```

I like Heming's version better.

Hi Sudhakar,

Are you able to reproduce the issue? If so, could you please verify that
Heming's change fixes the issue?

Thanks,
Song
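
For reference, a minimal sketch of the submit/complete accounting that both the crash
analysis and the proposed fix rely on. It is abbreviated from the md.c helpers named in
the thread (bio setup, error handling and rdev reference counting are omitted), so treat
it as illustrative rather than verbatim kernel code:

```
/*
 * md_super_write() only increments mddev->pending_writes before submitting
 * the bio; the matching decrement happens asynchronously in the completion
 * callback, so the counter can only be drained by waiting on mddev->sb_wait,
 * which is what md_super_wait() does.
 */
static void super_written(struct bio *bio)
{
        struct md_rdev *rdev = bio->bi_private;
        struct mddev *mddev = rdev->mddev;      /* per-array state; may already be torn down */

        if (atomic_dec_and_test(&mddev->pending_writes))
                wake_up(&mddev->sb_wait);
        bio_put(bio);
}

void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
                    sector_t sector, int size, struct page *page)
{
        struct bio *bio = bio_alloc(GFP_NOIO, 1);       /* allocation/setup simplified */

        bio->bi_private = rdev;
        bio->bi_end_io = super_written;
        /* ... point the bio at rdev's superblock/bitmap area and add the page ... */

        atomic_inc(&mddev->pending_writes);
        submit_bio(bio);        /* returns immediately; completion is asynchronous */
}

void md_super_wait(struct mddev *mddev)
{
        /* drain every write submitted through md_super_write() */
        wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes) == 0);
}
```

In the external-bitmap shutdown sequence above, the last bitmap page goes through
write_sb_page()/md_super_write() with the wait flag cleared, and none of the later steps
call md_super_wait(), so do_md_stop() can proceed with detaching the array while that bio
is still in flight; when super_written() finally runs, it touches state the stop path has
already torn down, hence the NULL pointer dereference. Heming's proposal closes exactly
that window, either by calling md_super_wait() from md_bitmap_flush() for external
bitmaps or by waiting in md_bitmap_free() on the same mddev->pending_writes == 0
condition that md_super_wait() checks.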