On Thu, Aug 16, 2018 at 7:42 AM, NeilBrown <neilb@xxxxxxxx> wrote:
>
> On Wed, Aug 15 2018, Jinpu Wang wrote:
>
> > On Tue, Aug 14, 2018 at 12:43 PM, Jack Wang <jack.wang.usish@xxxxxxxxx> wrote:
> >>
> >> On Tue, Aug 14, 2018 at 10:53 AM, NeilBrown <neilb@xxxxxxxx> wrote:
> >> >
> >> > On Tue, Aug 14 2018, Jinpu Wang wrote:
> >> >
> >> > > On Tue, Aug 14, 2018 at 1:31 AM, NeilBrown <neilb@xxxxxxxx> wrote:
> >> > >>
> >> > >> On Mon, Aug 13 2018, David C. Rankin wrote:
> >> > >>
> >> > >> > On 08/11/2018 02:06 AM, NeilBrown wrote:
> >> > >> >> It might be expected behaviour with async direct IO.
> >> > >> >> Two threads writing with O_DIRECT io to the same address could result in
> >> > >> >> different data on the two devices.  This doesn't seem to me to be a
> >> > >> >> credible use-case though.  Why would you ever want to do that in
> >> > >> >> practice?
> >> > >> >>
> >> > >> >> NeilBrown
> >> > >> >
> >> > >> > My only thought is that while the credible case may be weak, if it is
> >> > >> > something that can be protected against with a few conditionals to keep
> >> > >> > the data on the slaves from diverging -- then it's worth a couple of
> >> > >> > conditions to prevent the nut who knows just enough about dd from
> >> > >> > confusing things....
> >> > >>
> >> > >> Yes, it can be protected against - the code is already written.
> >> > >> If you have a 2-drive raid1 and want it to be safe against this attack,
> >> > >> simply:
> >> > >>
> >> > >>   mdadm /dev/md127 --grow --level=raid5
> >> > >>
> >> > >> This will add the required synchronization between writes so that
> >> > >> multiple writes to the one block are linearized.  There will be a
> >> > >> performance impact.
> >> > >>
> >> > >> NeilBrown
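[ Aside: the race Neil describes above is easy to demonstrate by hand.
A minimal sketch of such a racy writer follows -- /dev/md127, the fixed
offset and the 4 KiB block size are my assumptions; our real load is
generated by fio.  Build with -pthread and run it against a raid1
array; a subsequent "check" of the array can then report mismatches
even though nothing is actually broken. ]

/*
 * Two threads race O_DIRECT writes of different patterns to the same
 * block of an md mirror.  Each write is sent to every leg separately,
 * and the two in-flight writes can reach the legs in a different
 * order, leaving the legs with different data for that block.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096

static int fd;

static void *writer(void *arg)
{
	void *buf;
	int i;

	if (posix_memalign(&buf, BLKSZ, BLKSZ))  /* O_DIRECT needs alignment */
		return NULL;
	memset(buf, (int)(long)arg, BLKSZ);
	for (i = 0; i < 100000; i++)
		if (pwrite(fd, buf, BLKSZ, 0) != BLKSZ)  /* same block every time */
			break;
	free(buf);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	fd = open("/dev/md127", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;
	pthread_create(&t1, NULL, writer, (void *)0xaaL);
	pthread_create(&t2, NULL, writer, (void *)0x55L);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	close(fd);
	return 0;
}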
> >> > > Thanks for your comments, Neil.
> >> > > Converting to raid5 with 2 drives will not only cause a performance drop,
> >> > > it will also disable the redundancy.
> >> > > It's clearly a no-go.
> >> >
> >> > I don't understand why you think it would disable the redundancy; there
> >> > are still two copies of every block.  Both RAID1 and RAID5 can survive a
> >> > single device failure.
> >> I thought RAID5 requires at least 3 drives because of the parity; clearly,
> >> I was wrong. Sorry.
> >>
> >> I'm testing the script with raid5 to see if it works as expected.
> > I did test raid5 with 2 drives, and indeed no mismatch is found.
> > But instead I triggered the hung task below.
> > The kernel is the stock Debian 9 kernel; I also tried 4.17.0-0.bpo.1-amd64,
> > and it fails the same way.
> >
> > [64259.850401] md/raid:md127: raid level 5 active with 2 out of 2 devices, algorithm 2
> > [64259.850402] RAID conf printout:
> > [64259.850404]  --- level:5 rd:2 wd:2
> > [64259.850405]  disk 0, o:1, dev:ram0
> > [64259.850407]  disk 1, o:1, dev:ram1
> > [64259.850425] md/raid456: discard support disabled due to uncertainty.
> > [64259.850427] Set raid456.devices_handle_discard_safely=Y to override.
> > [64259.850470] md127: detected capacity change from 0 to 1121976320
> > [64259.850513] md: md127 switched to read-write mode.
> > [64259.850668] md: resync of RAID array md127
> > [64259.850670] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> > [64259.850681] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
> > [64259.850713] md: using 128k window, over a total of 1095680k.
> > [64267.032621] md: md127: resync done.
> > [64267.036318] RAID conf printout:
> > [64267.036321]  --- level:5 rd:2 wd:2
> > [64267.036323]  disk 0, o:1, dev:ram0
> > [64267.036325]  disk 1, o:1, dev:ram1
> > [64270.122784] EXT4-fs (md127): mounted filesystem with ordered data mode. Opts: (null)
> > [64404.464954] INFO: task fio:5136 blocked for more than 120 seconds.
> > [64404.465035]       Not tainted 4.9.0-7-amd64 #1 Debian 4.9.110-1
> > [64404.465088] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [64404.465156] fio             D    0  5136   5134 0x00000000
> > [64404.465163]  ffff88a7e2457800 ffff88a7c860c000 ffff88a8192f5040 ffff88a836718980
> > [64404.465169]  ffff88a7c5bb8000 ffffad18016c3bd0 ffffffff8780fe79 ffff88a77ca18100
> > [64404.465174]  0000000000000001 ffff88a836718980 0000000000001000 ffff88a8192f5040
> > [64404.465180] Call Trace:
> > [64404.465191]  [<ffffffff8780fe79>] ? __schedule+0x239/0x6f0
> > [64404.465197]  [<ffffffff87810362>] ? schedule+0x32/0x80
> > [64404.465202]  [<ffffffff87813319>] ? rwsem_down_write_failed+0x1f9/0x360
> > [64404.465208]  [<ffffffff8753f033>] ? call_rwsem_down_write_failed+0x13/0x20
> > [64404.465213]  [<ffffffff878125c9>] ? down_write+0x29/0x40
> > [64404.465306]  [<ffffffffc068b1e0>] ? ext4_file_write_iter+0x50/0x370 [ext4]
>
> Looks like an ext4 problem, or possibly an aio problem.
> No evidence that it is RAID related.
> Presumably some other thread is holding the semaphore.  Finding that
> thread might help.
>
> NeilBrown

I replied too fast again.  My colleague also triggered a hung task,
running IO directly on multiple raid5 devices.  This is upstream 4.15.7:

[  617.690530] INFO: task fio:6440 blocked for more than 120 seconds.
[  617.690706]       Tainted: G           O     4.15.7-1-storage #1
[  617.690864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  617.691153] fio             D    0  6440   6369 0x00000000
[  617.691310] Call Trace:
[  617.691469]  ? __schedule+0x2ac/0x7e0
[  617.691630]  schedule+0x32/0x80
[  617.691811]  raid5_make_request+0x1c3/0xab0 [raid456]
[  617.691969]  ? wait_woken+0x90/0x90
[  617.692120]  md_handle_request+0xa4/0x110
[  617.692270]  md_make_request+0x64/0x160
[  617.692421]  generic_make_request+0x10d/0x2d0
[  617.692573]  ? submit_bio+0x5c/0x120
[  617.692722]  submit_bio+0x5c/0x120
[  617.692871]  ? bio_iov_iter_get_pages+0xbf/0xf0
[  617.693049]  blkdev_direct_IO+0x394/0x3d0
[  617.693202]  ? generic_file_direct_write+0xc9/0x170
[  617.693355]  generic_file_direct_write+0xc9/0x170
[  617.693507]  __generic_file_write_iter+0xb6/0x1d0
[  617.693659]  blkdev_write_iter+0x98/0x110
[  617.693809]  ? aio_write+0xeb/0x140
[  617.693958]  aio_write+0xeb/0x140
[  617.694107]  ? _cond_resched+0x15/0x30
[  617.694284]  ? mutex_lock+0xe/0x30
[  617.694433]  ? _copy_to_user+0x22/0x30
[  617.694581]  ? aio_read_events+0x2ea/0x320
[  617.694731]  ? do_io_submit+0x1f3/0x680
[  617.694881]  ? do_io_submit+0x1f3/0x680
[  617.695032]  ? do_io_submit+0x37b/0x680
[  617.695180]  do_io_submit+0x37b/0x680
[  617.695330]  ? do_syscall_64+0x5a/0x120
[  617.695509]  do_syscall_64+0x5a/0x120
[  617.695666]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  617.695823] RIP: 0033:0x7f6362428737
[  617.695972] RSP: 002b:00007ffe5daeb808 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[  617.696217] RAX: ffffffffffffffda RBX: 00000000016be080 RCX: 00007f6362428737
[  617.696376] RDX: 0000000001ed94e8 RSI: 0000000000000067 RDI: 00007f6352a68000
[  617.696534] RBP: 00000000000000c8 R08: 0000000000000067 R09: 00000000016c2760
[  617.696716] R10: 0000000001804000 R11: 0000000000000246 R12: 00007f63454b3350
[  617.696874] R13: 0000000001ed9830 R14: 0000000000000000 R15: 00007f63454c0808

raid5_make_request+0x1c3 is sleeping in the following code path:

		if (test_bit(STRIPE_EXPANDING, &sh->state) ||
		    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
			/* Stripe is busy expanding or
			 * add failed due to overlap.  Flush everything
			 * and wait a while.
			 */
			md_wakeup_thread(mddev->thread);
			raid5_release_stripe(sh);
			schedule();
			do_prepare = true;
			goto retry;
		}

It looks like no one ever schedules the task back in.
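If I read raid5.c correctly, the task parked in that schedule() is only
woken when somebody clears the overlapping bio and calls
wake_up(&conf->wait_for_overlap).  For my own understanding I reduced
the idiom to the userspace analogue below (pthreads, illustration only;
the kernel avoids the check/sleep race differently, by setting the task
state in prepare_to_wait() before the final condition check).  Names
like overlap_cleared are mine, not the kernel's.  Build with -pthread.

/*
 * Userspace analogue of the sleep/wake idiom: a waiter blocks until a
 * waker both updates the condition and signals it.  If the signal side
 * is never reached -- the situation I suspect above -- the waiter never
 * runs again, i.e. it sits in the equivalent of D state forever.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int overlap_cleared;

static void *writer(void *unused)
{
	pthread_mutex_lock(&lock);
	while (!overlap_cleared)               /* re-check after every wakeup */
		pthread_cond_wait(&cond, &lock);   /* the schedule() analogue */
	pthread_mutex_unlock(&lock);
	printf("writer resumed\n");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, writer, NULL);
	sleep(1);

	pthread_mutex_lock(&lock);
	overlap_cleared = 1;                /* comment out these three lines  */
	pthread_cond_signal(&cond);         /* and the writer hangs forever - */
	pthread_mutex_unlock(&lock);        /* the wake_up() analogue         */

	pthread_join(t, NULL);
	return 0;
}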
There is no reshape going on, just 60+ freshly created raid5 devices.
It is pretty easy/fast to reproduce.

Is this a known bug?  Even better if you can point me to the fix.

Thanks,
Jack
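P.S. In case it helps with reproducing: the trace above is the plain
io_submit() path (aio_write -> blkdev_direct_IO -> raid5_make_request),
so a submitter along the lines of the sketch below, pointed at each of
the raid5 devices, should exercise the same code.  /dev/md127, the
queue depth and the block size are assumptions on my part; our actual
load is generated by fio.  Build with -laio.

/*
 * Keep a fixed queue depth of O_DIRECT async writes against the raw
 * md device, forever; if a request gets stuck in raid5_make_request,
 * the hung-task watchdog fires after 120 seconds.  Sketch only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEPTH 64
#define BLKSZ 4096

int main(void)
{
	struct io_event events[DEPTH];
	struct iocb cbs[DEPTH], *cbp[DEPTH];
	io_context_t ctx = 0;
	void *buf;
	int fd, i;

	fd = open("/dev/md127", O_WRONLY | O_DIRECT);
	if (fd < 0 || io_setup(DEPTH, &ctx) < 0)
		return 1;
	if (posix_memalign(&buf, BLKSZ, BLKSZ))
		return 1;
	memset(buf, 0x5a, BLKSZ);

	for (;;) {
		for (i = 0; i < DEPTH; i++) {
			/* overlapping offsets on purpose, as in the mismatch test */
			io_prep_pwrite(&cbs[i], fd, buf, BLKSZ, (i % 8) * BLKSZ);
			cbp[i] = &cbs[i];
		}
		if (io_submit(ctx, DEPTH, cbp) != DEPTH)
			break;
		if (io_getevents(ctx, DEPTH, DEPTH, events, NULL) < 0)
			break;
	}
	io_destroy(ctx);
	close(fd);
	return 0;
}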