On Wed, Aug 22, 2018 at 10:51 PM NeilBrown <neilb@xxxxxxxx> wrote:
>
> On Wed, Aug 22 2018, Jinpu Wang wrote:
>
> > I replied too fast again: my colleague also triggered the hung task,
> > just by running IO directly on multiple raid5 arrays.
> > It's upstream 4.15.7:
> >
> > [ 617.690530] INFO: task fio:6440 blocked for more than 120 seconds.
> > [ 617.690706] Tainted: G O 4.15.7-1-storage #1
> > [ 617.690864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 617.691153] fio D 0 6440 6369 0x00000000
> > [ 617.691310] Call Trace:
> > [ 617.691469] ? __schedule+0x2ac/0x7e0
> > [ 617.691630] schedule+0x32/0x80
> > [ 617.691811] raid5_make_request+0x1c3/0xab0 [raid456]
> > [ 617.691969] ? wait_woken+0x90/0x90
> > [ 617.692120] md_handle_request+0xa4/0x110
> > [ 617.692270] md_make_request+0x64/0x160
> > [ 617.692421] generic_make_request+0x10d/0x2d0
> > [ 617.692573] ? submit_bio+0x5c/0x120
> > [ 617.692722] submit_bio+0x5c/0x120
> > [ 617.692871] ? bio_iov_iter_get_pages+0xbf/0xf0
> > [ 617.693049] blkdev_direct_IO+0x394/0x3d0
> > [ 617.693202] ? generic_file_direct_write+0xc9/0x170
> > [ 617.693355] generic_file_direct_write+0xc9/0x170
> > [ 617.693507] __generic_file_write_iter+0xb6/0x1d0
> > [ 617.693659] blkdev_write_iter+0x98/0x110
> > [ 617.693809] ? aio_write+0xeb/0x140
> > [ 617.693958] aio_write+0xeb/0x140
> > [ 617.694107] ? _cond_resched+0x15/0x30
> > [ 617.694284] ? mutex_lock+0xe/0x30
> > [ 617.694433] ? _copy_to_user+0x22/0x30
> > [ 617.694581] ? aio_read_events+0x2ea/0x320
> > [ 617.694731] ? do_io_submit+0x1f3/0x680
> > [ 617.694881] ? do_io_submit+0x1f3/0x680
> > [ 617.695032] ? do_io_submit+0x37b/0x680
> > [ 617.695180] do_io_submit+0x37b/0x680
> > [ 617.695330] ? do_syscall_64+0x5a/0x120
> > [ 617.695509] do_syscall_64+0x5a/0x120
> > [ 617.695666] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [ 617.695823] RIP: 0033:0x7f6362428737
> > [ 617.695972] RSP: 002b:00007ffe5daeb808 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
> > [ 617.696217] RAX: ffffffffffffffda RBX: 00000000016be080 RCX: 00007f6362428737
> > [ 617.696376] RDX: 0000000001ed94e8 RSI: 0000000000000067 RDI: 00007f6352a68000
> > [ 617.696534] RBP: 00000000000000c8 R08: 0000000000000067 R09: 00000000016c2760
> > [ 617.696716] R10: 0000000001804000 R11: 0000000000000246 R12: 00007f63454b3350
> > [ 617.696874] R13: 0000000001ed9830 R14: 0000000000000000 R15: 00007f63454c0808
> >
> > raid5_make_request+0x1c3 is sleeping at the following code path:
> >
> >                 if (test_bit(STRIPE_EXPANDING, &sh->state) ||
> >                     !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
> >                         /* Stripe is busy expanding or
> >                          * add failed due to overlap. Flush everything
> >                          * and wait a while
> >                          */
> >                         md_wakeup_thread(mddev->thread);
> >                         raid5_release_stripe(sh);
> >                         schedule();
> >                         do_prepare = true;
> >                         goto retry;
> >                 }
> >
> > It looks like no one ever wakes it back up.
> > No reshape involved, just freshly created raid5 devices (60+). Pretty
> > easy/fast to reproduce.
>
> Presumably it is an overlap, so R5_Overlap should be set.
> do_prepare is set, so prepare_to_wait() should have been called on
> wait_for_overlap.
>
> So maybe some code path isn't checking R5_Overlap and so isn't doing
> the wakeup.
>
> NeilBrown
>
> > Is this a known bug? Even better if you can point me to the fix.

Thanks Neil,

We applied 448ec638c6bc ("md/raid5: Assigning NULL to sh->batch_head
before testing bit R5_Overlap of a stripe"). With it we can no longer
trigger the IO hang. Still testing, but it looks promising.

Will report back if we still see the problem.

Regards,
Jack
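
For reference, the wait/wake pairing under discussion looks roughly as
follows. This is a simplified sketch of the v4.15-era drivers/md/raid5.c
logic, not the verbatim source: locking, error handling and several
arguments are omitted, and conf, sh, dev and w stand for the r5conf,
stripe_head, r5dev and wait-queue entry involved.

	/* Waiter side, simplified from raid5_make_request():
	 * park on conf->wait_for_overlap before retrying the stripe. */
	DEFINE_WAIT(w);
retry:
	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
	sh = raid5_get_active_stripe(conf, new_sector, previous, 0, 0);
	if (test_bit(STRIPE_EXPANDING, &sh->state) ||
	    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
		/* add_stripe_bio() saw an overlapping bio and set
		 * R5_Overlap on the device before failing, so a later
		 * wake_up(&conf->wait_for_overlap) is expected. */
		md_wakeup_thread(mddev->thread);
		raid5_release_stripe(sh);
		schedule();		/* sleeps until that wake_up */
		goto retry;
	}
	finish_wait(&conf->wait_for_overlap, &w);

	/* Waker side, the pattern the stripe-handling paths use once
	 * the conflicting work on the stripe completes or fails.  Any
	 * path that skips this check leaves the waiter asleep forever,
	 * which is the hang in the trace above. */
	if (test_and_clear_bit(R5_Overlap, &dev->flags))
		wake_up(&conf->wait_for_overlap);

Judging by its title alone, commit 448ec638c6bc makes sure sh->batch_head
is already NULL by the time R5_Overlap is tested, i.e. it repairs one
path where the wake-up half of this handshake was being skipped for
stripes that had been part of a batch.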