On Wed, Aug 22 2018, Jinpu Wang wrote:

> My reply was still too fast. My colleague also triggered the hung task
> by directly running IO on multiple raid5 arrays.
> It's upstream 4.15.7:
>
> [ 617.690530] INFO: task fio:6440 blocked for more than 120 seconds.
> [ 617.690706] Tainted: G O 4.15.7-1-storage #1
> [ 617.690864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 617.691153] fio D 0 6440 6369 0x00000000
> [ 617.691310] Call Trace:
> [ 617.691469]  ? __schedule+0x2ac/0x7e0
> [ 617.691630]  schedule+0x32/0x80
> [ 617.691811]  raid5_make_request+0x1c3/0xab0 [raid456]
> [ 617.691969]  ? wait_woken+0x90/0x90
> [ 617.692120]  md_handle_request+0xa4/0x110
> [ 617.692270]  md_make_request+0x64/0x160
> [ 617.692421]  generic_make_request+0x10d/0x2d0
> [ 617.692573]  ? submit_bio+0x5c/0x120
> [ 617.692722]  submit_bio+0x5c/0x120
> [ 617.692871]  ? bio_iov_iter_get_pages+0xbf/0xf0
> [ 617.693049]  blkdev_direct_IO+0x394/0x3d0
> [ 617.693202]  ? generic_file_direct_write+0xc9/0x170
> [ 617.693355]  generic_file_direct_write+0xc9/0x170
> [ 617.693507]  __generic_file_write_iter+0xb6/0x1d0
> [ 617.693659]  blkdev_write_iter+0x98/0x110
> [ 617.693809]  ? aio_write+0xeb/0x140
> [ 617.693958]  aio_write+0xeb/0x140
> [ 617.694107]  ? _cond_resched+0x15/0x30
> [ 617.694284]  ? mutex_lock+0xe/0x30
> [ 617.694433]  ? _copy_to_user+0x22/0x30
> [ 617.694581]  ? aio_read_events+0x2ea/0x320
> [ 617.694731]  ? do_io_submit+0x1f3/0x680
> [ 617.694881]  ? do_io_submit+0x1f3/0x680
> [ 617.695032]  ? do_io_submit+0x37b/0x680
> [ 617.695180]  do_io_submit+0x37b/0x680
> [ 617.695330]  ? do_syscall_64+0x5a/0x120
> [ 617.695509]  do_syscall_64+0x5a/0x120
> [ 617.695666]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [ 617.695823] RIP: 0033:0x7f6362428737
> [ 617.695972] RSP: 002b:00007ffe5daeb808 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
> [ 617.696217] RAX: ffffffffffffffda RBX: 00000000016be080 RCX: 00007f6362428737
> [ 617.696376] RDX: 0000000001ed94e8 RSI: 0000000000000067 RDI: 00007f6352a68000
> [ 617.696534] RBP: 00000000000000c8 R08: 0000000000000067 R09: 00000000016c2760
> [ 617.696716] R10: 0000000001804000 R11: 0000000000000246 R12: 00007f63454b3350
> [ 617.696874] R13: 0000000001ed9830 R14: 0000000000000000 R15: 00007f63454c0808
>
> raid5_make_request+0x1c3 is sleeping at the following code path:
>
>	if (test_bit(STRIPE_EXPANDING, &sh->state) ||
>	    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
>		/* Stripe is busy expanding or
>		 * add failed due to overlap. Flush everything
>		 * and wait a while
>		 */
>		md_wakeup_thread(mddev->thread);
>		raid5_release_stripe(sh);
>		schedule();
>		do_prepare = true;
>		goto retry;
>	}
>
> It looks like no one is scheduling it back.
> No reshape, just 60+ freshly created raid5 devices. Pretty easy/fast
> to reproduce.

Presumably it is an overlap, so R5_Overlap should be set.
do_prepare is set, so prepare_to_wait() should have been called on
wait_for_overlap.
So maybe some code path isn't checking R5_Overlap and so isn't doing
the wakeup.

NeilBrown

>
> Is this a known bug? Even better if you can point me to the fix.
>
> Thanks,
> Jack
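
For reference, the sleep/wake handshake NeilBrown describes has two
halves. The sketch below is condensed and paraphrased from 4.15-era
raid5.c, not a verbatim excerpt; it shows why a completion path that
skips the R5_Overlap test leaves the submitter asleep forever:

	/* Waiter side, in raid5_make_request(). The wait entry is
	 * armed before the overlap check, so a wakeup racing with
	 * schedule() is not lost.
	 */
	DEFINE_WAIT(w);
	...
retry:
	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
	...
	if (test_bit(STRIPE_EXPANDING, &sh->state) ||
	    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
		/* add_stripe_bio() set R5_Overlap on the stripe's dev
		 * before returning 0, recording that a waiter exists. */
		md_wakeup_thread(mddev->thread);
		raid5_release_stripe(sh);
		schedule();	/* raid5_make_request+0x1c3 in the trace */
		do_prepare = true;
		goto retry;
	}

	/* Waker side: each path that completes or fails the conflicting
	 * request is expected to pair the flag with a wakeup, e.g. as
	 * handle_stripe_clean_event() does per device:
	 */
	if (test_and_clear_bit(R5_Overlap, &dev->flags))
		wake_up(&conf->wait_for_overlap);

If any path retires the overlapping request without that
test_and_clear_bit()/wake_up() pair, the fio thread stays blocked in
D state, which matches the hung-task report above.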