On Thu, 16 May 2013 17:07:04 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:

> Hello Neil,
> We are hitting an issue that looks very similar; we are on kernel 3.8.2.
> The array is a 2-device raid1 that experienced a drive failure; the drive
> was then removed and re-added to the array (although the rebuild never
> started). Relevant parts of the kernel log:
>
> May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Disk failure on dm-6, disabling device.
> May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Operation continuing on 1 devices.
> May 16 11:12:14 kernel: [46850.090511] md: super_written gets error=-5, uptodate=0
> May 16 11:12:14 kernel: [46850.090676] md/raid1:md2: dm-6: rescheduling sector 18040736
> May 16 11:12:14 kernel: [46850.090764] md/raid1:md2: dm-6: rescheduling sector 20367040
> May 16 11:12:14 kernel: [46850.090826] md/raid1:md2: dm-6: rescheduling sector 17613504
> May 16 11:12:14 kernel: [46850.090883] md/raid1:md2: dm-6: rescheduling sector 18042720
> May 16 11:12:15 kernel: [46850.229970] md/raid1:md2: redirecting sector 18040736 to other mirror: dm-13
> May 16 11:12:15 kernel: [46850.257687] md/raid1:md2: redirecting sector 20367040 to other mirror: dm-13
> May 16 11:12:15 kernel: [46850.268731] md/raid1:md2: redirecting sector 17613504 to other mirror: dm-13
> May 16 11:12:15 kernel: [46850.398242] md/raid1:md2: redirecting sector 18042720 to other mirror: dm-13
> May 16 11:12:23 kernel: [46858.448465] md: unbind<dm-6>
> May 16 11:12:23 kernel: [46858.456081] md: export_rdev(dm-6)
> May 16 11:23:19 kernel: [47515.062547] md: bind<dm-6>
>
> May 16 11:24:28 kernel: [47583.920126] INFO: task md2_raid1:15749 blocked for more than 60 seconds.
> May 16 11:24:28 kernel: [47583.921829] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 16 11:24:28 kernel: [47583.923361] md2_raid1 D 0000000000000001 0 15749 2 0x00000000
> May 16 11:24:28 kernel: [47583.923367] ffff880097c23c28 0000000000000046 ffff880000000002 00000000982c43b8
> May 16 11:24:28 kernel: [47583.923372] ffff880097c23fd8 ffff880097c23fd8 ffff880097c23fd8 0000000000013f40
> May 16 11:24:28 kernel: [47583.923376] ffff880119b11740 ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0
> May 16 11:24:28 kernel: [47583.923381] Call Trace:
> May 16 11:24:28 kernel: [47583.923395] [<ffffffff816eb399>] schedule+0x29/0x70
> May 16 11:24:28 kernel: [47583.923402] [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1]
> May 16 11:24:28 kernel: [47583.923410] [<ffffffff8107fcc0>] ? add_wait_queue+0x60/0x60
> May 16 11:24:28 kernel: [47583.923415] [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1]
> May 16 11:24:28 kernel: [47583.923421] [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220
> May 16 11:24:28 kernel: [47583.923426] [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530
> May 16 11:24:28 kernel: [47583.923429] [<ffffffff8156a165>] md_check_recovery+0x15/0x20
> May 16 11:24:28 kernel: [47583.923433] [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1]
> May 16 11:24:28 kernel: [47583.923439] [<ffffffff81045cd9>] ? default_spin_lock_flags+0x9/0x10
> May 16 11:24:28 kernel: [47583.923443] [<ffffffff81045cd9>] ? default_spin_lock_flags+0x9/0x10
> May 16 11:24:28 kernel: [47583.923449] [<ffffffff815624ed>] md_thread+0x10d/0x140
> May 16 11:24:28 kernel: [47583.923453] [<ffffffff8107fcc0>] ? add_wait_queue+0x60/0x60
> May 16 11:24:28 kernel: [47583.923457] [<ffffffff815623e0>] ? md_rdev_init+0x140/0x140
> May 16 11:24:28 kernel: [47583.923461] [<ffffffff8107f0d0>] kthread+0xc0/0xd0
> May 16 11:24:28 kernel: [47583.923465] [<ffffffff8107f010>] ? flush_kthread_worker+0xb0/0xb0
> May 16 11:24:28 kernel: [47583.923472] [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0
> May 16 11:24:28 kernel: [47583.923476] [<ffffffff8107f010>] ? flush_kthread_worker+0xb0/0xb0
>
> dm-13 is the good drive; dm-6 is the one that failed.
>
> At this point, we have several threads calling submit_bio, all stuck like this:
>
> cat /proc/16218/stack
> [<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1]
> [<ffffffffa0518627>] make_request+0x87/0xa90 [raid1]
> [<ffffffff81561ed0>] md_make_request+0xd0/0x200
> [<ffffffff8132bcaa>] generic_make_request+0xca/0x100
> [<ffffffff8132bd5b>] submit_bio+0x7b/0x160
> ...
>
> And the md raid1 thread is stuck like this:
>
> cat /proc/15749/stack
> [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1]
> [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1]
> [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220
> [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530
> [<ffffffff8156a165>] md_check_recovery+0x15/0x20
> [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1]
> [<ffffffff815624ed>] md_thread+0x10d/0x140
> [<ffffffff8107f0d0>] kthread+0xc0/0xd0
> [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> We also have two user-space threads stuck.
>
> One is trying to read /sys/block/md2/md/array_state, and its kernel stack is:
> # cat /proc/2251/stack
> [<ffffffff81564602>] md_attr_show+0x72/0xf0
> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The other wants to read from /proc/mdstat and is in:
> [<ffffffff81563d2b>] md_seq_show+0x4b/0x540
> [<ffffffff811bdd1b>] seq_read+0x16b/0x400
> [<ffffffff811ff572>] proc_reg_read+0x82/0xc0
> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> mdadm --detail also gets stuck if attempted, in a stack like this:
> cat /proc/2864/stack
> [<ffffffff81564602>] md_attr_show+0x72/0xf0
> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Might your patch https://patchwork.kernel.org/patch/2260051/ fix this issue?

Probably.

> Is this patch alone applicable to kernel 3.8.2?

Probably.

> Can you please kindly comment on this.

NeilBrown
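
For context on how the two stuck functions interact: the sketch below is a
simplified user-space model (pthreads) of the raid1 barrier mechanism,
paraphrased from the 3.8-era drivers/md/raid1.c. The field and function names
follow the kernel code, but the details are condensed assumptions for
illustration; it is not the actual kernel implementation.

    /*
     * Minimal model of the raid1 barrier logic.
     * wait_barrier() is called on the I/O submission path (the submit_bio
     * stacks above); raise_barrier() is reached from raid1_add_disk() in
     * the md2_raid1 stack above.
     */
    #include <pthread.h>

    struct r1conf_model {
            pthread_mutex_t resync_lock;
            pthread_cond_t  wait_barrier;   /* wakes both sides          */
            int barrier;                    /* >0: new I/O must wait     */
            int nr_pending;                 /* admitted, not yet ended   */
            int nr_waiting;                 /* blocked behind a barrier  */
    };

    /* I/O submission path: block while a barrier is raised, then count
     * ourselves as pending. */
    void wait_barrier(struct r1conf_model *conf)
    {
            pthread_mutex_lock(&conf->resync_lock);
            if (conf->barrier) {
                    conf->nr_waiting++;
                    while (conf->barrier)
                            pthread_cond_wait(&conf->wait_barrier,
                                              &conf->resync_lock);
                    conf->nr_waiting--;
            }
            conf->nr_pending++;
            pthread_mutex_unlock(&conf->resync_lock);
    }

    /* Called when an admitted request finally completes. */
    void allow_barrier(struct r1conf_model *conf)
    {
            pthread_mutex_lock(&conf->resync_lock);
            conf->nr_pending--;
            pthread_cond_broadcast(&conf->wait_barrier);
            pthread_mutex_unlock(&conf->resync_lock);
    }

    /* Quiesce the array: stop new I/O, then wait for pending I/O to drain.
     * If any pending request can only be completed by the very thread that
     * is raising the barrier (raid1d), nr_pending never reaches zero and
     * the array wedges -- consistent with the stacks quoted above. */
    void raise_barrier(struct r1conf_model *conf)
    {
            pthread_mutex_lock(&conf->resync_lock);
            while (conf->nr_waiting)        /* let queued I/O go first */
                    pthread_cond_wait(&conf->wait_barrier,
                                      &conf->resync_lock);
            conf->barrier++;                /* block any new I/O       */
            while (conf->nr_pending)        /* drain admitted I/O      */
                    pthread_cond_wait(&conf->wait_barrier,
                                      &conf->resync_lock);
            pthread_mutex_unlock(&conf->resync_lock);
    }

    /* Counterpart that ends the quiesced period and wakes waiters. */
    void lower_barrier(struct r1conf_model *conf)
    {
            pthread_mutex_lock(&conf->resync_lock);
            conf->barrier--;
            pthread_cond_broadcast(&conf->wait_barrier);
            pthread_mutex_unlock(&conf->resync_lock);
    }

The user-space hangs are most likely secondary: md_attr_show() and
md_seq_show() both take the per-array reconfiguration lock, which
md_check_recovery() is holding while raid1_add_disk() sits in
raise_barrier(), so sysfs and /proc/mdstat readers simply queue up behind
the stuck md2_raid1 thread.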