Re: md raid10 Oops on recent kernels

On Mon, 13 Aug 2012 16:49:26 +0400 Ivan Vasilyev <ivan.vasilyev@xxxxxxxxx>
wrote:

> Hi all,
> 
> I'm using md RAID over LVM on some servers (since the EVMS project has
> proven to be dead), but on kernel versions 3.4 and 3.5 there is a
> problem with raid10.
> It can be reproduced on current Debian Wheezy (set up from scratch with
> the 7.0beta1 installer) with the kernel package v3.5 taken
> from the experimental repository.
> 
> Array creation, initial sync (after "dd ... of=/dev/md/rtest_a"), and
> --assemble give no errors,
> but any direct I/O on the md device then causes an oops (dd without
> iflag=direct does not).
> Strangely, V4L capture by the uvcvideo driver also freezes after the first
> oops (and resumes only after "mdadm --stop" on the problematic array).
> 
> Recent LVM2 has built-in RAID (implemented with md driver), but
> unfortunately raid10 is not supported, so it can't replace current
> setup.
> 
> Is this a bug in the MD driver or in some other part of the kernel? Will it
> affect other RAID setups in the future (like the old one with raid0 layered
> over raid1)?
> 
> 
> ------------------------------------------------------------
> 
> Tested on a KVM guest, so hardware seems to be irrelevant.
> Config: 1.5 GB memory, 2 vCPUs, 5 virtio disks
> 
> 
> *** Short summary of commands:
> vgcreate gurion_vg_jnt /dev/vdb6 /dev/vdc6 /dev/vdd6 /dev/vde6
> lvcreate -n rtest_a_c1r -l 129 gurion_vg_jnt /dev/vdb6
> ...
> lvcreate -n rtest_a_c4r -l 129 gurion_vg_jnt /dev/vde6
> mdadm --create /dev/md/rtest_a --verbose --metadata=1.2 \
>   --level=raid10 --raid-devices=4 --name=rtest_a \
>   --chunk=1024 --bitmap=internal \
>   /dev/gurion_vg_jnt/rtest_a_c1r /dev/gurion_vg_jnt/rtest_a_c2r \
>   /dev/gurion_vg_jnt/rtest_a_c3r /dev/gurion_vg_jnt/rtest_a_c4r
> 
> 
> Linux version 3.5-trunk-amd64 (Debian 3.5-1~experimental.1)
> (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 4.6.3 (Debian 4.6.3-1) )
> #1 SMP Thu Aug 2 17:16:27 UTC 2012
> 
> ii  linux-image-3.5-trunk-amd64                  3.5-1~experimental.1
> ii  mdadm                                        3.2.5-1
> 
> (oops is captured after "mdadm --assemble /dev/md/rtest_a" and then "lvs")
> ----------
>  BUG: unable to handle kernel paging request at ffffffff00000001
>  IP: [<ffffffff00000001>] 0xffffffff00000000
>  PGD 160d067 PUD 0
>  Oops: 0010 [#1] SMP
>  CPU 0
>  Modules linked in: appletalk ipx p8023 p8022 psnap llc rose netrom
> ax25 iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables nfsd nfs
> nfs_acl auth_rpcgss fscache lockd sunrpc loop crc32c_intel
> ghash_clmulni_intel processor aesni_intel aes_x86_64 i2c_piix4
> aes_generic cryptd thermal_sys button snd_pcm i2c_core snd_page_alloc
> snd_timer snd soundcore psmouse pcspkr serio_raw evdev microcode
> virtio_balloon ext4 crc16 jbd2 mbcache dm_mod raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor xor async_tx
> raid6_pq raid1 raid0 multipath linear md_mod sr_mod cdrom ata_generic
> virtio_net floppy virtio_blk ata_piix uhci_hcd ehci_hcd libata
> scsi_mod virtio_pci virtio_ring virtio usbcore usb_common [last
> unloaded: scsi_wait_scan]
> 
>  Pid: 11591, comm: lvs Not tainted 3.5-trunk-amd64 #1 Bochs Bochs
>  RIP: 0010:[<ffffffff00000001>]  [<ffffffff00000001>] 0xffffffff00000000
>  RSP: 0018:ffff88005a601a58  EFLAGS: 00010292
>  RAX: 0000000000100000 RBX: ffff88005cc34c80 RCX: ffff88005d334440
>  RDX: 0000000000000000 RSI: ffff88005a601a68 RDI: ffff88005b3d1c00
>  RBP: 0000000000000000 R08: ffffffffa017e99c R09: 0000000000000001
>  R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
>  R13: ffff88005cc34d00 R14: ffffea00010d7d60 R15: 0000000000000000
>  FS:  00007fd8fcef77a0(0000) GS:ffff88005f200000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: ffffffff00000001 CR3: 000000005f836000 CR4: 00000000000407f0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>  Process lvs (pid: 11591, threadinfo ffff88005a600000, task ffff88005f8ae040)
>  Stack:
>   ffff880054ad0c80 ffffffff81126dec ffff880057065900 0000000000000400
>   ffffea0000000000 0000000000000000 ffff88005a601b80 ffff8800575ded40
>   ffff88005a601c20 0000000000000000 0000000000000000 ffffffff811299b5
>  Call Trace:
>   [<ffffffff81126dec>] ? bio_alloc+0xe/0x1e
>   [<ffffffff811299b5>] ? dio_bio_add_page+0x16/0x4c
>   [<ffffffff81129a51>] ? dio_send_cur_page+0x66/0xa4
>   [<ffffffff8112a4dc>] ? do_blockdev_direct_IO+0x8cb/0xa81
>   [<ffffffff8125ed7e>] ? kobj_lookup+0xf6/0x12e
>   [<ffffffff811a13c7>] ? disk_map_sector_rcu+0x5d/0x5d
>   [<ffffffff811a2d9f>] ? disk_clear_events+0x3f/0xe4
>   [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b
>   [<ffffffff81128000>] ? blkdev_direct_IO+0x4e/0x53
>   [<ffffffff8112873a>] ? blkdev_max_block+0x2b/0x2b
>   [<ffffffff810bbf07>] ? generic_file_aio_read+0xeb/0x5b5
>   [<ffffffff811103fd>] ? dput+0x26/0xf4
>   [<ffffffff81115b87>] ? mntput_no_expire+0x2a/0x134
>   [<ffffffff8110b3fc>] ? do_last+0x67d/0x717
>   [<ffffffff810ffe44>] ? do_sync_read+0xb4/0xec
>   [<ffffffff8110051e>] ? vfs_read+0x9f/0xe6
>   [<ffffffff811005aa>] ? sys_read+0x45/0x6b
>   [<ffffffff81364779>] ? system_call_fastpath+0x16/0x1b
>  Code:  Bad RIP value.
>  RIP  [<ffffffff00000001>] 0xffffffff00000000
>   RSP <ffff88005a601a58>
>  CR2: ffffffff00000001
>  ---[ end trace b86c49ca25a6cdb2 ]---
> ----------

It looks like ->merge_bvec_fn is bad - the code is jumping to
0xffffffff00000001, which strongly suggests a corrupted function pointer, and
merge_bvec_fn is the only one in that area of code.
However, I cannot see how it could possibly get a bad value like that.

There were changes to merge_bvec_fn handling in RAID10 in 3.4, which is when
you say the problem appeared.  However, I cannot see how direct IO would be
affected any differently from normal IO.

If I were to try to debug this I'd build a kernel and put a printk in
__bio_add_page in fs/bio.c, just before the call to q->merge_bvec_fn, that
prints a message if that pointer has its low bit set
(i.e. if ((unsigned long)q->merge_bvec_fn & 1) ...).
I don't know if you are up for that sort of thing...

NeilBrown

Attachment: signature.asc
Description: PGP signature

