Re: RBD hard crash on kernel 3.10

I took the rbd and ceph drivers out of the patched kernel above and merged them into Xen's kernel.  Works as well as the old one; still crashes.  But now I get logs.  From the Xen logs:

[   1128.217561]    ERR: 
Assertion failure in rbd_img_obj_callback() at line 2363:

rbd_assert(more ^ (which == img_request->obj_request_count));

[   1128.217590]   WARN: ------------[ cut here ]------------
[   1128.217593]   CRIT: kernel BUG at drivers/block/rbd.c:2363!
[   1128.217596]   WARN: invalid opcode: 0000 [#1] SMP 
[   1128.217599]   WARN: Modules linked in: rbd libceph lockd sunrpc openvswitch(O) gre libcrc32c ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_conntrack nf_conntrack iptable_filter ip_tables x_tables dm_multipath nls_utf8 isofs dm_mirror video backlight sbs sbshc hed acpi_ipmi ipmi_msghandler nvram hid_generic usbhid hid sg sr_mod psmouse cdrom serio_raw wmi tpm_infineon e1000e(O) tpm_tis tpm tpm_bios ehci_pci lpc_ich i2c_i801 mfd_core shpchp ptp pps_core microcode crc32_pclmul scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh dm_region_hash dm_log dm_mod ahci libahci libata sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd
[   1128.217657]   WARN: CPU: 3 PID: 25236 Comm: kworker/3:3 Tainted: G           O 3.10.0+2 #1
[   1128.217660]   WARN: Hardware name: Hewlett-Packard HP Compaq 6200 Pro MT PC/1497, BIOS J01 v02.15 11/10/2011
[   1128.217671]   WARN: Workqueue: ceph-msgr con_work [libceph]
[   1128.217674]   WARN: task: ffff88001e96c530 ti: ffff880009934000 task.ti: ffff880009934000
[   1128.217677]   WARN: RIP: e030:[<ffffffffa038e081>]  [<ffffffffa038e081>] rbd_img_obj_callback+0x3b1/0x470 [rbd]
[   1128.217684]   WARN: RSP: e02b:ffff880009935b38  EFLAGS: 00010092
[   1128.217687]   WARN: RAX: 000000000000007b RBX: ffff88000cb47828 RCX: ffff88002ed8f9f0
[   1128.217690]   WARN: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880009930078
[   1128.217693]   WARN: RBP: ffff880009935b88 R08: 0000000000000000 R09: ffff8800237103c0
[   1128.217696]   WARN: R10: 0000000000000001 R11: 0000000000000006 R12: 0000000000000001
[   1128.217700]   WARN: R13: 0000000000000000 R14: ffff88000cb477f8 R15: 0000000000000002
[   1128.217706]   WARN: FS:  00007f1189ea5730(0000) GS:ffff88002ed80000(0000) knlGS:0000000000000000
[   1128.217710]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[   1128.217712]   WARN: CR2: ffffffffff600000 CR3: 000000000c525000 CR4: 0000000000002660
[   1128.217716]   WARN: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   1128.217719]   WARN: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   1128.217731]   WARN: Stack:
[   1128.217733]   WARN:  0000000000000000 ffff88000cb47858 ffff88000cb47834 ffff88001808cf00
[   1128.217746]   WARN:  0000000000004040 ffff880004f68580 ffff880004f68580 0000000000000004
[   1128.217750]   WARN:  ffff880024b5f768 000000000001f796 ffff880009935ba8 ffffffffa038ad61
[   1128.217755]   WARN: Call Trace:
[   1128.217760]   WARN:  [<ffffffffa038ad61>] rbd_obj_request_complete+0x51/0x70 [rbd]
[   1128.217765]   WARN:  [<ffffffffa038ffdb>] rbd_osd_req_callback+0x57b/0x5a0 [rbd]
[   1128.217773]   WARN:  [<ffffffffa035afaa>] ? __unregister_request+0x10a/0x150 [libceph]
[   1128.217780]   WARN:  [<ffffffffa035ebbf>] dispatch+0x6af/0xa10 [libceph]
[   1128.217785]   WARN:  [<ffffffff814362b4>] ? kernel_recvmsg+0x44/0x60
[   1128.217791]   WARN:  [<ffffffffa0356102>] try_read+0x1282/0x1430 [libceph]
[   1128.217795]   WARN:  [<ffffffff8101ade8>] ? __kernel_fpu_end+0x48/0x60
[   1128.217800]   WARN:  [<ffffffff8104bb5c>] ? crc32c_pcl_intel_update+0x7c/0xb0
[   1128.217803]   WARN:  [<ffffffff81436311>] ? kernel_sendmsg+0x41/0x60
[   1128.217809]   WARN:  [<ffffffffa0353c29>] ? ceph_tcp_sendmsg+0x59/0x70 [libceph]
[   1128.217815]   WARN:  [<ffffffffa0356535>] con_work+0x285/0xfe0 [libceph]
[   1128.217819]   WARN:  [<ffffffff81071b78>] process_one_work+0x238/0x390
[   1128.217823]   WARN:  [<ffffffff81072d19>] worker_thread+0x1d9/0x2c0
[   1128.217827]   WARN:  [<ffffffff81072b40>] ? manage_workers+0x1f0/0x1f0
[   1128.217831]   WARN:  [<ffffffff810780e3>] kthread+0xc3/0xd0
[   1128.217834]   WARN:  [<ffffffff8100367e>] ? xen_end_context_switch+0x1e/0x30
[   1128.217838]   WARN:  [<ffffffff81078020>] ? flush_kthread_worker+0xd0/0xd0
[   1128.217843]   WARN:  [<ffffffff8150edec>] ret_from_fork+0x7c/0xb0
[   1128.217846]   WARN:  [<ffffffff81078020>] ? flush_kthread_worker+0xd0/0xd0
[   1128.217849]   WARN: Code: 39 7e 5c 0f 94 c0 39 c2 75 25 48 c7 c1 80 33 39 a0 ba 3b 09 00 00 48 c7 c6 70 46 39 a0 48 c7 c7 e0 2b 39 a0 31 c0 e8 2f 76 cc e0 <0f> 0b eb fe 45 89 7e 40 48 8b 7d c0 e8 0e 28 c8 e0 66 90 ff 14 
[   1128.217883]  ALERT: RIP  [<ffffffffa038e081>] rbd_img_obj_callback+0x3b1/0x470 [rbd]
[   1128.217888]   WARN:  RSP <ffff880009935b38>
[   1128.225735]   WARN: ---[ end trace be7965a9853d30d1 ]---
[   1128.225814]  ALERT: BUG: unable to handle kernel paging request at ffffffffffffffd8
[   1128.225819]  ALERT: IP: [<ffffffff81077640>] kthread_data+0x10/0x20
[   1128.225831]   WARN: PGD 1a0f067 PUD 1a11067 PMD 0 
[   1128.225835]   WARN: Oops: 0000 [#2] SMP 

On Thu, Apr 9, 2015 at 5:59 AM Shawn Edwards <lesser.evil@xxxxxxxxx> wrote:

Thanks for the pointer to the patched kernel.  I'll give that a shot.


On Thu, Apr 9, 2015, 5:56 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
On Wed, Apr 8, 2015 at 5:25 PM, Shawn Edwards <lesser.evil@xxxxxxxxx> wrote:
> We've been working on a storage repository for xenserver 6.5, which uses the
> 3.10 kernel (ugh).  I got the xenserver guys to include the rbd and libceph
> kernel modules into the 6.5 release, so that's at least available.
>
> Where things go bad is when we have many (>10 or so) VMs on one host, all
> using RBD clones for the storage mapped using the rbd kernel module.  The
> Xenserver crashes so badly that it doesn't even get a chance to kernel
> panic.  The whole box just hangs.

I'm not very familiar with Xen and ways to debug it, but if the problem
lies in the libceph or rbd kernel modules we'd like to fix it.  Perhaps
try grabbing a vmcore?  If it just hangs and doesn't panic, you can
normally induce a crash with a sysrq.

>
> Has anyone else seen this sort of behavior?
>
> We have a lot of ways to try to work around this, but none of them are very
> pretty:
>
> * move the code to user space, ditch the kernel driver:  The build tools for
> Xenserver are all CentOS5-based, and it is painful to get all of the deps
> built for the ceph user-space libs.
>
> * backport the ceph and rbd kernel modules to 3.10.  Has proven painful, as
> the block device code changed somewhere in the 3.14-3.16 timeframe.

https://github.com/ceph/ceph-client/commits/rhel7-3.10.0-123.9.3 branch
would be a good start - it has libceph.ko and rbd.ko as of 3.18-rc5
backported to rhel7 (which is based on 3.10) and may be updated in the
future as well, although no promises on that.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
