Roger. Thanks for the heads-up on that.

On Mon, May 20, 2013 at 2:18 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Mon, 20 May 2013, Stefan Priebe - Profihost AG wrote:
>> On 20.05.2013 at 18:29, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>
>> > Hi Travis,
>> >
>> > The fixes for this locking just went upstream for 3.10. We'll be sending
>> > to Greg KH for the stable kernels shortly.
>> >
>> > sage
>>
>> But this won't change anything for 3.8 as this is EOL.
>
> Yeah, it'll go to 3.9 and 3.4.
>
> sage
>
>
>>
>> >
>> >
>> > On Mon, 20 May 2013, Travis Rhoden wrote:
>> >
>> >> Sage,
>> >>
>> >> Did a patch for the auth code get submitted for the 3.8 kernel? I hit
>> >> this again over the weekend. Looks slightly different than the last
>> >> one, but still in the auth code.
>> >>
>> >> May 18 13:26:15 nfs1 kernel: [999560.730733] BUG: unable to handle
>> >> kernel paging request at ffff880640000000
>> >> May 18 13:26:15 nfs1 kernel: [999560.737818] IP: [<ffffffff8135ca9d>]
>> >> memcpy+0xd/0x110
>> >> May 18 13:26:15 nfs1 kernel: [999560.742974] PGD 1c0e063 PUD 0
>> >> May 18 13:26:15 nfs1 kernel: [999560.746150] Oops: 0000 [#1] SMP
>> >> May 18 13:26:15 nfs1 kernel: [999560.749498] Modules linked in: btrfs
>> >> zlib_deflate ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs reiserfs
>> >> ext2 rbd libceph libcrc32c nfsd nfs_acl auth_rpcgss nfs fscache lockd
>> >> coretemp sunrpc kvm gpio_ich psmouse microcode serio_raw i7core_edac
>> >> ioatdma lpc_ich edac_core ipmi_si mac_hid ipmi_devintf ipmi_msghandler
>> >> bonding lp parport tcp_bic raid10 raid456 async_pq async_xor xor
>> >> async_memcpy async_raid6_recov hid_generic usbhid hid raid6_pq
>> >> async_tx igb ahci myri10ge raid1 ptp libahci raid0 dca pps_core
>> >> multipath linear
>> >> May 18 13:26:15 nfs1 kernel: [999560.796421] CPU 0
>> >> May 18 13:26:15 nfs1 kernel: [999560.798353] Pid: 26234, comm:
>> >> kworker/0:0 Not tainted 3.8.5-030805-generic #201303281651 Penguin
>> >> Computing Relion 1751/X8DTU
>> >> May 18 13:26:15 nfs1 kernel: [999560.809827] RIP:
>> >> 0010:[<ffffffff8135ca9d>] [<ffffffff8135ca9d>] memcpy+0xd/0x110
>> >> May 18 13:26:15 nfs1 kernel: [999560.817403] RSP:
>> >> 0018:ffff88062dc3dc40 EFLAGS: 00010246
>> >> May 18 13:26:15 nfs1 kernel: [999560.822794] RAX: ffffc90017f4301a
>> >> RBX: ffff880323ba4300 RCX: 1ffff100c2f035b2
>> >> May 18 13:26:15 nfs1 kernel: [999560.830003] RDX: 0000000000000000
>> >> RSI: ffff880640000000 RDI: ffffc9002c335952
>> >> May 18 13:26:15 nfs1 kernel: [999560.837209] RBP: ffff88062dc3dc98
>> >> R08: ffffc90043b52000 R09: ffff88062dc3dad4
>> >> May 18 13:26:15 nfs1 kernel: [999560.844417] R10: ffff88027a45f0e8
>> >> R11: ffff88033fffbec0 R12: ffffc90017f4301a
>> >> May 18 13:26:15 nfs1 kernel: [999560.851626] R13: 000000002bc0d708
>> >> R14: ffff880628407120 R15: 000000002bc0d6c8
>> >> May 18 13:26:15 nfs1 kernel: [999560.858834] FS:
>> >> 0000000000000000(0000) GS:ffff880333c00000(0000)
>> >> knlGS:0000000000000000
>> >> May 18 13:26:15 nfs1 kernel: [999560.867000] CS: 0010 DS: 0000 ES:
>> >> 0000 CR0: 000000008005003b
>> >> May 18 13:26:15 nfs1 kernel: [999560.872824] CR2: ffff880640000000
>> >> CR3: 0000000001c0d000 CR4: 00000000000007f0
>> >> May 18 13:26:15 nfs1 kernel: [999560.880032] DR0: 0000000000000000
>> >> DR1: 0000000000000000 DR2: 0000000000000000
>> >> May 18 13:26:15 nfs1 kernel: [999560.887239] DR3: 0000000000000000
>> >> DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> >> May 18 13:26:15 nfs1 kernel: [999560.894446] Process kworker/0:0 (pid:
>> >> 26234, threadinfo ffff88062dc3c000, task ffff88032d8845c0)
>> >> May 18 13:26:15 nfs1 kernel: [999560.903298] Stack:
>> >> May 18 13:26:15 nfs1 kernel: [999560.905399] ffffffffa0368a54
>> >> ffffffffa035b60d 2bc0d6c8a0368d12 0000000000000098
>> >> May 18 13:26:15 nfs1 kernel: [999560.912942] 00000000000000c0
>> >> ffffffffa03687bc ffff880323ba4300 ffff880322fec4d8
>> >> May 18 13:26:15 nfs1 kernel: [999560.920471] ffff880628407120
>> >> ffff88032bdf5c40 ffff880322fec420 ffff88062dc3dcd8
>> >> May 18 13:26:15 nfs1 kernel: [999560.928016] Call Trace:
>> >> May 18 13:26:15 nfs1 kernel: [999560.930561] [<ffffffffa0368a54>] ?
>> >> ceph_x_build_authorizer.isra.6+0x144/0x1e0 [libceph]
>> >> May 18 13:26:15 nfs1 kernel: [999560.938727] [<ffffffffa035b60d>] ?
>> >> ceph_buffer_release+0x2d/0x50 [libceph]
>> >> May 18 13:26:15 nfs1 kernel: [999560.945761] [<ffffffffa03687bc>] ?
>> >> ceph_x_destroy_authorizer+0x2c/0x40 [libceph]
>> >> May 18 13:26:15 nfs1 kernel: [999560.953315] [<ffffffffa0368d2e>]
>> >> ceph_x_create_authorizer+0x6e/0xd0 [libceph]
>> >> May 18 13:26:15 nfs1 kernel: [999560.960609] [<ffffffffa035db49>]
>> >> get_authorizer+0x89/0xc0 [libceph]
>> >> May 18 13:26:15 nfs1 kernel: [999560.967035] [<ffffffffa0357704>]
>> >> prepare_write_connect+0xb4/0x210 [libceph]
>> >> May 18 13:26:15 nfs1 kernel: [999560.974161] [<ffffffffa035b2a5>]
>> >> try_read+0x3d5/0x430 [libceph]
>> >> May 18 13:26:15 nfs1 kernel: [999560.980249] [<ffffffffa035b38f>]
>> >> con_work+0x8f/0x140 [libceph]
>> >> May 18 13:26:15 nfs1 kernel: [999560.986242] [<ffffffff81078c31>]
>> >> process_one_work+0x141/0x490
>> >> May 18 13:26:15 nfs1 kernel: [999560.992153] [<ffffffff81079b08>]
>> >> worker_thread+0x168/0x400
>> >> May 18 13:26:15 nfs1 kernel: [999560.997800] [<ffffffff810799a0>] ?
>> >> manage_workers+0x120/0x120
>> >> May 18 13:26:15 nfs1 kernel: [999561.003713] [<ffffffff8107eff0>]
>> >> kthread+0xc0/0xd0
>> >> May 18 13:26:15 nfs1 kernel: [999561.008669] [<ffffffff8107ef30>] ?
>> >> flush_kthread_worker+0xb0/0xb0
>> >> May 18 13:26:15 nfs1 kernel: [999561.014927] [<ffffffff816f532c>]
>> >> ret_from_fork+0x7c/0xb0
>> >> May 18 13:26:15 nfs1 kernel: [999561.020401] [<ffffffff8107ef30>] ?
>> >> flush_kthread_worker+0xb0/0xb0
>> >> May 18 13:26:15 nfs1 kernel: [999561.026657] Code: 2b 43 50 88 43 4e
>> >> 48 83 c4 08 5b 5d c3 90 e8 fb fd ff ff eb e6 90 90 90 90 90 90 90 90
>> >> 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 20
>> >> 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c
>> >> May 18 13:26:15 nfs1 kernel: [999561.046667] RIP [<ffffffff8135ca9d>]
>> >> memcpy+0xd/0x110
>> >> May 18 13:26:15 nfs1 kernel: [999561.051903] RSP <ffff88062dc3dc40>
>> >> May 18 13:26:15 nfs1 kernel: [999561.055477] CR2: ffff880640000000
>> >> May 18 13:26:15 nfs1 kernel: [999561.058894] ---[ end trace
>> >> 2fa4f8a71fe96709 ]---
>> >>
>> >> Thanks!
>> >>
>> >> - Travis
>> >>
>> >> On Tue, May 7, 2013 at 10:54 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>> >>> Thanks Sage, I'll monitor the 3.8 point releases and update when I see
>> >>> a release with those changes.
>> >>>
>> >>> - Travis
>> >>>
>> >>> On Mon, May 6, 2013 at 10:54 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>> >>>> On Mon, 6 May 2013, Travis Rhoden wrote:
>> >>>>> Hey folks,
>> >>>>>
>> >>>>> We have two servers that map a lot of RBDs (20 to 30 each so far),
>> >>>>> using the RBD kernel module. They are running Ubuntu 12.10, and I
>> >>>>> originally saw a lot of kernel panics (obviously from Ceph) when
>> >>>>> running a 3.5.7 kernel.
>> >>>>>
>> >>>>> I upgraded a while back to a 3.8.5 kernel to get a much newer RBD
>> >>>>> module, and the kernel panics from Ceph went away...and were replaced
>> >>>>> by these nebulous "General Protection Faults" whose cause I couldn't
>> >>>>> really identify.
>> >>>>>
>> >>>>> Today we saw one that actually had a Ceph backtrace in it, so I wanted
>> >>>>> to throw it on here:
>> >>>>>
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.423165] general protection fault:
>> >>>>> 0000 [#3] SMP
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.428252] Modules linked in: rbd
>> >>>>> libceph libcrc32c coretemp nfsd kvm nfs_acl auth_rpcgss nfs fscache
>> >>>>> lockd sunrpc gpio_ich psmouse microcode serio_raw i7core_edac ipmi_si
>> >>>>> edac_core lpc_ich ioatdma ipmi_devintf mac_hid ipmi_msghandler bonding
>> >>>>> lp parport tcp_bic raid10 raid456 async_pq async_xor xor async_memcpy
>> >>>>> async_raid6_recov hid_generic raid6_pq usbhid async_tx hid igb raid1
>> >>>>> myri10ge raid0 ahci ptp libahci dca pps_core multipath linear
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.468114] CPU 17
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.470133] Pid: 15920, comm:
>> >>>>> kworker/17:2 Tainted: G D 3.8.5-030805-generic #201303281651
>> >>>>> Penguin Computing Relion 1751/X8DTU
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.482635] RIP:
>> >>>>> 0010:[<ffffffff811851ff>] [<ffffffff811851ff>]
>> >>>>> kmem_cache_alloc_trace+0x5f/0x140
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.491686] RSP:
>> >>>>> 0018:ffff880624cb1a98 EFLAGS: 00010202
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.497074] RAX: 0000000000000000
>> >>>>> RBX: ffff88032ddc46d0 RCX: 000000000003c867
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.504283] RDX: 000000000003c866
>> >>>>> RSI: 0000000000008050 RDI: 0000000000016c80
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.511490] RBP: ffff880624cb1ae8
>> >>>>> R08: ffff880333d76c80 R09: 0000000000000002
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.518697] R10: ffff88032ce40070
>> >>>>> R11: 000000000000000d R12: ffff880333802200
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.525906] R13: 2e0460b9275465f2
>> >>>>> R14: ffffffffa023901e R15: 0000000000008050
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.533113] FS:
>> >>>>> 0000000000000000(0000) GS:ffff880333d60000(0000)
>> >>>>> knlGS:0000000000000000
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.541274] CS: 0010 DS: 0000 ES:
>> >>>>> 0000 CR0: 000000008005003b
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.547095] CR2: 00007fbf9467f2b0
>> >>>>> CR3: 0000000001c0d000 CR4: 00000000000007e0
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.554305] DR0: 0000000000000000
>> >>>>> DR1: 0000000000000000 DR2: 0000000000000000
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.561512] DR3: 0000000000000000
>> >>>>> DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.568720] Process kworker/17:2
>> >>>>> (pid: 15920, threadinfo ffff880624cb0000, task ffff88032b600000)
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.577656] Stack:
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.579756] 0000000000000000
>> >>>>> 0000000000000000 0000000000000060 0000000000000000
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.587292] 0000000000000000
>> >>>>> ffff88032ddc46d0 0000000000000004 ffff88032ddc46c0
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.594819] ffff88032b432b30
>> >>>>> 0000000000000000 ffff880624cb1b28 ffffffffa023901e
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.602347] Call Trace:
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.604886] [<ffffffffa023901e>]
>> >>>>> get_ticket_handler.isra.4+0x5e/0xc0 [libceph]
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.612271] [<ffffffffa02394b4>]
>> >>>>> ceph_x_proc_ticket_reply+0x274/0x440 [libceph]
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.619740] [<ffffffffa023973d>]
>> >>>>> ceph_x_handle_reply+0xbd/0x110 [libceph]
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.626696] [<ffffffffa023765c>]
>> >>>>> ceph_handle_auth_reply+0x18c/0x200 [libceph]
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.633988] [<ffffffffa022d590>]
>> >>>>> handle_auth_reply.isra.12+0xa0/0x230 [libceph]
>> >>>>
>> >>>> Ah, this is in the auth code. There was a series of patches that fixed
>> >>>> the locking and a few other things that just went upstream for 3.10. I'll
>> >>>> prepare some patches to backport those fixes to stable kernels (3.8 and
>> >>>> 3.4). It could easily explain your crashes.
>> >>>>
>> >>>> Thanks!
>> >>>> sage
>> >>>>
>> >>>>
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.641457] [<ffffffffa022e87d>]
>> >>>>> dispatch+0xbd/0x120 [libceph]
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.647450] [<ffffffffa0228205>]
>> >>>>> process_message+0xa5/0xc0 [libceph]
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.653966] [<ffffffffa022c1b1>]
>> >>>>> try_read+0x2e1/0x430 [libceph]
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.660048] [<ffffffffa022c38f>]
>> >>>>> con_work+0x8f/0x140 [libceph]
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.666043] [<ffffffff81078c31>]
>> >>>>> process_one_work+0x141/0x490
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.671952] [<ffffffff81079b08>]
>> >>>>> worker_thread+0x168/0x400
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.677601] [<ffffffff810799a0>] ?
>> >>>>> manage_workers+0x120/0x120
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.683513] [<ffffffff8107eff0>]
>> >>>>> kthread+0xc0/0xd0
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.688469] [<ffffffff8107ef30>] ?
>> >>>>> flush_kthread_worker+0xb0/0xb0
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.694726] [<ffffffff816f532c>]
>> >>>>> ret_from_fork+0x7c/0xb0
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.700203] [<ffffffff8107ef30>] ?
>> >>>>> flush_kthread_worker+0xb0/0xb0
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.706456] Code: 00 4d 8b 04 24 65
>> >>>>> 4c 03 04 25 08 dc 00 00 49 8b 50 08 4d 8b 28 4d 85 ed 0f 84 cf 00 00
>> >>>>> 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65
>> >>>>> 48 0f c7 0f 0f 94 c0 84 c0 74 c2 49
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.726468] RIP [<ffffffff811851ff>]
>> >>>>> kmem_cache_alloc_trace+0x5f/0x140
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.733182] RSP <ffff880624cb1a98>
>> >>>>> May 6 23:02:58 nfs1 kernel: [295972.736838] ---[ end trace
>> >>>>> 20e9b6a1bb611aba ]---
>> >>>>>
>> >>>>> I'm not sure whether the problem started here or not. I mentioned
>> >>>>> that the previous GPFs were nebulous -- one thing most of them have
>> >>>>> had in common is that it's almost always from nfsd (this one isn't --
>> >>>>> first and only time I've seen this one). However, I am using NFS to
>> >>>>> re-export some RBDs (to provide access to multiple clients) so Ceph is
>> >>>>> still in the picture on those.
>> >>>>>
>> >>>>> I know it's not a lot to go on, but any advice would be appreciated.
>> >>>>>
>> >>>>> - Travis
>> >>>>> --
>> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
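
For anyone triaging similar oopses, here is a rough sketch of pulling the call-trace frames out of a log and counting them per module, which may help confirm whether a crash passed through libceph. It is only an illustration in Python, not anything from the Ceph or kernel trees: the names parse_frames and modules_in_trace and the "vmlinux" label for core-kernel frames are made up here, and the sample lines are frames from the May 18 trace with the mail wrapping and quote markers removed.

```python
import re
from collections import Counter

# One call-trace frame as printed in the oopses above, for example:
#   [<ffffffffa0368d2e>] ceph_x_create_authorizer+0x6e/0xd0 [libceph]
# A "?" before the symbol marks a frame the unwinder is not sure about.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\][ \t]+"
    r"(?P<unreliable>\?[ \t]+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return a list of dicts describing every frame found in log_text."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        frames.append({
            "symbol": match.group("symbol"),
            # Frames without a [module] tag belong to the core kernel;
            # "vmlinux" is just a label chosen for this sketch.
            "module": match.group("module") or "vmlinux",
            "reliable": match.group("unreliable") is None,
        })
    return frames


def modules_in_trace(log_text):
    """Count reliable frames per module."""
    return Counter(f["module"] for f in parse_frames(log_text) if f["reliable"])


# Frames copied from the May 18 oops, unwrapped onto single lines.
SAMPLE = """\
[999560.930561] [<ffffffffa0368a54>] ? ceph_x_build_authorizer.isra.6+0x144/0x1e0 [libceph]
[999560.953315] [<ffffffffa0368d2e>] ceph_x_create_authorizer+0x6e/0xd0 [libceph]
[999560.960609] [<ffffffffa035db49>] get_authorizer+0x89/0xc0 [libceph]
[999560.986242] [<ffffffff81078c31>] process_one_work+0x141/0x490
"""
```

Calling modules_in_trace(SAMPLE) counts two reliable libceph frames and one core-kernel frame. The pattern expects each frame on one line, so a trace pasted from a wrapped mail like this one needs its lines rejoined first, and RIP/IP lines in the same address format would also be counted as frames.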