Hi,

I'm trying to test the btrfs and ceph contributions to 3.11, without
testing all of 3.11-rc1 (just yet), so I'm testing with the "next"
branch of Chris Mason's tree (commit cbacd76bb3 from
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git)
merged into the for-linus branch of the ceph tree (commit 8b8cf8917f from
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git).

One of my ceph clients hit this:

[94633.463166] BUG: unable to handle kernel paging request at ffffffffffffffa8
[94633.464003] IP: [<ffffffff8106a070>] kthread_data+0x10/0x20
[94633.464003] PGD 1a0c067 PUD 1a0e067 PMD 0
[94633.464003] Oops: 0000 [#2] SMP
[94633.464003] Modules linked in: cbc ceph libceph ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_multipath scsi_dh scsi_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm microcode button serio_raw pcspkr ehci_pci ehci_hcd ib_mthca ib_mad ib_core lpc_ich mfd_core uhci_hcd i5k_amb i5000_edac edac_core dm_mod nfsv4 nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 bnx2 igb ptp pps_core i2c_algo_bit i2c_core dca hwmon e1000
[94633.464003] CPU: 0 PID: 78416 Comm: kworker/0:1 Tainted: G D W 3.10.0-00119-g2925339 #601
[94633.464003] Hardware name: Dell Inc. PowerEdge 1950/0NK937, BIOS 1.1.0 06/21/2006
[94633.464003] task: ffff880415b60000 ti: ffff88040e39a000 task.ti: ffff88040e39a000
[94633.464003] RIP: 0010:[<ffffffff8106a070>] [<ffffffff8106a070>] kthread_data+0x10/0x20
[94633.464003] RSP: 0018:ffff88040e39b7f8 EFLAGS: 00010092
[94633.464003] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff81d30320
[94633.464003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880415b60000
[94633.464003] RBP: ffff88040e39b7f8 R08: ffff880415b60070 R09: 0000000000000001
[94633.464003] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[94633.464003] R13: ffff880415b603e8 R14: 0000000000000001 R15: 0000000000000002
[94633.464003] FS: 0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
[94633.464003] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[94633.464003] CR2: 0000000000000028 CR3: 0000000415f77000 CR4: 00000000000007f0
[94633.464003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[94633.464003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[94633.464003] Stack:
[94633.464003] ffff88040e39b818 ffffffff810602a5 ffff88040e39b818 ffff88042fc139c0
[94633.464003] ffff88040e39b8a8 ffffffff814ef79e ffff880400000000 ffff88040e39bfd8
[94633.464003] ffff88040e39a000 ffff88040e39a000 ffff88040e39a010 ffff88040e39a000
[94633.464003] Call Trace:
[94633.464003] [<ffffffff810602a5>] wq_worker_sleeping+0x15/0xa0
[94633.464003] [<ffffffff814ef79e>] __schedule+0x17e/0x6b0
[94633.464003] [<ffffffff814efefd>] schedule+0x5d/0x60
[94633.464003] [<ffffffff8104717b>] do_exit+0x3eb/0x440
[94633.464003] [<ffffffff814f33f8>] oops_end+0xd8/0xf0
[94633.464003] [<ffffffff810362df>] no_context+0x1bf/0x1e0
[94633.464003] [<ffffffff810364f5>] __bad_area_nosemaphore+0x1f5/0x230
[94633.464003] [<ffffffff81036543>] bad_area_nosemaphore+0x13/0x20
[94633.464003] [<ffffffff814f6406>] __do_page_fault+0x416/0x4b0
[94633.464003] [<ffffffff810869ae>] ? idle_balance+0x14e/0x180
[94633.464003] [<ffffffff81077a1f>] ? finish_task_switch+0x3f/0x110
[94633.464003] [<ffffffff814f29e3>] ? error_sti+0x5/0x6
[94633.464003] [<ffffffff8109e859>] ? trace_hardirqs_off_caller+0x29/0xd0
[94633.464003] [<ffffffff8128c6dd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[94633.464003] [<ffffffff814f64ae>] do_page_fault+0xe/0x10
[94633.464003] [<ffffffff814f27e2>] page_fault+0x22/0x30
[94633.464003] [<ffffffff81285a47>] ? rb_erase+0x297/0x3a0
[94633.464003] [<ffffffffa02b45d8>] __remove_osd+0x98/0xd0 [libceph]
[94633.464003] [<ffffffffa02b49c3>] __reset_osd+0xa3/0x1c0 [libceph]
[94633.464003] [<ffffffffa02b6c5b>] ? osd_reset+0x9b/0xd0 [libceph]
[94633.464003] [<ffffffffa02b695b>] __kick_osd_requests+0x7b/0x2e0 [libceph]
[94633.464003] [<ffffffffa02b6c66>] osd_reset+0xa6/0xd0 [libceph]
[94633.464003] [<ffffffffa02aeb65>] con_work+0x445/0x4a0 [libceph]
[94633.464003] [<ffffffff810635b5>] process_one_work+0x2e5/0x510
[94633.464003] [<ffffffff81063510>] ? process_one_work+0x240/0x510
[94633.464003] [<ffffffff81064975>] worker_thread+0x215/0x340
[94633.464003] [<ffffffff81064760>] ? manage_workers+0x170/0x170
[94633.464003] [<ffffffff8106aa61>] kthread+0xe1/0xf0
[94633.464003] [<ffffffff8106a980>] ? __init_kthread_worker+0x70/0x70
[94633.464003] [<ffffffff814faf5c>] ret_from_fork+0x7c/0xb0
[94633.464003] [<ffffffff8106a980>] ? __init_kthread_worker+0x70/0x70
[94633.464003] Code: 90 03 00 00 48 8b 40 98 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 90 03 00 00 <48> 8b 40 a8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66
[94633.464003] RIP [<ffffffff8106a070>] kthread_data+0x10/0x20
[94633.464003] RSP <ffff88040e39b7f8>
[94633.464003] CR2: ffffffffffffffa8
[94633.464003] ---[ end trace 89622896705a7fac ]---
[94633.464003] Fixing recursive fault but reboot is needed!
[94633.464003] ------------[ cut here ]------------

kthread_data disassembles to this:

(gdb) disassemble kthread_data
Dump of assembler code for function kthread_data:
   0xffffffff8106a060 <+0>:     push   %rbp
   0xffffffff8106a061 <+1>:     mov    %rsp,%rbp
   0xffffffff8106a064 <+4>:     callq  0xffffffff814fabc0
   0xffffffff8106a069 <+9>:     mov    0x390(%rdi),%rax
   0xffffffff8106a070 <+16>:    mov    -0x58(%rax),%rax
   0xffffffff8106a074 <+20>:    leaveq
   0xffffffff8106a075 <+21>:    retq
End of assembler dump.

and scripts/decodecode had this to say:

All code
========
   0:   90                      nop
   1:   03 00                   add    (%rax),%eax
   3:   00 48 8b                add    %cl,-0x75(%rax)
   6:   40 98                   rex cwtl
   8:   c9                      leaveq
   9:   48 c1 e8 02             shr    $0x2,%rax
   d:   83 e0 01                and    $0x1,%eax
  10:   c3                      retq
  11:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  18:   00 00 00
  1b:   55                      push   %rbp
  1c:   48 89 e5                mov    %rsp,%rbp
  1f:   66 66 66 66 90          data32 data32 data32 xchg %ax,%ax
  24:   48 8b 87 90 03 00 00    mov    0x390(%rdi),%rax
  2b:*  48 8b 40 a8             mov    -0x58(%rax),%rax     <-- trapping instruction
  2f:   c9                      leaveq
  30:   c3                      retq
  31:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  38:   00 00 00
  3b:   55                      push   %rbp
  3c:   48 89 e5                mov    %rsp,%rbp
  3f:   66                      data16

So, I think that all means that __schedule() called wq_worker_sleeping()
for a task whose vfork_done completion pointer was NULL, and to_kthread()
tried to use it.

Assuming I got that right, that's where I get stuck - I don't have a clue
where to go next to figure out what caused it.

So far I've only triggered this one instance, so I don't know how
repeatable this is.

Any ideas where I should look for what might be going wrong?

Thanks in advance for any help anyone can give me.

-- Jim
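
A quick check of the address arithmetic behind the NULL vfork_done reading above, as a minimal sketch. It assumes, from the disassembly, that %rax holds the pointer loaded from 0x390(%rdi) and that the trapping instruction reads at displacement -0x58 from it. The names fault_address and MASK64 are made up for illustration.

```python
MASK64 = (1 << 64) - 1  # x86-64 addresses wrap modulo 2**64


def fault_address(rax: int, disp: int = -0x58) -> int:
    """Effective address of `mov -0x58(%rax),%rax` for a given %rax value."""
    return (rax + disp) & MASK64


# %rax was 0 in the register dump, i.e. the pointer loaded
# from 0x390(%rdi) was NULL.
addr = fault_address(0)
print(f"{addr:016x}")
```

With %rax = 0, as in the register dump, this prints ffffffffffffffa8, which matches the address in the BUG line and the final CR2 value.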