Hello.
There is a bug in the rbd client, at least in kernel 3.4.4, and it is
completely reproducible for me.
Here is the oops:
Jul 6 10:16:52 label5.u14.univ-nantes.prive kernel: [ 329.456285]
EXT4-fs (rbd1): mounted filesystem with ordered data mode. Opts: (null)
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.709145]
libceph: osd1 172.20.14.131:6801 socket closed
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.715245] BUG:
unable to handle kernel NULL pointer dereference at 0000000000000048
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.715430] IP:
[<ffffffffa08488f0>] con_work+0xfb0/0x20b0 [libceph]
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.715554] PGD
a094cb067 PUD a0a7a7067 PMD 0
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.715758]
Oops: 0000 [#1] SMP
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.715914] CPU 0
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.715963]
Modules linked in: ext4 jbd2 crc16 rbd libceph drbd lru_cache cn
ip6table_filter ip6_tables iptable_filt
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.720338]
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.720406] Pid:
1007, comm: kworker/0:2 Not tainted 3.4.4-dsiun-120521 #111 Dell Inc.
PowerEdge M610/0V56FN
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.720637] RIP:
0010:[<ffffffffa08488f0>] [<ffffffffa08488f0>] con_work+0xfb0/0x20b0
[libceph]
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.720779] RSP:
0000:ffff880a1036dd50 EFLAGS: 00010246
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.720851] RAX:
0000000000000000 RBX: 0000000000000000 RCX: 0000000000031000
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.720925] RDX:
0000000000000000 RSI: ffff880a1092c5a0 RDI: ffff880a1092c598
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721002] RBP:
000000000004f000 R08: 0000000000000020 R09: 0000000000000000
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721100] R10:
0000000000000010 R11: ffff880a122e0f08 R12: 0000000000000001
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721173] R13:
ffff880a1092c500 R14: ffffea001430e300 R15: ffff880a0990f030
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721247] FS:
0000000000000000(0000) GS:ffff880a2fc00000(0000) knlGS:0000000000000000
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721337] CS:
0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721409] CR2:
0000000000000048 CR3: 0000000a10823000 CR4: 00000000000007f0
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721483] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721557] DR3:
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721632]
Process kworker/0:2 (pid: 1007, threadinfo ffff880a1036c000, task
ffff880a10b2f2c0)
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721721] Stack:
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.721784]
0000000200000000 ffff880a1036ddfc 0000000000000400 ffff880a00000000
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.722050]
ffff880a1036ddd8 000000000004f000 ffff880a0004f000 ffff880a00000000
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.722315]
ffff880a0990f420 ffff880a1092c5a0 ffff880a0990f308 ffff880a0990f1a8
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.722581] Call
Trace:
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.722653]
[<ffffffff810534d2>] ? process_one_work+0x122/0x3f0
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.722728]
[<ffffffffa0847940>] ? ceph_con_revoke_message+0xc0/0xc0 [libceph]
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.722819]
[<ffffffff81054c65>] ? worker_thread+0x125/0x2e0
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.722892]
[<ffffffff81054b40>] ? manage_workers.isra.25+0x1f0/0x1f0
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.722969]
[<ffffffff81059b85>] ? kthread+0x85/0x90
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.723042]
[<ffffffff813baee4>] ? kernel_thread_helper+0x4/0x10
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.723116]
[<ffffffff81059b00>] ? flush_kthread_worker+0x80/0x80
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.723189]
[<ffffffff813baee0>] ? gs_change+0x13/0x13
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.723258]
Code: ea f4 ff ff 0f 1f 80 00 00 00 00 49 83 bd 90 00 00 00 00 0f 84 ca
03 00 00 49 63 85 a0 00 00 00 49
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.727478] RIP
[<ffffffffa08488f0>] con_work+0xfb0/0x20b0 [libceph]
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.727599] RSP
<ffff880a1036dd50>
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.727664] CR2:
0000000000000048
Jul 6 10:18:38 label5.u14.univ-nantes.prive kernel: [ 434.727846] ---[
end trace 100f342b55356819 ]---
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.728031] BUG:
unable to handle kernel paging request at fffffffffffffff8
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.728192] IP:
[<ffffffff81059d27>] kthread_data+0x7/0x10
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.728313] PGD
14fe067 PUD 14ff067 PMD 0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.728517]
Oops: 0000 [#2] SMP
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.728676] CPU 0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.728725]
Modules linked in: ext4 jbd2 crc16 rbd libceph drbd lru_cache cn
ip6table_filter ip6_tables iptable_filt
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733034]
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733100] Pid:
1007, comm: kworker/0:2 Tainted: G D 3.4.4-dsiun-120521 #111 Dell
Inc. PowerEdge M610/0V5
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733330] RIP:
0010:[<ffffffff81059d27>] [<ffffffff81059d27>] kthread_data+0x7/0x10
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733470] RSP:
0000:ffff880a1036da30 EFLAGS: 00010002
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733539] RAX:
0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733612] RDX:
ffffffff8164a380 RSI: 0000000000000000 RDI: ffff880a10b2f2c0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733686] RBP:
ffff880a10b2f2c0 R08: 0000000000989680 R09: ffffffff8164a380
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733758] R10:
0000000000000800 R11: 000000000000fff8 R12: ffff880a2fc120c0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733830] R13:
0000000000000000 R14: ffff880a10b2f2b0 R15: ffff880a10b2f2c0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733904] FS:
0000000000000000(0000) GS:ffff880a2fc00000(0000) knlGS:0000000000000000
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.733993] CS:
0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.734064] CR2:
fffffffffffffff8 CR3: 0000000a10823000 CR4: 00000000000007f0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.734138] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.734211] DR3:
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.734284]
Process kworker/0:2 (pid: 1007, threadinfo ffff880a1036c000, task
ffff880a10b2f2c0)
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.734375] Stack:
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.734439]
ffffffff81055ae8 ffff880a10b2f590 ffffffff813b807d ffff880a10b2f2c0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.734706]
ffff880a10b2f2c0 ffff880a1036dfd8 ffff880a1036dfd8 ffff880a1036dfd8
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.734971]
ffff880a10b2f2c0 0000000000000001 ffff880a10b2f7a4 0000000000000000
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735237] Call
Trace:
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735309]
[<ffffffff81055ae8>] ? wq_worker_sleeping+0x8/0x90
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735386]
[<ffffffff813b807d>] ? __schedule+0x41d/0x6c0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735463]
[<ffffffff8103e2a2>] ? do_exit+0x592/0x8c0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735537]
[<ffffffff81006068>] ? oops_end+0x98/0xe0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735611]
[<ffffffff813b0f96>] ? no_context+0x24e/0x279
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735685]
[<ffffffff8102e31b>] ? do_page_fault+0x3ab/0x460
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735760]
[<ffffffff8135677b>] ? tcp_established_options+0x3b/0xd0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735833]
[<ffffffff813589aa>] ? tcp_write_xmit+0x15a/0xac0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735907]
[<ffffffff813b9179>] ? _raw_spin_lock_bh+0x9/0x30
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.735984]
[<ffffffff812f9a79>] ? release_sock+0x19/0x100
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736056]
[<ffffffff8134af43>] ? tcp_sendpage+0xf3/0x700
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736131]
[<ffffffff813b94f5>] ? page_fault+0x25/0x30
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736206]
[<ffffffffa08488f0>] ? con_work+0xfb0/0x20b0 [libceph]
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736280]
[<ffffffff810534d2>] ? process_one_work+0x122/0x3f0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736355]
[<ffffffffa0847940>] ? ceph_con_revoke_message+0xc0/0xc0 [libceph]
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736446]
[<ffffffff81054c65>] ? worker_thread+0x125/0x2e0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736518]
[<ffffffff81054b40>] ? manage_workers.isra.25+0x1f0/0x1f0
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736593]
[<ffffffff81059b85>] ? kthread+0x85/0x90
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736664]
[<ffffffff813baee4>] ? kernel_thread_helper+0x4/0x10
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736739]
[<ffffffff81059b00>] ? flush_kthread_worker+0x80/0x80
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736813]
[<ffffffff813baee0>] ? gs_change+0x13/0x13
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.736883]
Code: fe ff ff 90 eb 90 be 57 01 00 00 48 c7 c7 9b 70 47 81 e8 cd 00 fe
ff e9 94 fe ff ff 0f 1f 84 00 00
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.739914] RIP
[<ffffffff81059d27>] kthread_data+0x7/0x10
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.740036] RSP
<ffff880a1036da30>
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.740103] CR2:
fffffffffffffff8
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.740170] ---[
end trace 100f342b5535681a ]---
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 434.740250]
Fixing recursive fault but reboot is needed!
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 494.699770]
INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 16,
t=6002 jiffies)
Jul 6 10:19:38 label5.u14.univ-nantes.prive kernel: [ 494.700039]
INFO: Stall ended before state dump start
Steps to reproduce (for me):
The volume is on my freshly re-created Ceph cluster with 8 OSD nodes
(XFS-formatted OSDs). I created an rbd volume (yd-bench) on it; the
volume is ext4-formatted and contains only a clone of the linux-stable
git tree.
Then, on a client (running nothing else Ceph-related):
modprobe rbd
rbd map yd-bench
mount
cd linux-stable
make -j24 bzImage modules
is enough to trigger the crash. The machine is 64-bit with 32 GB of RAM,
and the same build runs fine on local disk.
The kernel is vanilla 3.4.4.
Any ideas?
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@xxxxxxxxxxxxxx