I think I tracked this bug down: the Oops is due to 'msg->bio_iter == NULL'.
If a message that carries a bio is requeued for sending (e.g. after a
connection reset), its bio_iter can be left stale from the previous
attempt; the patch below clears it so the messenger reinitializes the
iterator from the start of the chain. A sketch of the iterator helpers
is below the quoted report.

---
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index f0993af..ac16f13 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -549,6 +549,10 @@ static void prepare_write_message(struct ceph_connection *con)
 	}
 
 	m = list_first_entry(&con->out_queue, struct ceph_msg, list_head);
+#ifdef CONFIG_BLOCK
+	if (m->bio && m->bio_iter)
+		m->bio_iter = NULL;
+#endif
 	con->out_msg = m;
 
 	/* put message on sent list */

On Thu, Apr 12, 2012 at 6:30 AM, Danny Kukawka <danny.kukawka@xxxxxxxxx> wrote:
> Hi,
>
> we are currently testing Ceph with RBD on a cluster with 1 Gbit and
> 10 Gbit interfaces. While we see no kernel crashes with RBD when the
> cluster runs on the 1 Gbit interfaces, we see very frequent kernel
> crashes on the 10 Gbit network while running tests with e.g. fio
> against the RBDs.
>
> I've tested this with kernel v3.0 and also with 3.3.0 (plus the
> patches from the 'for-linus' branch of ceph-client.git at
> git.kernel.org).
>
> The more client machines run the tests, the faster the crashes occur.
> The issue is fully reproducible here.
>
> Has anyone seen similar problems? See the backtrace below.
>
> Regards
>
> Danny
>
> PID: 10902  TASK: ffff88032a9a2080  CPU: 0  COMMAND: "kworker/0:0"
>  #0 [ffff8803235fd950] machine_kexec at ffffffff810265ee
>  #1 [ffff8803235fd9a0] crash_kexec at ffffffff810a3bda
>  #2 [ffff8803235fda70] oops_end at ffffffff81444688
>  #3 [ffff8803235fda90] __bad_area_nosemaphore at ffffffff81032a35
>  #4 [ffff8803235fdb50] do_page_fault at ffffffff81446d3e
>  #5 [ffff8803235fdc50] page_fault at ffffffff81443865
>     [exception RIP: read_partial_message+816]
>     RIP: ffffffffa041e500  RSP: ffff8803235fdd00  RFLAGS: 00010246
>     RAX: 0000000000000000  RBX: 00000000000009d7  RCX: 0000000000008000
>     RDX: 0000000000000000  RSI: 00000000000009d7  RDI: ffffffff813c8d78
>     RBP: ffff880328827030  R8:  00000000000009d7  R9:  0000000000004000
>     R10: 0000000000000000  R11: ffffffff81205800  R12: 0000000000000000
>     R13: 0000000000000069  R14: ffff88032a9bc780  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #6 [ffff8803235fdd38] thread_return at ffffffff81440e82
>  #7 [ffff8803235fdd78] try_read at ffffffffa041ed58 [libceph]
>  #8 [ffff8803235fddf8] con_work at ffffffffa041fb2e [libceph]
>  #9 [ffff8803235fde28] process_one_work at ffffffff8107487c
> #10 [ffff8803235fde78] worker_thread at ffffffff8107740a
> #11 [ffff8803235fdee8] kthread at ffffffff8107b736
> #12 [ffff8803235fdf48] kernel_thread_helper at ffffffff8144c144
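
For reference, here is roughly what the bio iterator helpers in
net/ceph/messenger.c look like around v3.3 (a paraphrased sketch of the
code of that era, not a verbatim copy):

#ifdef CONFIG_BLOCK
/*
 * Point the iterator at the start of a bio chain; a NULL bio leaves
 * the iterator NULL.
 */
static void init_bio_iter(struct bio *bio, struct bio **iter, int *seg)
{
	if (!bio) {
		*iter = NULL;
		return;
	}
	*iter = bio;
	*seg = bio->bi_idx;
}

/*
 * Advance to the next segment; stepping past the last segment of the
 * last bio calls init_bio_iter(NULL, ...) and leaves *bio_iter NULL.
 */
static void iter_bio_next(struct bio **bio_iter, int *seg)
{
	if (*bio_iter == NULL)
		return;

	BUG_ON(*seg >= (*bio_iter)->bi_vcnt);

	(*seg)++;
	if (*seg == (*bio_iter)->bi_vcnt)
		init_bio_iter((*bio_iter)->bi_next, bio_iter, seg);
}
#endif	/* CONFIG_BLOCK */

The read and write paths only reinitialize the iterator when it is
still unset ("if (msg->bio && !msg->bio_iter) init_bio_iter(...)"), so
if I'm reading it right, a message that is consumed again after a
connection reset keeps walking with the stale iterator, falls off the
end of the chain (*bio_iter goes NULL), and the next
bio_iovec_idx(*bio_iter, *bio_seg) in read_partial_message_bio()
dereferences NULL. That would match the read_partial_message RIP in
Danny's backtrace.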