Re: nvmet_rdma crash - DISCONNECT event with NULL queue


 




But: I'll try this patch and run for a few hours to see what happens. I
believe that, regardless of any keep-alive issue, the above patch is still needed.

In your tests, can you enable dynamic debug on:
nvmet_start_keep_alive_timer
nvmet_stop_keep_alive_timer
nvmet_execute_keep_alive
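For reference, per-function dynamic debug can be toggled through debugfs. A sketch, assuming CONFIG_DYNAMIC_DEBUG is built in and debugfs is mounted at the usual location:

```shell
# Enable pr_debug() output for the keep-alive paths (run as root).
echo 'func nvmet_start_keep_alive_timer +p' > /sys/kernel/debug/dynamic_debug/control
echo 'func nvmet_stop_keep_alive_timer +p'  > /sys/kernel/debug/dynamic_debug/control
echo 'func nvmet_execute_keep_alive +p'     > /sys/kernel/debug/dynamic_debug/control
```

The `+p` flag turns the pr_debug() sites in those functions on; `-p` turns them off again.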

Hey Sagi. I hit another crash on the target. This was with 4.8.0 plus the patch
to skip disconnect events when cm_id->qp is NULL. This time the crash is in
_raw_spin_lock_irqsave(), called from nvmet_rdma_recv_done(). The log is too big
to include everything inline, so I'm attaching the full log.
At around 4988.169 seconds in the log we see five controllers created,
all named "controller 1", and 32 queues assigned to controller 1 five times.
Shortly after that we hit the BUG.

So I think you're creating multiple subsystems and provisioning each
subsystem differently, correct? Controller IDs are unique only within
a subsystem, so two different subsystems can each have a ctrl id of 1.
Perhaps our logging should mention the subsysnqn too?

Anyway, is there traffic going on?

The only way we can get recv_done with corrupted data is if we posted
something after the qp drain completed. Can you check whether that can happen?

Can you share your test case?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


