RE: Possible Bug in 3.8.0rc4 kernel

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Tue, 29 Jan 2013 12:22:13 -0800

Hi Christopher,

On Tue, 2013-01-29 at 14:51 +0000, Holcombe, Christopher wrote:
> Would it help to use the iscsi backend instead of fibre channel?  I
> figure that protocol should be able to deal with a certain amount of
> latency.
> 

So I think Roland's assessment makes sense: &cmd->work has not been
initialized by the time core_tmr_abort_task() passes it to
cancel_work_sync().

However, AFAICT with my v3.8-rc2 setup (which does have a bunch of
kernel debugging enabled), calling cancel_work_sync() on an
uninitialized cmd->work does not trigger an OOPS similar to the one
you've reported.

That said, I think initializing cmd->work in transport_init_se_cmd() is
the correct thing to do here. It covers the case where the task is
aborted before the backend driver invokes target_complete_cmd(), which
is where INIT_WORK() is otherwise called.
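
To make the ordering concrete, here is a minimal userspace sketch of the
abort-before-completion case. It is only a model: every mock_* name is
hypothetical, none of this is the kernel workqueue API, and the real
cancel_work_sync() does not return an error for an uninitialized item
(the -1 below just marks the case we are worried about).

```c
/*
 * Userspace model only. All mock_* types and helpers are hypothetical
 * stand-ins for illustration; they are NOT the kernel workqueue API.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct mock_work {
	void (*func)(struct mock_work *work);	/* NULL until initialized */
	bool pending;
};

struct mock_cmd {
	struct mock_work work;
	bool state_active;
};

/* Stand-in for INIT_WORK(): give the work item a valid handler. */
static void mock_init_work(struct mock_work *w,
			   void (*fn)(struct mock_work *work))
{
	w->func = fn;
	w->pending = false;
}

/* Stand-in for the no-op handler added by the patch. */
static void mock_work_nop(struct mock_work *work)
{
	(void)work;
}

/*
 * Stand-in for the cancel step on the abort path.
 * Returns 0 if the item was initialized, -1 if it never was.
 * (Assumption: the model flags the bad case instead of crashing.)
 */
static int mock_cancel_work(struct mock_work *w)
{
	if (w->func == NULL)
		return -1;
	w->pending = false;
	return 0;
}

/* Stand-in for command setup, with and without the early init. */
static void mock_init_cmd(struct mock_cmd *cmd, bool init_work_early)
{
	memset(cmd, 0, sizeof(*cmd));
	cmd->state_active = false;
	if (init_work_early)
		mock_init_work(&cmd->work, mock_work_nop);
}

/*
 * The abort arrives before the backend has completed the command,
 * so nothing on the completion path has initialized cmd->work yet.
 */
static int mock_abort_before_completion(bool init_work_early)
{
	struct mock_cmd cmd;

	mock_init_cmd(&cmd, init_work_early);
	return mock_cancel_work(&cmd.work);
}
```

In this model, mock_abort_before_completion(false) hits the
uninitialized item, while mock_abort_before_completion(true) cancels
cleanly because the no-op handler was installed at setup time.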

Can you try reproducing with your ceph backend using the following patch
as per Roland's recommendation..?

Thanks,

--nab

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 703e46d..de6e3f6 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -1025,6 +1025,11 @@ target_cmd_size_check(struct se_cmd *cmd, unsigned int size)
 
 }
 
+static void target_work_nop(struct work_struct *work)
+{
+       return;
+}
+
 /*
  * Used by fabric modules containing a local struct se_cmd within their
  * fabric dependent per I/O descriptor.
@@ -1059,6 +1064,7 @@ void transport_init_se_cmd(
        cmd->sense_buffer = sense_buffer;
 
        cmd->state_active = false;
+       INIT_WORK(&cmd->work, target_work_nop);
 }
 EXPORT_SYMBOL(transport_init_se_cmd);
 


> -Chris
> 
> -----Original Message-----
> From: Roland Dreier [mailto:roland@xxxxxxxxxxxxxxx]
> Sent: Monday, January 28, 2013 12:15 PM
> To: Holcombe, Christopher
> Cc: target-devel@xxxxxxxxxxxxxxx
> Subject: Re: Possible Bug in 3.8.0rc4 kernel
> 
> On Mon, Jan 28, 2013 at 9:11 AM, Holcombe, Christopher <cholcomb@xxxxxxxxxxx> wrote:
> > Could latency in my ceph block device cause this?  I am not familiar with the LIO code.  When LIO is writing to the block device I notice that ceph will sometimes delay writes if it is redistributing data due to an outage.
> 
> 
> Yes, if the backend is slow, then the initiator on FC might exceed its timeout and send an abort for the SCSI command that it considers timed out.  Then processing the abort gets the LIO code onto this buggy code path.
> 
>  - R.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe target-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
