On Apr 10, 2012, at 1:16 AM, Bart Van Assche wrote:

> On 04/10/12 01:22, Elric Fu wrote:
>
>> After debugging the code, I found the issue happened while the driver ran to
>> line 782 in scsi_send_eh_cmnd().
>>
>> 778 static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
>> 779                     int cmnd_size, int timeout, unsigned sense_bytes)
>> 780 {
>> 781         struct scsi_device *sdev = scmd->device;
>> 782         struct scsi_driver *sdrv = scsi_cmd_to_driver(scmd);
>> 783         struct Scsi_Host *shost = sdev->host;
>> 784         DECLARE_COMPLETION_ONSTACK(done);
>> 785         unsigned long timeleft;
>> 786         struct scsi_eh_save ses;
>> 787         int rtn;
>>
>> I know the code is submitted by you. I don't familiar with the scsi core.
>> It seems like the conversion process from scsi command to scsi driver
>> encounter a NULL pointer. Any idea?
>
> I have observed crashes at the same point while testing device removal
> with the ib_srp driver. As far as I can see that code was added through
> commit 18a4d0a22ed6c54b67af7718c305cd010f09ddf8 (February 9, 2012). The
> approach of that patch looks questionable to me: what guarantees that
> the struct scsi_driver will be available at the time the SCSI error
> handler needs it ? At least the sd driver explicitly resets that pointer
> in its scsi_disk_release() function.

I am looking into a similar crash with FCoE, though in my case it is the
private_data field that is NULL instead of rq_disk. The backtraces are very
much like what has been reported here. I will try adding some NULL checks
similar to what has been proposed on the list, but until I know more than I
do now, I won't let myself believe that NULL checks are the real fix for
this issue.

--
Mark Rustad, LAN Access Division, Intel Corporation
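
For reference, the two failure points discussed above (a NULL rq_disk versus a NULL private_data) can be illustrated with a small userspace sketch. The structures below are simplified stand-ins, not the kernel definitions, and cmd_to_driver_checked() is a hypothetical name. The sketch assumes the lookup walks request -> rq_disk -> private_data and that private_data points at an object whose first member is the driver pointer. It only shows where a guard would sit; it says nothing about whether the underlying lifetime problem is solved.

```c
#include <stddef.h>

/* Simplified userspace stand-ins; NOT the real kernel structures. */
struct scsi_driver { const char *name; };
struct scsi_disk   { struct scsi_driver *driver; }; /* driver pointer assumed first */
struct gendisk     { void *private_data; };
struct request     { struct gendisk *rq_disk; };
struct scsi_cmnd   { struct request *request; };

/*
 * Hypothetical guarded lookup: returns NULL instead of dereferencing a
 * NULL link anywhere in the request -> rq_disk -> private_data chain.
 */
static struct scsi_driver *cmd_to_driver_checked(const struct scsi_cmnd *cmd)
{
	if (!cmd || !cmd->request)
		return NULL;
	if (!cmd->request->rq_disk)               /* rq_disk already cleared */
		return NULL;
	if (!cmd->request->rq_disk->private_data) /* private_data already reset */
		return NULL;

	/* private_data is treated as a pointer to a leading driver pointer. */
	return *(struct scsi_driver **)cmd->request->rq_disk->private_data;
}
```

A caller in this sketch would have to treat a NULL return as "no driver available" and skip any driver-specific step. A check like this does not close the window in which the pointer is reset after the check but before it is used.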