Re: SCSI regression in 4.11

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Thu, 02 Mar 2017 11:18:23 -0800

On March 2, 2017 11:05:05 AM PST, Stephen Hemminger <stephen@xxxxxxxxxxxxxxxxxx> wrote:
>On Thu, 02 Mar 2017 10:36:17 -0800
>James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
>> On March 2, 2017 10:23:24 AM PST, Stephen Hemminger <stephen@xxxxxxxxxxxxxxxxxx> wrote:
>> >On Thu, 2 Mar 2017 14:25:14 +0100
>> >Hannes Reinecke <hare@xxxxxxx> wrote:
>> >  
>> >> On 03/02/2017 02:40 AM, Stephen Hemminger wrote:  
>> >> > On Thu, 2 Mar 2017 01:56:15 +0100
>> >> > Christoph Hellwig <hch@xxxxxx> wrote:
>> >> >    
>> >> >> On Thu, Mar 02, 2017 at 01:01:35AM +0100, Christoph Hellwig wrote:
>> >> >>> On Wed, Mar 01, 2017 at 07:54:12AM -0800, Stephen Hemminger wrote:
>> >> >>>>> http://git.infradead.org/users/hch/block.git/commitdiff/148cff67b401e2229c076c0ea418712654be77e4
>> >> >>>>
>> >> >>>> It appears that is already in the code I am testing in linux-next...
>> >> >>>
>> >> >>> It's in -next now, but it wasn't at the time you reported the bug.
>> >> >>>
>> >> >>> And it would sort of explain the bug if the INQUIRY data is correct
>> >> >>> in the scatterlist, but we ignore it, given that scsi_probe_lun
>> >> >>> ignores the result based on sense data.
>> >> >>>
>> >> >>> Can you check what happens with the horrible hack below:
>> >> >>
>> >> >> Strike that - we're checking result later, so this can't be the case.
>> >> >>
>> >> >> Now the other interesting thing is the memset in __scsi_execute,
>> >> >> which looks very suspicious.  Try the following please:
>> >> >>
>> >> >> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> >> >> index 3e32dc954c3c..22f4fb550561 100644
>> >> >> --- a/drivers/scsi/scsi_lib.c
>> >> >> +++ b/drivers/scsi/scsi_lib.c
>> >> >> @@ -253,7 +253,8 @@ static int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
>> >> >>  	 * and prevent security leaks by zeroing out the excess data.
>> >> >>  	 */
>> >> >>  	if (unlikely(rq->resid_len > 0 && rq->resid_len <= bufflen))
>> >> >> -		memset(buffer + (bufflen - rq->resid_len), 0, rq->resid_len);
>> >> >> +//		memset(buffer + (bufflen - rq->resid_len), 0, rq->resid_len);
>> >> >> +		printk_ratelimited("%s: got resid %d\n", __func__, rq->resid_len);
>> >> >>
>> >> >>  	if (resid)
>> >> >>  		*resid = rq->resid_len;    
>> >> >
>> >> >
>> >> > Still fails but does print resid on some of the later INQUIRY commands (not the initial one).
>> >> >
>> >> Can you test what happens if you blank out the storvsc_drv workaround:
>> >> 
>> >> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
>> >> index 585e54f..c36f42d 100644
>> >> --- a/drivers/scsi/storvsc_drv.c
>> >> +++ b/drivers/scsi/storvsc_drv.c
>> >> @@ -1060,13 +1060,13 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
>> >>           * We do this so we can distinguish truly fatal failues
>> >>           * (srb status == 0x4) and off-line the device in that case.
>> >>           */
>> >> -
>> >> +#if 0
>> >>          if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
>> >>             (stor_pkt->vm_srb.cdb[0] == MODE_SENSE)) {
>> >>                  vstor_packet->vm_srb.scsi_status = 0;
>> >>                  vstor_packet->vm_srb.srb_status = SRB_STATUS_SUCCESS;
>> >>          }
>> >> -
>> >> +#endif
>> >> 
>> >>          /* Copy over the status...etc */
>> >>          stor_pkt->vm_srb.scsi_status = vstor_packet->vm_srb.scsi_status;
>> >> 
>> >> It might happen that we fail to interpret the 'Device not present'
>> >> status correctly (which will happen for non-connected DVDs), causing
>> >> the SCSI stack to make incorrect decisions later on.
>> >> 
>> >> Cheers,
>> >> 
>> >> Hannes  
>> >
>> >There are several oddities about the host SCSI interface that I see:
>> > 1. The host bus seems to report up to 6 devices even though only 2 are
>> >     present (Disk and CDROM).
>> > 2. The CDROM emulation doesn't report the same status as a real device.
>> > 3. The host emulation of SCSI doesn't support all the page codes, which
>> >     is why there is the hack.
>> >
>> >But as James said, these don't appear to be related to the failure
>> >because the code worked before and only in the post 4.11 merge is there
>> >a problem.
>> 
>> Your wait for the hang trace is the most suggestive.   It says we're
>> waiting for a partition read to the spurious device.  Previously this
>> would have failed or timed out, so this seems to be the root cause.
>> 
>> James
>> 
>> 
>
>Where is the number of valid LUN's determined during the scan process?

Depends.  If you can do a REPORT LUNS scan then that's definitive.  You seem to be probing (scsi_probe_and_add_lun) and you make us think there's something there by responding wrongly to the initial INQUIRY.

James
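
For readers following along, a rough sketch of why a wrong INQUIRY answer produces spurious devices during a probe-style scan. This is a simplified model, not the kernel's scsi_scan.c code: the names classify_inquiry(), sequential_scan() and the enum are hypothetical, and the only thing taken from the SCSI spec is that byte 0 of standard INQUIRY data carries the peripheral qualifier in bits 7..5 and the peripheral device type in bits 4..0. The assumption is that a LUN counts as absent when the command fails, when the qualifier is 011b, or when the qualifier is 001b with device type 1Fh.

```c
#include <stddef.h>

/* Hypothetical result codes for one probed LUN (not kernel definitions). */
enum probe_result { LUN_NOT_PRESENT = 0, LUN_PRESENT = 1 };

/* Peripheral qualifier: bits 7..5 of standard INQUIRY byte 0. */
static unsigned peripheral_qualifier(unsigned char inq_byte0)
{
	return (inq_byte0 >> 5) & 0x7;
}

/* Peripheral device type: bits 4..0 of standard INQUIRY byte 0. */
static unsigned peripheral_type(unsigned char inq_byte0)
{
	return inq_byte0 & 0x1f;
}

/*
 * Decide whether a probed LUN looks populated.
 * cmd_ok is non-zero when the INQUIRY command itself succeeded.
 */
static enum probe_result classify_inquiry(int cmd_ok, unsigned char inq_byte0)
{
	unsigned pq = peripheral_qualifier(inq_byte0);

	if (!cmd_ok)
		return LUN_NOT_PRESENT;
	/* 011b: the target cannot support a device on this LUN. */
	if (pq == 3)
		return LUN_NOT_PRESENT;
	/* 001b with type 1Fh: nothing is connected to this LUN. */
	if (pq == 1 && peripheral_type(inq_byte0) == 0x1f)
		return LUN_NOT_PRESENT;
	return LUN_PRESENT;
}

/* Probe n LUNs one by one and count how many would be attached. */
static int sequential_scan(const int *cmd_ok, const unsigned char *inq_byte0,
			   size_t n)
{
	int found = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (classify_inquiry(cmd_ok[i], inq_byte0[i]) == LUN_PRESENT)
			found++;
	return found;
}
```

In this model, a host that answers every INQUIRY successfully with qualifier 000b gets every probed LUN attached, whether or not anything is behind it. Answering the empty LUNs with 7Fh (qualifier 011b, type 1Fh) or failing the command would leave only the real disk and CDROM.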

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.