On Thu, 02 Mar 2017 11:18:23 -0800 James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:

> On March 2, 2017 11:05:05 AM PST, Stephen Hemminger <stephen@xxxxxxxxxxxxxxxxxx> wrote:
> >On Thu, 02 Mar 2017 10:36:17 -0800
> >James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> On March 2, 2017 10:23:24 AM PST, Stephen Hemminger <stephen@xxxxxxxxxxxxxxxxxx> wrote:
> >> >On Thu, 2 Mar 2017 14:25:14 +0100
> >> >Hannes Reinecke <hare@xxxxxxx> wrote:
> >> >
> >> >> On 03/02/2017 02:40 AM, Stephen Hemminger wrote:
> >> >> > On Thu, 2 Mar 2017 01:56:15 +0100
> >> >> > Christoph Hellwig <hch@xxxxxx> wrote:
> >> >> >
> >> >> >> On Thu, Mar 02, 2017 at 01:01:35AM +0100, Christoph Hellwig wrote:
> >> >> >>> On Wed, Mar 01, 2017 at 07:54:12AM -0800, Stephen Hemminger wrote:
> >> >> >>>>> http://git.infradead.org/users/hch/block.git/commitdiff/148cff67b401e2229c076c0ea418712654be77e4
> >> >> >>>>
> >> >> >>>> It appears that is already in the code I am testing in linux-next...
> >> >> >>>
> >> >> >>> It's in -next now, but it wasn't at the time you reported the bug.
> >> >> >>>
> >> >> >>> And it would sort of explain the bug if the INQUIRY data is correct
> >> >> >>> in the scatterlist, but we ignore it, given that scsi_probe_lun
> >> >> >>> ignores the result based on sense data.
> >> >> >>>
> >> >> >>> Can you check what happens with the horrible hack below:
> >> >> >>
> >> >> >> Strike that - we're checking result later, so this can't be the case.
> >> >> >>
> >> >> >> Now the other interesting thing is the memset in __scsi_execute,
> >> >> >> which looks very suspicious.
> >> >> >> Try the following please:
> >> >> >>
> >> >> >> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> >> >> >> index 3e32dc954c3c..22f4fb550561 100644
> >> >> >> --- a/drivers/scsi/scsi_lib.c
> >> >> >> +++ b/drivers/scsi/scsi_lib.c
> >> >> >> @@ -253,7 +253,8 @@ static int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
> >> >> >>           * and prevent security leaks by zeroing out the excess data.
> >> >> >>           */
> >> >> >>          if (unlikely(rq->resid_len > 0 && rq->resid_len <= bufflen))
> >> >> >> -                memset(buffer + (bufflen - rq->resid_len), 0, rq->resid_len);
> >> >> >> +//              memset(buffer + (bufflen - rq->resid_len), 0, rq->resid_len);
> >> >> >> +                printk_ratelimited("%s: got resid %d\n", __func__, rq->resid_len);
> >> >> >>
> >> >> >>          if (resid)
> >> >> >>                  *resid = rq->resid_len;
> >> >> >
> >> >> > Still fails but does print resid on some of the later INQUIRY
> >> >> > commands (not the initial one).
> >> >> >
> >> >> Can you test what happens if you blank out the storvsc_drv workaround:
> >> >>
> >> >> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> >> >> index 585e54f..c36f42d 100644
> >> >> --- a/drivers/scsi/storvsc_drv.c
> >> >> +++ b/drivers/scsi/storvsc_drv.c
> >> >> @@ -1060,13 +1060,13 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
> >> >>           * We do this so we can distinguish truly fatal failues
> >> >>           * (srb status == 0x4) and off-line the device in that case.
> >> >>           */
> >> >> -
> >> >> +#if 0
> >> >>          if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
> >> >>             (stor_pkt->vm_srb.cdb[0] == MODE_SENSE)) {
> >> >>                  vstor_packet->vm_srb.scsi_status = 0;
> >> >>                  vstor_packet->vm_srb.srb_status = SRB_STATUS_SUCCESS;
> >> >>          }
> >> >> -
> >> >> +#endif
> >> >>
> >> >>          /* Copy over the status...etc */
> >> >>          stor_pkt->vm_srb.scsi_status = vstor_packet->vm_srb.scsi_status;
> >> >>
> >> >> It might happen that we fail to interpret the 'Device not present'
> >> >> status correctly (which will happen for non-connected DVDs), causing
> >> >> the SCSI stack to make incorrect decisions later on.
> >> >>
> >> >> Cheers,
> >> >>
> >> >> Hannes
> >> >
> >> >There are several oddities about the host SCSI interface that I see:
> >> > 1. The host bus seems to report up to 6 devices even though only 2
> >> >    are present (Disk and CDROM).
> >> > 2. The CDROM emulation doesn't report the same status as a real device.
> >> > 3. The host emulation of SCSI doesn't support all the page codes,
> >> >    which is why there is the hack.
> >> >
> >> >But as James said, these don't appear to be related to the failure,
> >> >because the code worked before and only in the post-4.11 merge is
> >> >there a problem.
> >>
> >> Your wait for the hang trace is the most suggestive. It says we're
> >> waiting for a partition read to the spurious device. Previously this
> >> would have failed or timed out, so this seems to be the root cause.
> >>
> >> James
> >>
> >
> >Where is the number of valid LUNs determined during the scan process?
>
> Depends. If you can do a report lun scan then that's definitive. You
> seem to be probing (scsi_probe_and_add_lun) and you make us think
> there's something there by responding wrongly to the initial inquiry.

Testing a fix now. There look to be 3 problems here:
 1. storvsc_on_io_completion masks all error responses from INQUIRY.
 2. Error handling in storvsc does not report invalid LUN correctly.
 3. Block layer has new problems when the device is in a bad state
    (not present and timing out).

The first two have been there for 4 years but did not cause problems.
Something happened that made the kernel chew up lots of resources and
eventually die when it hits a disconnected device that is not detected
properly.
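
To make problem 1 concrete, here is a small userspace model in C. It is only a sketch: the function names (status_masked, status_narrowed, lun_looks_present) and the constants are hypothetical stand-ins rather than storvsc code, the numeric values are assumed for illustration, and the narrowed variant is a guess at one possible shape of a fix, not the fix being tested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for SCSI opcodes and host SRB status codes.
 * The numeric values are assumed for illustration only. */
enum { OP_INQUIRY = 0x12, OP_MODE_SENSE = 0x1a, OP_READ_10 = 0x28 };
enum { SRB_OK = 0x01, SRB_ERROR = 0x04, SRB_INVALID_LUN = 0x20 };

/* Models the workaround quoted above: every INQUIRY or MODE_SENSE
 * completion is rewritten to success, whatever the host reported. */
uint8_t status_masked(uint8_t opcode, uint8_t srb_status)
{
    if (opcode == OP_INQUIRY || opcode == OP_MODE_SENSE)
        return SRB_OK;
    return srb_status;
}

/* One possible narrower variant (an assumption, not the real fix):
 * keep hiding generic errors for these opcodes, but pass an
 * invalid-LUN status through so a scan can skip the device. */
uint8_t status_narrowed(uint8_t opcode, uint8_t srb_status)
{
    if (srb_status == SRB_INVALID_LUN)
        return srb_status;
    return status_masked(opcode, srb_status);
}

/* In this model a probe treats a LUN as present when the INQUIRY
 * completion it sees reports success. */
bool lun_looks_present(uint8_t status)
{
    return status == SRB_OK;
}
```

In this model, an INQUIRY to a LUN the host flags as invalid still looks successful through status_masked, so a probing scan would add a device that is not there; status_narrowed lets that one status reach the caller while leaving the other masking behavior unchanged.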