On 07/16/2015 08:55 PM, Kevin Groeneveld wrote:
>> -----Original Message-----
>> From: Hannes Reinecke [mailto:hare@xxxxxxx]
>> Sent: July-16-15 7:11 AM
>>> When the hang occurs shost->host_busy == 2 and shost->host_failed == 1
>>> in the scsi_eh_wakeup function. However this function only wakes the
>>> error handler if host_busy == host_failed.
>>>
>> Which just means that one command is still outstanding, and we need to wait
>> for it to complete.
>> But see below...
>
> So the root cause of the hang is maybe that the second command never
> completes? Maybe host_failed being non zero is blocking something in the
> port multiplier code?
>
>> Hmm.
>> I am really not sure about this.
>
> I wasn't sure either, that is one reason why I posted the patch.
>
>> 'host_busy' indicates the number of outstanding commands, and
>> 'host_failed' is the number of commands which have failed (on the ground
>> that failed commands are considered outstanding, too).
>>
>> So the first hunk would change the behaviour from 'start SCSI EH once all
>> commands are completed or failed' to 'start SCSI EH for _any_ command if
>> scsi_eh_wakeup is called'
>> (note that shost_failed might be '0'...).
>> Which doesn't sound right.
>
> So could the patch create any problems by starting the EH any time
> scsi_eh_wakeup is called? Or is it just inefficient?
>
SCSI EH _relies_ on the fact that no other commands are outstanding on
that SCSI host, hence the contents of eh_entry list won't change.
Your patch breaks this assumption, causing some I/O to be lost.

>> I guess this needs further debugging to get to the bottom of it.
>
> Any suggestions on things I could try?
>
> The fact that the problem goes away when I only enable one CPU core makes
> me think there is a race happening somewhere.
>
Not sure here.
However, you are effectively creating an endless loop with your testcase,
assuming that 'ioctl' finishes all I/O before returning.
Which _actually_ is not a requirement; the I/O itself needs to be
finished by the time the ioctl returns (obviously), but the _structures_
associated with the ioctl might linger on a bit longer (delayed freeing
and whatnot).
Yet this is a bit far-fetched, and definitely needs some more analysis.

For debugging I would suggest looking at the lifetime of each SCSI
command, figuring out whether the SCSI command is indeed freed up by the
time the ioctl returns.

Also you might want to play around with the 'usleep' a bit; my
assumption is that at some point, for a large enough wait, the problem
goes away.
(And, incidentally, we might actually be getting more than one pending
command if the sleep is small enough; but this is just conjecture :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   zSeries & Storage
hare@xxxxxxx                          +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
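
As a rough illustration of the wake-up condition discussed in this
thread, here is a minimal user-space sketch. It is not kernel code:
'struct host_model', 'eh_should_wake' and 'complete_one' are
hypothetical names made up for this example, and only the comparison
host_busy == host_failed is taken from the discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; NOT the kernel's struct Scsi_Host. */
struct host_model {
	unsigned int host_busy;   /* outstanding commands, failed ones included */
	unsigned int host_failed; /* commands that have failed */
};

/*
 * Models only the comparison described in the thread: the error handler
 * is woken once every outstanding command is a failed one, i.e. nothing
 * is still in flight.
 */
static bool eh_should_wake(const struct host_model *h)
{
	/* Failed commands count as outstanding, so failed <= busy. */
	assert(h->host_failed <= h->host_busy);
	return h->host_busy == h->host_failed;
}

/* Models the normal completion of one command that has not failed. */
static void complete_one(struct host_model *h)
{
	assert(h->host_busy > h->host_failed);
	h->host_busy--;
}
```

With host_busy == 2 and host_failed == 1 (the values reported at the
time of the hang) the model does not wake the error handler; it does so
only after the remaining command completes. If that command never
completes, the handler is never woken.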