Re: QLogicPTI Hangs on SPARC64

Julian Calaby <julian.calaby@xxxxxxxxx> · Mon, 8 Aug 2016 19:15:29 +1000

Hi Alex,

On Mon, Aug 8, 2016 at 2:23 PM, Alex McWhirter <alexmcwhirter@xxxxxxxxxx> wrote:
> This is a bug I've been playing with for a while now, but I think I've
> narrowed it down about as far as I can without additional help. I
> believe we are hitting this bit of code, as I have seen the message
> before in previous panics.

You might get more help from the linux-scsi list (CC'd).

>
> toss_command:
>         printk(KERN_EMERG "qlogicpti%d: request queue overflow\n",
>                qpti->qpti_id);
>
>         /* Unfortunately, unless you use the new EH code, which
>          * we don't, the midlayer will ignore the return value,
>          * which is insane.  We pick up the pieces like this.
>          */
>         Cmnd->result = DID_BUS_BUSY;
>         done(Cmnd);
>         return 1;
>
> Correct me if I'm wrong, but I don't see how we're picking up any pieces
> here. Something went wrong, SCSI requests built up to an
> unmanageable point, and we just say the bus is busy? Granted, I'm not
> really sure how you would pick up any pieces in that case unless you
> set the bus busy before the queue overflowed and just tried to wait
> it out.
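
The "set the bus busy before the queue overflows and wait it out" idea
could be sketched roughly as below. This is not the qlogicpti code: the
ring, its depth (RING_SLOTS), and every function name are hypothetical,
chosen only to show a submit path that refuses work when full instead of
overwriting a slot. In a real driver the busy case would presumably map
to returning SCSI_MLQUEUE_HOST_BUSY from queuecommand so the midlayer
retries later; whether that fits this driver is an untested assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8            /* hypothetical queue depth */

enum submit_status { SUBMIT_OK = 0, SUBMIT_BUSY = 1 };

struct req_ring {
    int    slots[RING_SLOTS];   /* stand-in for queued command tags */
    size_t in;                  /* next slot to fill */
    size_t out;                 /* next slot to complete */
    size_t count;               /* commands currently outstanding */
};

static void ring_init(struct req_ring *r)
{
    r->in = r->out = r->count = 0;
}

static bool ring_full(const struct req_ring *r)
{
    return r->count == RING_SLOTS;
}

/* Refuse new work when the ring is full instead of overwriting a slot. */
static enum submit_status ring_submit(struct req_ring *r, int tag)
{
    assert(r->count <= RING_SLOTS);
    if (ring_full(r))
        return SUBMIT_BUSY;     /* caller should hold the command and retry */
    r->slots[r->in] = tag;
    r->in = (r->in + 1) % RING_SLOTS;
    r->count++;
    return SUBMIT_OK;
}

/* Complete the oldest outstanding command; returns its tag, or -1 if idle. */
static int ring_complete(struct req_ring *r)
{
    int tag;

    if (r->count == 0)
        return -1;
    tag = r->slots[r->out];
    r->out = (r->out + 1) % RING_SLOTS;
    r->count--;
    return tag;
}
```

With this shape a stalled disk still fills the ring, but the submitter
only ever sees SUBMIT_BUSY; nothing is written past the last slot, and
submissions succeed again as soon as one completion frees a slot.
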
>
> Take a look at the iostat information below. iostat was configured to
> refresh every second; the bottom was cut off during the panic.
>
> iostat log > http://pastebin.com/ea96AucT
>
> From this you can see that sdc was the first drive to stop responding.
> Its r/s and w/s drop to zero but the util% stays at 100. Shortly
> after, the request queue overflows and sets the whole bus to busy, which
> can be seen in the last portion of the log (which didn't finish as the
> system panicked). All of the disks on that bus subsequently
> followed suit with sdc because the bus is essentially screeching to a
> halt.
>
> Below you will find the kernel panic.
>
> kernel panic > http://pastebin.com/n9agfz1z
>
> Again, correct me if I'm wrong, but it would seem that any pointers
> pointing towards the request queue are now invalid as the queue has
> overflowed.
>
> Below is just some conjecture on my part.
>
> Should the correct behaviour here not be to fail the disk that is
> holding up the rest of the bus? From what I see, it is quite likely sdc
> is bad so I will be replacing it; however, having the whole system panic
> because of a bad disk seems counterintuitive. I realise this is quite
> an old driver, and may have been written before we had ways of dealing
> with these types of issues. Or perhaps it's even a hardware
> limitation that prevents us from pinpointing what is actually no longer
> responding on the bus?
> --
> To unsubscribe from this list: send the line "unsubscribe sparclinux" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Julian Calaby

Email: julian.calaby@xxxxxxxxx
Profile: http://www.google.com/profiles/julian.calaby/