[Bug 187231] New: kernel panic during hpsa MSI plus tg3 MSI

bugzilla-daemon@xxxxxxxxxxxxxxxxxxx · Mon, 07 Nov 2016 13:53:06 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=187231

            Bug ID: 187231
           Summary: kernel panic during hpsa MSI plus tg3 MSI
           Product: IO/Storage
           Version: 2.5
    Kernel Version: 4.8.6
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: SCSI
          Assignee: linux-scsi@xxxxxxxxxxxxxxx
          Reporter: kernelorg@xxxxxx
        Regression: No

Created attachment 243801
  --> https://bugzilla.kernel.org/attachment.cgi?id=243801&action=edit
kernel 4.8.6 .config

I'm not sure whether this is a SCSI / HPSA bug or a networking / tg3 driver
bug. Both are seen in the stack dump. As the trigger seems to be HPSA I'm
reporting as a SCSI issue here...

I have been trying to run mainline 4.8.x kernels, most recently 4.8.6, on our
production HP DL 380 Intel servers.

Several of them show a related issue, reported in
https://bugzilla.kernel.org/show_bug.cgi?id=187221, where the HPSA driver
sometimes resets the logical device. I had already seen that with 4.4.x
kernels, and see it again with 4.8.6.

Now, specifically with 4.8.6, on the box which has the worst of these symptoms,
I _additionally_ experienced multiple full kernel panics. The same box (with
the same hpsa reset symptoms) had been running 4.4.x kernels before without any
such panics. With 4.8.6 the panics happened multiple times, about a day apart.

On the last round I had the ILO SSH console running under screen with logging
enabled, and was able to retrieve the following panic backtrace:

[187283.903173] hpsa 0000:03:00.0: scsi 0:1:0:0: resetting logical Direct-Access     HP       LOGICAL VOLUME   RAID-5 SSDSmartPathCap- En- Exp=1
[187314.331375] sd 0:1:0:0: rejecting I/O to offline device                     
[187314.413441] sd 0:1:0:0: rejecting I/O to offline device                     
[187314.854183] sd 0:1:0:0: rejecting I/O to offline device                     
... lots of these ...
[187328.991285] sd 0:1:0:0: rejecting I/O to offline device                     
[187328.991389] sd 0:1:0:0: rejecting I/O to offline device                     
[187329.190166] sd 0:1:0:0: rejecting I/O to offline device                     
[187329.271304]  ffff88bd1a7e8000 ffff88bd1a7be500 ffff88bd7f483eb8 ffffffff8143493f
[187329.271304] Call Trace:                                                     
[187329.271310]  <IRQ>                                                          
[187329.271310]  [<ffffffffa002e332>] ? tg3_poll_msix+0xc2/0x160 [tg3]          
[187329.271311]  [<ffffffff8143493f>] do_hpsa_intr_msi+0x8f/0x1c0               
[187329.271314]  [<ffffffff81148c46>] __handle_irq_event_percpu+0x66/0xe0       
[187329.271315]  [<ffffffff81148cde>] handle_irq_event_percpu+0x1e/0x50         
[187329.271316]  [<ffffffff81148d37>] handle_irq_event+0x27/0x50                
[187329.271318]  [<ffffffff8114bda5>] handle_edge_irq+0x65/0x140                
[187329.271320]  [<ffffffff81057255>] handle_irq+0x15/0x20                      
[187329.271321]  [<ffffffff81057086>] do_IRQ+0x46/0xd0                          
[187329.271324]  [<ffffffff816dc4fc>] common_interrupt+0x7c/0x7c                
[187329.271325]  <EOI>                                                          
[187329.271338] Code: 53 48 89 fb 48 83 ec 28 4c 8b a7 5c 02 00 00 4c 8b bf 40 02 00 00 4c 8b b7 38 02 00 00 4c 8b af 4c 02 00 00 49 8b 04 24 4c 89 e7 <48> 8b 80 98 00 00 00 48 89 45 c0 49 8b 87 d0 01 00 00 48 89 45
[187329.271339] RIP  [<ffffffff81431417>] complete_scsi_command+0x37/0x8c0      
[187329.271339]  RSP <ffff88bd7f483e38>                                         
[187329.271339] CR2: 0000000000000098                                           
[187329.271341] ---[ end trace 52898916f0da5c53 ]---                            
[187329.273413] Kernel panic - not syncing: Fatal exception in interrupt        
[187330.308465] Shutting down cpus with NMI                                     
[187330.308471] Kernel Offset: disabled                                         
[187330.919173] Rebooting in 300 seconds..  
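
As a rough cross-check of the oops above, the faulting instruction (the byte
marked <..> in the Code: line) can be compared against CR2. The Python sketch
below is only an illustration: the helper names are made up for this example,
the Code: line is pasted with the console line wraps joined, and the decoder
handles just the one x86-64 encoding seen here (REX.W MOV with a 32-bit
displacement and no SIB byte).

```python
# Oops lines copied from the trace above (console wraps in Code: joined).
CODE_LINE = (
    "Code: 53 48 89 fb 48 83 ec 28 4c 8b a7 5c 02 00 00 4c 8b bf 40 02 00 00 "
    "4c 8b b7 38 02 00 00 4c 8b af 4c 02 00 00 49 8b 04 24 4c 89 e7 <48> 8b "
    "80 98 00 00 00 48 89 45 c0 49 8b 87 d0 01 00 00 48 89 45"
)
CR2_LINE = "CR2: 0000000000000098"

# Register numbering for the ModRM r/m field when no REX.B bit is set.
REGS = ["rax", "rcx", "rdx", "rbx", "(SIB)", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the bytes from the <..>-marked faulting byte onwards."""
    tokens = code_line.split()[1:]  # drop the leading "Code:" label
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


def decode_mov_disp32(insn):
    """Decode only 'REX.W MOV r64, [base + disp32]'; return (base, disp)."""
    if len(insn) < 7 or insn[0] != 0x48 or insn[1] != 0x8B:
        return None
    modrm = insn[2]
    if modrm >> 6 != 0b10 or modrm & 7 == 0b100:  # need mod=10 and no SIB
        return None
    disp = int.from_bytes(bytes(insn[3:7]), "little", signed=True)
    return REGS[modrm & 7], disp


insn = faulting_bytes(CODE_LINE)
base, disp = decode_mov_disp32(insn)
cr2 = int(CR2_LINE.split()[1], 16)

# Faulting address = base register + displacement, so the base held cr2 - disp.
base_value = cr2 - disp
print(f"mov {disp:#x}(%{base}) faulted at {cr2:#x}; %{base} was {base_value:#x}")
```

If that decoding is right, the displacement equals the CR2 value, which would
mean the base register held 0 when complete_scsi_command+0x37 executed, i.e. a
NULL pointer dereference at offset 0x98. Which structure member that
corresponds to would still have to be confirmed against the 4.8.6 hpsa source.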

I'll attach my kernel .config.

As this is a production system and so far the panics only hit with our usual
(webserver and DB kvm machine) production load active, there's not much testing
or bisecting I can do, but I didn't want to drop the issue unreported, either. 

Hope this helps somebody. If there is any more info I can provide, just ask
what would be useful.

(I'm back to running 4.4.x)
