RE: [RFC] Megaraid update, submission

Hi Andre,
On Tuesday, May 16, 2006 4:47 PM, Andre Hedrick wrote:
> Let's move on to the list management issues where timeouts on ioctl calls
> have produced NULL pointers when one performs an add vs. move to transfer
> ownership of a given scb between pools.
> 
> Fixing the list management may mean the pci_master_abort is not needed.
If this issue still exists on the 2.6.17-rc1 kernel, I will definitely work on it.
From my best estimate, the _NULL pointer_ issue should not be there with the patch.
Please let me know if you still see the issue.
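
For anyone following the add-versus-move point, here is a minimal user-space sketch in C of why adding an entry that is still linked on another list is a problem. It is not the megaraid code: the list helpers are simplified stand-ins for the kernel's <linux/list.h> (no pointer poisoning, no locking), and the pool names and the two demo functions are hypothetical, made up for illustration.

```c
#include <stdio.h>

/* Simplified stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert n right after head. Assumes n is NOT on any list already. */
static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink e from whatever list it is on. */
static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Unlink e from its current list, then insert it after head. */
static void list_move(struct list_head *e, struct list_head *head)
{
	list_del(e);
	list_add(e, head);
}

/* Walk forward for at most limit steps; 1 if the walk gets back to head. */
static int walk_returns_to_head(const struct list_head *head, int limit)
{
	const struct list_head *p = head->next;
	int steps;

	for (steps = 0; steps < limit; steps++) {
		if (p == head)
			return 1;
		p = p->next;
	}
	return 0;
}

/* Hypothetical demo: "transfer" an scb with list_add only (no unlink). */
int add_leaves_stale_link(void)
{
	struct list_head pending, free_pool, scb;

	init_list_head(&pending);
	init_list_head(&free_pool);
	list_add(&scb, &pending);
	list_add(&scb, &free_pool);	/* scb was never unlinked from pending */

	/* pending still points at scb, and a walk from it never terminates */
	return pending.next == &scb && !walk_returns_to_head(&pending, 16);
}

/* Hypothetical demo: transfer the same scb with list_move. */
int move_keeps_pools_consistent(void)
{
	struct list_head pending, free_pool, scb;

	init_list_head(&pending);
	init_list_head(&free_pool);
	list_add(&scb, &pending);
	list_move(&scb, &free_pool);

	return pending.next == &pending && free_pool.next == &scb &&
	       walk_returns_to_head(&pending, 16) &&
	       walk_returns_to_head(&free_pool, 16);
}
```

Both demo functions return 1: after a bare list_add the old pool still references the entry and no longer forms a closed ring, while list_move leaves both pools consistent. Whether this is the sequence that occurs on the ioctl timeout path is not established by the sketch.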

Thank you very much for your contribution to the driver's stability.

Regards, 

> -----Original Message-----
> From: Andre Hedrick [mailto:andre@xxxxxxxxxxxxx] 
> Sent: Tuesday, May 16, 2006 4:47 PM
> To: Ju, Seokmann
> Cc: linux-scsi@xxxxxxxxxxxxxxx; Andrew Morton; James Bottomley; Christoph Hellwig; Mukker, Atul
> Subject: RE: [RFC] Megaraid update, submission
> 
> 
> Warning OOPS in message, ignore if you hate reading pasted OOPS's
> 
> Seokmann,
> 
> So there should be no (sane) heroic attempts to recover the card state?
> Please look and see that the path is only retried and follows the original
> operational path which resulted in setting the 'raid_dev->hw_error' flag.
> If I am reading the code correctly, the *->quiescent flag controls command
> submission to the card.  Thus all commands submitted to the firmware are
> owned by the card, and should be allowed to complete the IO's regardless?
> With as many as 20 requests outstanding (max I have seen to date),
> termination of the transactions surely blows apart any filesystem, as I
> have had filesystems and in several cases attached arrays just vaporize if
> forced to reboot when 'hw_error' is set.
> 
> So since the pci_master_abort for the card is being rejected ...
> 
> Let's move on to the list management issues where timeouts on ioctl calls
> have produced NULL pointers when one performs an add vs. move to transfer
> ownership of a given scb between pools.
> 
> Fixing the list management may mean the pci_master_abort is not needed.
> 
> The NULL pointer:
> 
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464723 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464723:40[255:0], fw owner
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464744 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464744:12[255:0], fw owner
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464745 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464745:23[255:0], fw owner
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464746 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464746:0[255:0], fw owner
> Mar 29 00:09:53 5000 kernel: megaraid: aborting-464747 cmd=2a <c=1 t=0 l=0>
> Mar 29 00:09:53 5000 kernel: megaraid abort: 464747[255:0], driver owner
> Mar 29 00:09:53 5000 kernel: megaraid: reseting the host...
> Mar 29 00:09:53 5000 kernel: megaraid: 464723:128[65535:65535], reset from pending list
> Mar 29 00:09:53 5000 kernel: megaraid: 4 outstanding commands. Max wait 180 sec
> Mar 29 00:09:53 5000 kernel: megaraid mbox: Wait for 4 commands to complete:180
> ...
> Mar 29 00:11:54 5000 kernel: megaraid mbox: Wait for 4 commands to complete:60
> Mar 29 00:11:59 5000 kernel: megaraid mbox: Wait for 4 commands to complete:55
> Mar 29 00:12:04 5000 kernel: megaraid mbox: Wait for 4 commands to complete:50
> Mar 29 00:12:08 5000 kernel: megaraid mbox: reset sequence completed sucessfully
> Mar 29 00:12:08 5000 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
> Mar 29 00:12:08 5000 kernel:  printing eip:
> Mar 29 00:12:08 5000 kernel: f881f739
> Mar 29 00:12:08 5000 kernel: *pde = 00000000
> Mar 29 00:12:08 5000 kernel: Oops: 0002 [#1]
> Mar 29 00:12:08 5000 kernel: SMP
> Mar 29 00:12:08 5000 kernel: Modules linked in: xfs md5 ipv6 af_packet button thermal processor fan ac battery tsdev joydev evdev usbkbd usbhid e1000 intel_agp agpgart ehci_hcd uhci_hcd usbcore rtc ext3 jbd sd_mod megaraid_mbox megaraid_mm ata_piix libata scsi_mod
> Mar 29 00:12:08 5000 kernel: CPU:    0
> Mar 29 00:12:08 5000 kernel: EIP:    0060:[pg0+943802169/1069495296] Tainted: P      VLI
> Mar 29 00:12:08 5000 kernel: EIP:    0060:[<f881f739>]    Tainted: P VLI
> Mar 29 00:12:08 5000 kernel: EFLAGS: 00010046   (2.6.10)
> Mar 29 00:12:08 5000 kernel: EIP is at megaraid_mbox_build_cmd+0x979/0xce0 [megaraid_mbox]
> Mar 29 00:12:08 5000 kernel: eax: 00000000   ebx: 00000000   ecx: 0000000d edx: 79473000
> Mar 29 00:12:08 5000 kernel: esi: c238f780   edi: c23af800   ebp: f7491f10 esp: f7491e98
> Mar 29 00:12:09 5000 kernel: ds: 007b   es: 007b   ss: 0068
> Mar 29 00:12:09 5000 kernel: Process scsi_eh_1 (pid: 885, threadinfo=f7490000 task=f7dde020)
> Mar 29 00:12:09 5000 kernel: Stack: c23e3c00 f7de3000 f7491ebc f66fc2a0 c23e3c00 0000000d c226a42c f7436038
> Mar 29 00:12:09 5000 kernel:        f7436030 f7491ee8 c23b1010 f7491ed0 011d2df4 c226aa34 c226aa2c c226a42c
> Mar 29 00:12:09 5000 kernel:        00000000 000000ff c2268000 6e616373 676e696e 00000000 00000086 70696b73
> Mar 29 00:12:09 5000 kernel: Call Trace:
> Mar 29 00:12:09 5000 kernel:  [show_stack+171/192] show_stack+0xab/0xc0
> Mar 29 00:12:09 5000 kernel:  [<c0103e9b>] show_stack+0xab/0xc0
> Mar 29 00:12:09 5000 kernel:  [show_registers+351/464] show_registers+0x15f/0x1d0
> Mar 29 00:12:09 5000 kernel:  [<c010402f>] show_registers+0x15f/0x1d0
> Mar 29 00:12:09 5000 kernel:  [die+244/400] die+0xf4/0x190
> Mar 29 00:12:09 5000 kernel:  [<c0104244>] die+0xf4/0x190
> Mar 29 00:12:09 5000 kernel:  [do_page_fault+1172/1715] do_page_fault+0x494/0x6b3
> Mar 29 00:12:09 5000 kernel:  [<c0117394>] do_page_fault+0x494/0x6b3
> Mar 29 00:12:09 5000 kernel:  [error_code+43/48] error_code+0x2b/0x30
> Mar 29 00:12:09 5000 kernel:  [<c0103aeb>] error_code+0x2b/0x30
> Mar 29 00:12:09 5000 kernel:  [pg0+943799680/1069495296] megaraid_queue_command+0x50/0x90 [megaraid_mbox]
> Mar 29 00:12:09 5000 kernel:  [<f881ed80>] megaraid_queue_command+0x50/0x90 [megaraid_mbox]
> Mar 29 00:12:09 5000 kernel:  [pg0+943941731/1069495296] scsi_dispatch_cmd+0x173/0x290 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [<f8841863>] scsi_dispatch_cmd+0x173/0x290 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [pg0+943966809/1069495296] scsi_request_fn+0x1e9/0x430 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [blk_run_queue+42/64] blk_run_queue+0x2a/0x40
> Mar 29 00:12:09 5000 kernel:  [<c023aeaa>] blk_run_queue+0x2a/0x40
> Mar 29 00:12:09 5000 kernel:  [pg0+943963243/1069495296] scsi_run_host_queues+0x2b/0x50 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [<f8846c6b>] scsi_run_host_queues+0x2b/0x50 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [pg0+943960213/1069495296] scsi_error_handler+0x85/0x170 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [<f8846095>] scsi_error_handler+0x85/0x170 [scsi_mod]
> Mar 29 00:12:09 5000 kernel:  [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10
> Mar 29 00:12:09 5000 kernel:  [<c01012d5>] kernel_thread_helper+0x5/0x10
> Mar 29 00:12:09 5000 kernel: Code: 2c 82 f8 c7 47 20 01 00 00 00 8b 4d 9c 85 c9 74 39 8b 4d 9c 31 db 8d b6 00 00 00 00 8d bf 00 00 00 00 8b 55 a0 8b 42 10 8b 56 08 <89> 14 18 31 d2 89 54 18 04 8b 45 a0 8b 50 10 8b 46 0c 83 c6 10
> Mar 29 00:14:23 5000 kernel:  <4>megaraid cmm: ioctl timed out
> Mar 29 00:14:23 5000 kernel: megaraid cmm: controller cannot accept cmds due to earlier errors
> Mar 29 00:14:24 5000 last message repeated 3 times
> ...
> until reboot
> 
> I know everyone will rant about ... there is a taint, I just do not
> have immediate access to the logs (which) do exist without the taint
> marker set.
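
A note on reading the oops quoted above: on 32-bit x86 the number printed after "Oops:" is the page-fault error code, and its low three bits say what kind of access faulted. Below is a minimal sketch of that decoding in C, assuming the standard x86 bit layout (bit 0 present/protection, bit 1 write, bit 2 user mode). The struct and function names are made up for illustration and are not kernel APIs.

```c
#include <stdio.h>

/* Decoded view of the low bits of an x86 page-fault error code. */
struct pf_info {
	int protection;	/* bit 0: 1 = protection violation, 0 = page not present */
	int write;	/* bit 1: 1 = write access, 0 = read access */
	int user;	/* bit 2: 1 = fault taken in user mode, 0 = kernel mode */
};

/* Hypothetical helper: split the error code into its three low flags. */
struct pf_info decode_pf_error(unsigned long code)
{
	struct pf_info i;

	i.protection = (code & 0x1) != 0;
	i.write = (code & 0x2) != 0;
	i.user = (code & 0x4) != 0;
	return i;
}

/* Hypothetical helper: print the decoded flags in words. */
void print_pf_info(unsigned long code)
{
	struct pf_info i = decode_pf_error(code);

	printf("Oops: %04lx -> %s, %s access, %s mode\n", code,
	       i.protection ? "protection violation" : "page not present",
	       i.write ? "write" : "read",
	       i.user ? "user" : "kernel");
}
```

For the 0002 in the log this gives a write, from kernel mode, to a page that was not present. That is consistent with the NULL dereference at virtual address 00000000 reported just before it: the marked bytes <89> 14 18 in the Code: line appear to be a store of edx to the address eax+ebx, and both eax and ebx are zero in the register dump.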
> 
> I will post the patch on kernel.org, where it can be adopted or dumped.
> The posting to the list was to follow the patch submission rules.
> 
> Cheers,
> 
> Andre Hedrick
> LAD Storage Consulting Group
> 
> On Tue, 16 May 2006, Ju, Seokmann wrote:
> 
> > Hi,
> > 
> > I cannot agree with the changes in the patch, for the following reasons.
> > 
> > On Tuesday, May 16, 2006 1:44 PM, Andre Hedrick wrote:
> > > Random (hard to reproduce, without a noise injection into the SATA
> > > connector or cable) hardware error states which lock the card and in the
> > > majority of the cases caused the array to be lost.  If the array was not
> > > lost then a drive was failed but one could not remove/replace w/ a new
> > > drive.  Thus adding in a pci_master_abort test and clear function proved
> > > to allow recovery in all cases where the card shut down communication to
> > > the host.  This may not address all cases; however, clearly this is a
> > > missing part of the driver base when entry to eh_scsi_* begins.
> > If 'raid_dev->hw_error' is non-zero, this means that the controller has
> > gone bad and will not be able to recover without a reboot (and should
> > not, to avoid further memory corruption).
> > The overall issue described here is already taken care of by the patch
> > that I've submitted.
> > The patch has been accepted and should be available in 2.6.17-rc1-mm3,
> > as specified in Andrew Morton's email.
> > > The compound issue in the failed recovery resulted in a deref NULL pointer
> > > in the various list_head calls.  After changing the individual list_add to
> > > list_move and such, the NULL point issue has never shown up in the past 6
> > > weeks of heavy testing.
> > I'm not sure how these changes help with the issue. Furthermore, I'm not
> > sure what _the NULL point issue_ refers to. If you see the issue with the
> > driver available in 2.6.17-rc1-mm3, please let me know.
> > The following link leads to further details of the patch.
> > 
> > http://www.kernel.org/git/?p=linux/kernel/git/jejb/scsi-rc-fixes-2.6.git;a=commit;h=c005fb4fb2d23ba29ad21dee5042b2f8451ca8ba
> > 
> > Thank you,
> > 
> > Seokmann
> > 
> > > -----Original Message-----
> > > From: Andre Hedrick [mailto:andre@xxxxxxxxxxxxx] 
> > > Sent: Tuesday, May 16, 2006 1:44 PM
> > > To: linux-scsi@xxxxxxxxxxxxxxx; Ju, Seokmann; Andrew Morton
> > > Cc: James Bottomley; Christoph Hellwig; Mukker, Atul
> > > Subject: [RFC] Megaraid update, submission
> > > 
> > > 
> > > Linux-scsi, et al.
> > > 
> > > The following patch addresses two major issues found under extensive
> > > testing.
> > > 
> > > While pounding data io down the card and performing large scale queries to
> > > the controller about device state and function parameters, the following
> > > were discovered.
> > > 
> > > Random (hard to reproduce, without a noise injection into the SATA
> > > connector or cable) hardware error states which lock the card and in the
> > > majority of the cases caused the array to be lost.  If the array was not
> > > lost then a drive was failed but one could not remove/replace w/ a new
> > > drive.  Thus adding in a pci_master_abort test and clear function proved
> > > to allow recovery in all cases where the card shut down communication to
> > > the host.  This may not address all cases; however, clearly this is a
> > > missing part of the driver base when entry to eh_scsi_* begins.
> > > 
> > > The compound issue in the failed recovery resulted in a deref NULL pointer
> > > in the various list_head calls.  After changing the individual list_add to
> > > list_move and such, the NULL point issue has never shown up in the past 6
> > > weeks of heavy testing.
> > > 
> > > In all cases in the past, the baseline for error was 6:1.  Meaning either
> > > one system in six failed and/or one in six test/stress runs failed.  With
> > > the attached changes, there have been zero failures in the past three
> > > weeks.  This sounds great, but I wish it would fail to allow some
> > > statistics of improved error handling.
> > > 
> > > Please note the changes to SAS are minor and not tested, but seem correct
> > > for the entire directory code base.  SAS shares the CMM core with MBOX,
> > > thus the rationale for changes to SAS.
> > > 
> > > Please comment and provide suggestions.
> > > 
> > > Cheers,
> > > 
> > > Andre Hedrick
> > > LAD Storage Consulting Group
> > > 
> > > 
> > > 
> > > 
> > -
> > : send the line "unsubscribe 
> linux-scsi" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> 
> 
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
