Re: lpfc RAID1 device panics when one device goes away

The configuration I'm using is:
LUN A -- Storage Processor A -- Fibre Channel Switch A -- HBA A -- /dev/sdc
LUN B -- Storage Processor B -- Fibre Channel Switch B -- HBA B -- /dev/sde

There are two complete SCSI paths, one to each LUN (disk). Regardless of which component fails in either path, the system "sees" the SCSI disk as failed and continues I/O to the other device specified in /etc/raidtab and/or /etc/mdadm.conf.
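
One way to confirm that the mirror really is continuing on the surviving device is to look at /proc/mdstat after a path test. The following is only a rough sketch: the function name and the sample text are illustrative assumptions (the sample is not captured from the system above), and it relies on the usual mdstat conventions of a "(F)" suffix on failed members and a "[wanted/working]" counter.

```python
import re

# Matches lines such as "md10 : active raid1 sde[1](F) sdc[0]"
HEADER = re.compile(r"^(md\d+)\s*:\s*active\s+\S+\s+(.*)$")
# Matches the status counters, e.g. "[2/1] [U_]"
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

# Illustrative sample only, written by hand for this sketch.
SAMPLE_MDSTAT = """Personalities : [raid1]
md10 : active raid1 sde[1](F) sdc[0]
      8385856 blocks [2/1] [U_]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return {array: [failed members]} for arrays running short of devices."""
    failed_by_array = {}
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            current = header.group(1)
            # Members flagged "(F)" have been marked faulty by md.
            failed_by_array[current] = [
                member.split("[")[0]
                for member in header.group(2).split()
                if member.endswith("(F)")
            ]
            continue
        status = STATUS.search(line)
        if status and current is not None:
            wanted, working = int(status.group(1)), int(status.group(2))
            if working < wanted:
                degraded[current] = failed_by_array[current]
            current = None
    return degraded

print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system the same function could be given the contents of /proc/mdstat instead of the sample string.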

I seem to have fixed the panic problem by adjusting some of the lpfc driver tunables to reduce the number of outstanding I/O requests and to zero the delay before errors are reported upward to the SCSI driver. I'm convinced that when I disabled one switch port to simulate a path failure, the timeout and/or the number of outstanding I/O requests eventually caused a flood of failures to the SCSI driver, which overflowed somewhere and caused the panic. I'm still working on the tuning to make sure I haven't inadvertently hurt I/O performance while fixing the problem.
-Mark
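
The flood of failures shows up in the console output quoted further down as repeated "I/O error: dev 08:40" lines. As a rough aid for comparing test runs, the sketch below tallies those lines per device. The function names are my own, and it assumes the 2.4-style hex "major:minor" device format and the standard SCSI disk numbering (major 8, 16 minors per disk), under which 08:40 works out to sde, matching the "Disk failure on sde" message.

```python
import re
from collections import Counter

# Device numbers in these messages are hexadecimal major:minor pairs.
IO_ERROR = re.compile(r"I/O error: dev ([0-9a-f]{2}):([0-9a-f]{2}), sector (\d+)")

def sd_name(major, minor):
    """Map a major/minor pair on SCSI disk major 8 to an sdX[N] name."""
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    disk, partition = divmod(minor, 16)  # 16 minors are reserved per disk
    name = "sd" + chr(ord("a") + disk)
    return name + (str(partition) if partition else "")

def tally_io_errors(console_text):
    """Count intact I/O error lines per device; garbled lines are skipped."""
    counts = Counter()
    for major_hex, minor_hex, _sector in IO_ERROR.findall(console_text):
        counts[sd_name(int(major_hex, 16), int(minor_hex, 16))] += 1
    return counts

# Two intact lines copied from the quoted console output.
sample = (
    "I/O error: dev 08:40, sector 70784\n"
    "I/O error: dev 08:40, sector 71032\n"
)
print(dict(tally_io_errors(sample)))
```

Lines that were mangled by the interleaved console output simply do not match the pattern, so the count is a lower bound.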


Hamilton Andrew wrote:
Are we talking about a failure of one of the HBAs or a failure of a drive? I thought we were talking about the HBA failing, which is very different from a drive failing.

I agree with the LUNs part. That is exactly the way I see it as well. However, in your case you have 2 connections to the 2 LUNs. In a locally attached setup you typically have one SCSI connection to the raid array, not two. Granted, I would think it wouldn't work any differently from the raid point of view whether you had 1 SCSI connection or 2, but if you were using two connections to run your array and one of them went down, how would the system know how to handle a lost SCSI card? It panics, in my experience.

I know there are hardware raid solutions that will fail over if one of the raid controllers fails. I also know that there are software solutions, but I think you have to have an external software piece to do it. The kernel/OS isn't going to know by default how to "fail over" the connection. A drive failure is different: software raid knows how to handle that, it is fairly standard, and if you have a backup it just moves to that. But I think handling a SCSI failure would be very configuration dependent and would under normal circumstances cause a panic, unless you had something intervening to catch those kinds of failures.

I have 1 internal SCSI controller, 1 SCSI card, and 1 HBA. All of them act as SCSI controllers. I'm also running software raid on the local drives and have a raid on the SAN. If I had a SCSI card failure and the SCSI card and the local SCSI controller were talking to the local raid, how would the machine react? I wouldn't even know how to tell it to fail over to the SAN raid or use the other SCSI card to talk to the raid and ignore the failure without some sort of intervening software.

Drew

-----Original Message-----
From: Bruen, Mark [mailto:mbruen@xxxxxxxxxxxxxx]
Sent: Friday, January 30, 2004 9:52 AM
To: redhat-list@xxxxxxxxxx
Subject: Re: lpfc RAID1 device panics when one device goes away


Actually, I view the configuration as identical to having two locally attached SCSI disks which are mirrored via software RAID1. The only difference is that the two "drives" (LUNs) are located on a storage array on a SAN. As far as the OS is concerned, the two LUNs are just two separate SCSI drives. I'm speculating that the lpfc driver either does not handle this case or requires tuning parameters to be set in order to return the failed-path information back up to the SCSI driver in a manner which won't cause a panic.
-Mark
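
To make the "two separate SCSI drives" view concrete, here is a small sketch that prints an /etc/raidtab stanza for such a mirror. The md device and member names come from this thread; the helper name, the chunk-size value and the choice of persistent-superblock 1 are assumptions for illustration and not the actual file in use.

```python
def render_raidtab(md_device, members, chunk_size_kb=4):
    """Render a raidtools-style RAID1 stanza for the given member devices."""
    lines = [
        f"raiddev {md_device}",
        "    raid-level            1",
        f"    nr-raid-disks         {len(members)}",
        "    nr-spare-disks        0",
        "    persistent-superblock 1",   # assumed; needed for autodetection
        f"    chunk-size            {chunk_size_kb}",  # assumed value
    ]
    # Each member is listed as a device line followed by its slot number.
    for index, device in enumerate(members):
        lines.append(f"    device                {device}")
        lines.append(f"    raid-disk             {index}")
    return "\n".join(lines) + "\n"

print(render_raidtab("/dev/md10", ["/dev/sdc", "/dev/sde"]))
```

Nothing in the stanza distinguishes SAN LUNs from local disks, which is the point being made above.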

Hamilton Andrew wrote:
> Mark,
>
> I may be wrong here and maybe someone out there knows better, but I
> don't think this will work without PowerPath. That allows your OS to
> treat both your HBAs as one, and it load balances across the two
> HBAs. Without that you have two independent connections to two LUNs,
> and that is what is causing the panic. You need something that will
> treat both your connections as one connection. Even if both your HBAs
> can talk to both LUNs, the OS is not going to fail over to the one that
> is working without some sort of go-between, and the kernel does not know
> it can talk to both LUNs via either HBA. It just knows that it had 2
> connections to the raid and one of them is gone, so the raid is no longer
> available. At least that is the way it would seem to work to me.
>
> My 2 cents. Let me know if you find out something different though.
>
> Drew
>
> -----Original Message-----
> From: Bruen, Mark [mailto:mbruen@xxxxxxxxxxxxxx]
> Sent: Friday, January 30, 2004 8:54 AM
> To: redhat-list@xxxxxxxxxx
> Subject: Re: lpfc RAID1 device panics when one device goes away
>
>
> No, it worked once, but on the next test it panicked again. I'll keep
> looking.
> -Mark
>
> Hamilton Andrew wrote:
> > Did that fix it? I have an EMC CX600 configured much the same way, but
> > I'm using RHEL 2.1AS instead of 3.0. I'm sure there are a ton of
> > differences between the two distros.
> >
> > -----Original Message-----
> > From: Bruen, Mark [mailto:mbruen@xxxxxxxxxxxxxx]
> > Sent: Wednesday, January 28, 2004 7:09 PM
> > To: redhat-list@xxxxxxxxxx
> > Subject: Re: lpfc RAID1 device panics when one device goes away
> >
> >
> > I think I have fixed this by changing the partition type of each LUN's
> > (disk) partition to "fd" (Linux raid auto).
> >
> > Bruen, Mark wrote:
> > > That will be the config once Veritas and/or EMC support HBA path
> > > failover on RedHat AS 3.0. Veritas will support it with DMP in version 4,
> > > due in Q2/04; EMC has not committed to a date yet with PowerPath. In the
> > > interim I'm trying to provide path failover using software RAID1 of two
> > > hardware RAID5 LUNs, one on each path (two switches connected to two
> > > storage processors connected to two HBAs per server).
> > > -Mark
> > >
> > > Hamilton Andrew wrote:
> > >
> > >> What's your SAN? Why don't you configure your raid1 on the SAN and
> > >> let it publish that raid group as 1 LUN? Are you using any kind of
> > >> fibre switch between your cards and your SAN?
> > >>
> > >> Drew
> > >>
> > >> -----Original Message-----
> > >> From: Bruen, Mark [mailto:mbruen@xxxxxxxxxxxxxx]
> > >> Sent: Wednesday, January 28, 2004 3:28 PM
> > >> To: redhat-list@xxxxxxxxxx
> > >> Subject: lpfc RAID1 device panics when one device goes away
> > >>
> > >>
> > >> I'm running RedHat AS 3.0 kernel 2.4.21-4.ELsmp on a Dell 1750 with 2
> > >> Emulex LP9002DC-E HBAs. I've configured a RAID1 device called /dev/md10
> > >> from 2 SAN-based LUNs, /dev/sdc and /dev/sde. Everything works fine until
> > >> I disable one of the HBA paths to the disk. Here's the console output:
> > >> [root@reacher root]# !lpfc1:1306:LKe:Link Down Event received Data: x2 x2 x0 x20
> > >> I/O error: dev 08:40, sector 69792
> > >> raid1: Disk failure on sde, disabling device.
> > >> Operation continuing on 1 devices
> > >> md10: vno@ pspar2e! d?i@
> > >> s@kq tAo rec@oqnAst`rIu/Oc
> > >> t AaqArra@qyA!@
> > >> -v-@ cpont
> > >> inI/uOinhgr oihn de_g_r_a_m@vqA@`@ 70288
> > >> I/O error: dev 08`I/O sector 70536
> > >> I/O error: dev 08:40, sector 70784
> > >> I/O error: dev 08:40, sector 71032
> > >> I/O error: dev 08:40, sector 71280
> > >> I/O error@qA@v@p2!?@
> > >> AqA@qA`I/O
> > >> BqA@qA@v@p I/Oh 7h____mv@`dev 08:40,
> > >> sector 72024
> > >> `I/Oerror: dev 08:40, sector 72272
> > >> I/O error: dev 08:40, sector 72520
> > >> I/O error: dev 08:40, sector 72768
> > >> I/O error: dev 08:40, sector 73@qA@v@p2!?@
> > >> BqA@qA`I/O
> > >> CqA@qA@v@p
> > >> I/Ohdeh____mv@`2
> > >> I/O error: dev 08:40, `I/Oor 73760
> > >> I/O error: dev 08:40, sector 74008
> > >> I/O error: dev 08:40, sector 74256
> > >> I/O error: dev 08:40, sector 74504
> > >> I/O error: dev@qA@v@p2!?@
> > >> CqA@qA`I/O
> > >> DqA@qA@v@p I/Oh0
> > >> h____mv@`8:40, sector 75248
> > >> I/O e`I/O: dev 08:40, sector 75496
> > >> I/O error: dev 08:40, sector 75744
> > >> I/O error: dev 08:40, sector 75992
> > >> I/O error: dev 08:40, sector 76240
> > >> <@qA@v@p2!?@
> > >> DqA@qA`I/O
> > >> EqA@qA@v@p I/Oh8:h____mv@` I/O error: dev
> 08:40,
> > >> secto`I/O984
> > >> I/O error: dev 08:40, sector 77232
> > >> I/O error: dev 08:40, sector 77480
> > >> I/O error: dev 08:40, sector 77728
> > >> I/O error: dev 08:4@qA@v@p2!?@
> > >> EqA@qA`I/O
> > >> FqA@qA@v@p I/Oh Ih____mv@`
> > >> sector 78352
> > >> I/O error:`I/O 08:40, sector 78600
> > >> I/O error: dev 08:40, sector 78848
> > >> I/O error: dev 08:40, sector 79096
> > >> I/O error: dev 08:40, sector 79344
> > >> I/@qA@v@p2!?@
> > >> FqA@qA`I/O
> > >> GqA@qA@v@p I/Oh sh____mv@`error: dev 08:40,
> > >> sector
> > >> 800`I/O4> I/O error: dev 08:40, sector 80336
> > >> I/O error: dev 08:40, sector 80584
> > >> I/O error: dev 08:40, sector 80832
> > >> I/O error: dev 08:40, se@qA@v@p2!?@
> > >> GqA@qA`I/O
> > >> HqA@qA@v@p
> > >> I/Oherh____mv@`or 81576
> > >> I/O error: dev `I/O0, sector 81824
> > >> I/O error: dev 08:40, sector 82072
> > >> I/O error: dev 08:40, sector 82320
> > >> I/O error: dev 08:40, sector 82568
> > >> I/O err@qA@v@p2!?@
> > >> HqA@qA`I/O
> > >> IqA@qA@v@p I/Ohorh____mv@`: dev 08:40,
> > >> sector 83312
> > >> <4`I/OO error: dev 08:40, sector 83560
> > >> I/O error: dev 08:40, sector 83808
> > >> I/O error: dev 08:40, sector 84056
> > >> Unable to handle kernel paging request at virtual address a0fb8488
> > >> printing eip:
> > >> c011f694
> > >> *pde = 00000000
> > >> Oops: 0000
> > >> lp parport autofs tg3 floppy microcode keybdev mousedev hid input
> > >> usb-ohci
> > >> usbcore ext3 jbd raid1 raid0 lpfcdd mptscsih mptbase sd_mod scsi_mod
> > >> CPU: -1041286984
> > >> EIP: 0060:[<c011f694>] Not tainted
> > >> EFLAGS: 00010087
> > >>
> > >> EIP is at do_page_fault [kernel] 0x54 (2.4.21-4.ELsmp)
> > >> eax: f55ac544 ebx: f55ac544 ecx: a0fb8488 edx: e0b3c000
> > >> esi: c1ef4000 edi: c011f640 ebp: 000000f0 esp: c1ef40c0
> > >> ds: 0068 es: 0068 ss: 0068
> > >> Process Dmu (pid: 0, stackpage=c1ef3000)
> > >> Stack: 00000000 00000002 022c1008 c1eeee4c c1eff274 00000000 00000000 a0fb8488
> > >>        c17c4520 f58903f4 00000000 c1efd764 c1eee5fc f7fe53c4 00030001 00000000
> > >>        00000002 022c100c c1efd780 c1eeba44 00000000 00000000 00000003 c1b968ec
> > >> Call Trace: [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4178)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef419c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef41b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4278)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef429c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef42b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4378)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef439c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef43b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4478)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef449c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef44b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4578)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef459c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef45b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4678)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef469c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef46b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4778)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef479c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef47b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4878)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef489c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef48b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4978)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef499c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef49b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4a78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4a9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4ab4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4b78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4b9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4bb4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4c78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4c9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4cb4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4d78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4d9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4db4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4e78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4e9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4eb4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4f78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4f9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4fb4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5078)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef509c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef50b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5178)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef519c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef51b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5278)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef529c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef52b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5378)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef539c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef53b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5478)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef549c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef54b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5578)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef559c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef55b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5678)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef569c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef56b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5778)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef579c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef57b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5878)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef589c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef58b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5978)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef599c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef59b4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5a78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5a9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5ab4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5b78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5b9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5bb4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5c78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5c9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5cb4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5d78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5d9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5db4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5e78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5e9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5eb4)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5f78)
> > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5f9c)
> > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5fb4)
> > >>
> > >> Code: 8b 82 88 c4 47 c0 8b ba 84 c4 47 c0 01 f8 85 c0 0f 85 46 01
> > >>
> > >> Kernel panic: Fatal exception
> > >>
> > >> Any Ideas?
> > >> Thanks.
> > >> -Mark
> > >>
> > >>
> > >> --
> > >> redhat-list mailing list
> > >> unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
> > >> https://www.redhat.com/mailman/listinfo/redhat-list



