Ahhh. Now I see what you mean. I thought you were trying to fail the HBA connection over to write to both disks, i.e., if HBA A goes down, have it use HBA B to write to /dev/sdc. I see you just meant to handle it the way a single-SCSI-card system would. My density never ceases to amaze me sometimes... Glad you got it worked out...
Drew
-----Original Message-----
From: Bruen, Mark [mailto:mbruen@xxxxxxxxxxxxxx]
Sent: Wednesday, February 04, 2004 1:25 PM
To: redhat-list@xxxxxxxxxx
Subject: Re: lpfc RAID1 device panics when one device goes away
The configuration I'm using is:
LUN A -- Storage Processor A -- Fibre Channel Switch A -- HBA A -- /dev/sdc
LUN B -- Storage Processor B -- Fibre Channel Switch B -- HBA B -- /dev/sde
Two complete SCSI paths, one to each LUN (disk). Regardless of which component
fails in either path, the system "sees" that SCSI disk as failed and continues
I/O to the other device specified in /etc/raidtab and/or /etc/mdadm.conf.
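For reference, the relevant entries look roughly like this (a minimal sketch;
I'm assuming one whole-disk partition per LUN, so the exact partition names
may differ from what's on the box):

    # /etc/raidtab
    raiddev /dev/md10
        raid-level            1
        nr-raid-disks         2
        persistent-superblock 1
        chunk-size            64
        device                /dev/sdc1
        raid-disk             0
        device                /dev/sde1
        raid-disk             1

    # /etc/mdadm.conf
    DEVICE /dev/sdc1 /dev/sde1
    ARRAY /dev/md10 level=raid1 num-devices=2 devices=/dev/sdc1,/dev/sde1
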
I seem to have fixed the panic problem by adjusting some of the lpfc driver
tunables to reduce the number of outstanding I/O requests and to zero the
delay before errors are reported upward to the SCSI driver. I'm convinced
that when I disabled one switch port to simulate a path failure, the timeout
and/or the number of outstanding I/O requests eventually caused a flood of
failures to the SCSI driver, which overflowed somewhere and caused the panic.
I'm still working on the tuning to make sure I haven't inadvertently caused
poor I/O performance while fixing the problem.
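Concretely, that amounts to module options along these lines in
/etc/modules.conf (a sketch only; the parameter names and values here are
illustrative and depend on the Emulex driver version, so check the lpfc
documentation before copying them):

    # Cap outstanding commands per LUN and report a lost device to the
    # SCSI layer immediately instead of queueing I/O during the timeout.
    options lpfcdd lpfc_lun_queue_depth=8 lpfc_nodev_tmo=0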
-Mark
Hamilton Andrew wrote:
> Are we talking about a failure of one of the HBAs or a failure of a
> drive? I thought we were talking about the HBA failing, which is far
> different from a drive failing.
>
> I agree with the LUNs part. That is exactly the way I see it as well.
> However, in your case you have two connections to the two LUNs. In a
> locally attached setup you typically have one SCSI connection to the RAID
> array, not two. Granted, I would think it wouldn't work any differently
> from the RAID point of view whether you had one SCSI connection or two,
> but if you were using two connections to run your array and one of them
> went down, how would the system know how to handle a lost SCSI card?
> Panic, in my experience. I know there are hardware RAID solutions that
> will fail over if one of the RAID controllers fails. I also know that
> there are software solutions, but I think you have to have an external
> software piece to do it. The kernel/OS isn't going to know by default
> how to "fail over" the connection. If you had a drive fail, that's
> different. The software RAID knows how to handle a drive failure;
> handling a drive failure is fairly standard, and if you have a backup it
> just moves to that. But I think handling a SCSI failure would be very
> configuration dependent and would, under normal circumstances, cause a
> panic unless you had something intervening to catch those kinds of
> failures.
>
> I have one internal SCSI controller, one SCSI card, and one HBA, all of
> which act as SCSI controllers. I'm also running software RAID on the
> local drives and have a RAID on the SAN. If the SCSI card and the local
> SCSI controller were both talking to the local RAID and the SCSI card
> failed, how would the machine react? I wouldn't even know how to tell it
> to fail over to the SAN RAID, or to use the other controller to talk to
> the local RAID and ignore the failure, without some sort of intervening
> software.
>
> Drew
>
> -----Original Message-----
> From: Bruen, Mark [mailto:mbruen@xxxxxxxxxxxxxx]
> Sent: Friday, January 30, 2004 9:52 AM
> To: redhat-list@xxxxxxxxxx
> Subject: Re: lpfc RAID1 device panics when one device goes away
>
>
> Actually, I view the configuration as identical to having two locally
> attached SCSI disks mirrored via software RAID1. The only difference is
> that the two "drives" (LUNs) are located on a storage array on a SAN. As
> far as the OS is concerned, the two LUNs are just two separate SCSI
> drives. I'm speculating that the lpfc driver either does not handle the
> failed path at all, or requires tuning parameters to be set, in order to
> return the failure to the SCSI driver in a manner that won't cause a
> panic.
> -Mark
>
> Hamilton Andrew wrote:
> > Mark,
> >
> > I may be wrong here, and maybe someone out there knows better, but I
> > don't think this will work without PowerPath. That allows your OS to
> > treat both your HBAs as one, and it load balances across the two HBAs.
> > Without that you have two independent connections to two LUNs, and that
> > is what is causing the panic. You need something that will treat both
> > your connections as one connection. Even if both your HBAs can talk to
> > both LUNs, the OS is not going to fail over to the one that is working
> > without some sort of go-between, and the kernel does not know it can
> > talk to both LUNs via either HBA. It just knows that it had two
> > connections to the RAID and one of them is gone, so the RAID is no
> > longer available. At least that is the way it would seem to work to me.
> >
> > My 2 cents. Let me know if you find out something different though.
> >
> > Drew
> >
> > -----Original Message-----
> > From: Bruen, Mark [mailto:mbruen@xxxxxxxxxxxxxx]
> > Sent: Friday, January 30, 2004 8:54 AM
> > To: redhat-list@xxxxxxxxxx
> > Subject: Re: lpfc RAID1 device panics when one device goes away
> >
> >
> > No, it worked once, but then it panicked again on the next test. I'll
> > keep looking.
> > -Mark
> >
> > Hamilton Andrew wrote:
> > > Did that fix it? I have an EMC CX600 configured much the same way, but
> > > I'm using RHEL 2.1AS instead of 3.0. I'm sure there are a ton of
> > > differences between the two distros.
> > >
> > > -----Original Message-----
> > > From: Bruen, Mark [mailto:mbruen@xxxxxxxxxxxxxx]
> > > Sent: Wednesday, January 28, 2004 7:09 PM
> > > To: redhat-list@xxxxxxxxxx
> > > Subject: Re: lpfc RAID1 device panics when one device goes away
> > >
> > >
> > > I think I have fixed this by changing the partition type of each LUN's
> > > (disk) partition to "fd" (Linux raid auto).
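> > >
> > > For anyone doing the same, the change is just the usual fdisk type
> > > change on each LUN (a sketch, assuming a single partition per disk;
> > > sfdisk would work just as well):
> > >
> > >    fdisk /dev/sdc      (then repeat for /dev/sde)
> > >      t                 change a partition's type
> > >      fd                Linux raid autodetect
> > >      w                 write the table and exit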
> > >
> > > Bruen, Mark wrote:
> > > > That will be the config once Veritas and/or EMC support HBA path
> > > > failover on RedHat AS 3.0. Veritas will support it with DMP in
> > > > version 4, due in Q2/04; EMC has not committed to a date yet for
> > > > PowerPath. In the interim I'm trying to provide path failover using
> > > > software RAID1 of two hardware RAID5 LUNs, one on each path (two
> > > > switches connected to two storage processors connected to two HBAs
> > > > per server).
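> > > >
> > > > Creating that interim mirror is just an ordinary md RAID1 over the
> > > > two LUNs, roughly (a sketch with illustrative partition names; mkraid
> > > > with the matching /etc/raidtab entry would do the same thing):
> > > >
> > > >    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdc1 /dev/sde1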
> > > > -Mark
> > > >
> > > > Hamilton Andrew wrote:
> > > >
> > > >> What's your SAN? Why don't you configure your RAID1 on the SAN and
> > > >> let it publish that RAID group as one LUN? Are you using any kind of
> > > >> fibre switch between your cards and your SAN?
> > > >>
> > > >> Drew
> > > >>
> > > >> -----Original Message-----
> > > >> From: Bruen, Mark [mailto:mbruen@xxxxxxxxxxxxxx]
> > > >> Sent: Wednesday, January 28, 2004 3:28 PM
> > > >> To: redhat-list@xxxxxxxxxx
> > > >> Subject: lpfc RAID1 device panics when one device goes away
> > > >>
> > > >>
> > > >> I'm running RedHat AS 3.0 kernel 2.4.21-4.ELsmp on a Dell 1750 with
> > > >> 2 Emulex LP9002DC-E HBAs. I've configured a RAID1 device called
> > > >> /dev/md10 from 2 SAN-based LUNs, /dev/sdc and /dev/sde. Everything
> > > >> works fine until I disable one of the HBA paths to the disk. Here's
> > > >> the console output:
> > > >> [root@reacher root]# !lpfc1:1306:LKe:Link Down Event received
> > Data: x2
> > > >> x2 x0 x20
> > > >> I/O error: dev 08:40, sector 69792
> > > >> raid1: Disk failure on sde, disabling device.
> > > >> Operation continuing on 1 devices
> > > >> md10: vno@ pspar2e! d?i@
> > > >> s@kq tAo rec@oqnAst`rIu/Oc
> > > >> t
> AaqArra@qyA!@
> > > >> -v-@ cpont
> > > >> inI/uOinhgr oihn de_g_r_a_m@vqA@`@ 70288
> > > >> I/O error: dev 08`I/O sector 70536
> > > >> I/O error: dev 08:40, sector 70784
> > > >> I/O error: dev 08:40, sector 71032
> > > >> I/O error: dev 08:40, sector 71280
> > > >> I/O error@qA@v@p2!?@
> > > >> AqA@qA`I/O
> > > >> BqA@qA@v@p I/Oh 7h____mv@`dev
> 08:40,
> > > >> sector 72024
> > > >> `I/Oerror: dev 08:40, sector 72272
> > > >> I/O error: dev 08:40, sector 72520
> > > >> I/O error: dev 08:40, sector 72768
> > > >> I/O error: dev 08:40, sector 73@qA@v@p2!?@
> > > >> BqA@qA`I/O
> > > >> CqA@qA@v@p
> > > >> I/Ohdeh____mv@`2
> > > >> I/O error: dev 08:40, `I/Oor 73760
> > > >> I/O error: dev 08:40, sector 74008
> > > >> I/O error: dev 08:40, sector 74256
> > > >> I/O error: dev 08:40, sector 74504
> > > >> I/O error: dev@qA@v@p2!?@
> > > >> CqA@qA`I/O
> > > >> DqA@qA@v@p I/Oh0
> > > >> h____mv@`8:40, sector 75248
> > > >> I/O e`I/O: dev 08:40, sector 75496
> > > >> I/O error: dev 08:40, sector 75744
> > > >> I/O error: dev 08:40, sector 75992
> > > >> I/O error: dev 08:40, sector 76240
> > > >> <@qA@v@p2!?@
> > > >> DqA@qA`I/O
> > > >> EqA@qA@v@p I/Oh8:h____mv@` I/O error: dev
> > 08:40,
> > > >> secto`I/O984
> > > >> I/O error: dev 08:40, sector 77232
> > > >> I/O error: dev 08:40, sector 77480
> > > >> I/O error: dev 08:40, sector 77728
> > > >> I/O error: dev 08:4@qA@v@p2!?@
> > > >> EqA@qA`I/O
> > > >> FqA@qA@v@p I/Oh
> Ih____mv@`
> > > >> sector 78352
> > > >> I/O error:`I/O 08:40, sector 78600
> > > >> I/O error: dev 08:40, sector 78848
> > > >> I/O error: dev 08:40, sector 79096
> > > >> I/O error: dev 08:40, sector 79344
> > > >> I/@qA@v@p2!?@
> > > >> FqA@qA`I/O
> > > >> GqA@qA@v@p I/Oh sh____mv@`error: dev
> 08:40,
> > > >> sector
> > > >> 800`I/O4> I/O error: dev 08:40, sector 80336
> > > >> I/O error: dev 08:40, sector 80584
> > > >> I/O error: dev 08:40, sector 80832
> > > >> I/O error: dev 08:40, se@qA@v@p2!?@
> > > >> GqA@qA`I/O
> > > >> HqA@qA@v@p
> > > >> I/Oherh____mv@`or 81576
> > > >> I/O error: dev `I/O0, sector 81824
> > > >> I/O error: dev 08:40, sector 82072
> > > >> I/O error: dev 08:40, sector 82320
> > > >> I/O error: dev 08:40, sector 82568
> > > >> I/O err@qA@v@p2!?@
> > > >> HqA@qA`I/O
> > > >> IqA@qA@v@p I/Ohorh____mv@`: dev
> 08:40,
> > > >> sector 83312
> > > >> <4`I/OO error: dev 08:40, sector 83560
> > > >> I/O error: dev 08:40, sector 83808
> > > >> I/O error: dev 08:40, sector 84056
> > > >> Unable to handle kernel paging request at virtual address
> a0fb8488
> > > >> printing eip:
> > > >> c011f694
> > > >> *pde = 00000000
> > > >> Oops: 0000
> > > >> lp parport autofs tg3 floppy microcode keybdev mousedev hid input
> > > >> usb-ohci
> > > >> usbcore ext3 jbd raid1 raid0 lpfcdd mptscsih mptbase sd_mod
> scsi_mod
> > > >> CPU: -1041286984
> > > >> EIP: 0060:[<c011f694>] Not tainted
> > > >> EFLAGS: 00010087
> > > >>
> > > >> EIP is at do_page_fault [kernel] 0x54 (2.4.21-4.ELsmp)
> > > >> eax: f55ac544 ebx: f55ac544 ecx: a0fb8488 edx: e0b3c000
> > > >> esi: c1ef4000 edi: c011f640 ebp: 000000f0 esp: c1ef40c0
> > > >> ds: 0068 es: 0068 ss: 0068
> > > >> Process Dmu (pid: 0, stackpage=c1ef3000)
> > > >> Stack: 00000000 00000002 022c1008 c1eeee4c c1eff274 00000000
> > 00000000
> > > >> a0fb8488
> > > >> c17c4520 f58903f4 00000000 c1efd764 c1eee5fc f7fe53c4
> > 00030001
> > > >> 00000000
> > > >> 00000002 022c100c c1efd780 c1eeba44 00000000 00000000
> > 00000003
> > > >> c1b968ec
> > > >> Call Trace: [<c011f640>] do_page_fault [kernel] 0x0
> (0xc1ef4178)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef419c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef41b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4278)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef429c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef42b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4378)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef439c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef43b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4478)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef449c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef44b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4578)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef459c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef45b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4678)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef469c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef46b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4778)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef479c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef47b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4878)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef489c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef48b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4978)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef499c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef49b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4a78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4a9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4ab4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4b78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4b9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4bb4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4c78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4c9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4cb4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4d78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4d9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4db4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4e78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4e9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4eb4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4f78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4f9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4fb4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5078)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef509c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef50b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5178)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef519c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef51b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5278)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef529c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef52b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5378)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef539c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef53b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5478)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef549c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef54b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5578)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef559c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef55b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5678)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef569c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef56b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5778)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef579c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef57b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5878)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef589c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef58b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5978)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef599c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef59b4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5a78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5a9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5ab4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5b78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5b9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5bb4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5c78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5c9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5cb4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5d78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5d9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5db4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5e78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5e9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5eb4)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5f78)
> > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5f9c)
> > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5fb4)
> > > >>
> > > >> Code: 8b 82 88 c4 47 c0 8b ba 84 c4 47 c0 01 f8 85 c0 0f 85 46 01
> > > >>
> > > >> Kernel panic: Fatal exception
> > > >>
> > > >> Any ideas?
> > > >> Thanks.
> > > >> -Mark
> > > >>
> > > >>
--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list