Re: [Scst-devel] New qla2x00tgt Driver Question

Duane Grigsby <duane.grigsby@xxxxxxxxxx> · Sun, 30 Nov 2014 16:00:15 +0000



On 11/26/14, 10:10 AM, "Dr. Greg Wettstein" <greg@xxxxxxxxxxxxxxxxx> wrote:

>On Nov 19, 11:54am, Marc Smith wrote:
>} Subject: [Scst-devel] New qla2x00tgt Driver Question
>
>> Hi,
>
>Hi, I hope the day continues to go well for everyone.
>
>It is -15C here with a -26C windchill so Izzy, my golden retriever, and
>I are answering e-mails rather than putting up outdoor Christmas
>decorations.
>
>> I've noticed a change in behavior with the new qla2x00tgt
>> driver. With the previous version of this driver, sessions to Fibre
>> Channel switches were deleted, but now they seem to stay around
>> indefinitely (eg, they show up in the list of sessions for a target
>> in SCST).
>
>It seems to be a generic problem with the Qlogic target driver
>architecture.  There really is only one target driver, but
>unfortunately for everyone involved, multiple variants.
>
>I'm including linux-scsi on this since this is as much a problem for
>the mainline kernel storage infrastructure as it is for SCST.
>
>I believe I may have mentioned before that the entire storage
>community, which is dependent on Qlogic hardware, needs to rally
>around a single version of this target architecture... :-)(
Yes, the "exclusive" option is not really operating in exclusive target
mode.  The driver transitions to target mode from initiator mode, so
there's a transition period. One way around this is to use the laser
control attribute: we can turn the laser OFF during initialization and turn
it back ON after the target is up and running.  We will look into this change
and see what can be done.


>
>> I've also noticed this behavior for remote target interfaces if I zone
>> multiple SCST targets together (eg, target interfaces from multiple SCST
>> hosts), and I load the "qla2xxx_scst" module first which puts it into
>> initiator mode initially, and then load qla2x00t which changes it to
>> target mode (qlini_mode=exclusive). One of the SCST hosts sees the port
>> initially in initiator mode, which establishes a session. When the
>> remote SCST host is then switched to target mode as initialization
>> progresses, the dead initiator session sticks around in the list of
>> sessions for that target. It's never removed even though the SCST target
>> interface is now only in "target mode".
>> 
>> It's really not hurting anything, but I was curious if this is expected
>> behavior, or is it a new/changed option that needs to be set for the
>> qla2xxx_scst module?
>
>I'm not totally convinced that this is harmless behavior given our
>work on Qlogic/Cisco fibre-channel task management stability, see my
>most recent post today.
Interesting, I haven't seen that posting. Did you post it to scst-devel or
linux-scsi?

>
>> I see another similar post that may or may not be related:
>> http://sourceforge.net/p/scst/mailman/message/32823494/
>> 
>> That user refers to an error message I do not see in my logs, but some of
>> the other messages in the log are similar to what's below... probably
>> because those messages are common. Figured it was worth mentioning in case
>> they are related.
>> 
>> Linux: 3.14.19 (SCST patches applied)
>> SCST: trunk/r5854
>> qla2x00t_git: 8.07.00.16.Trunk-SCST.15-k (from qla_version.h)
>> 
>> I believe this is the relevant section of logs for session
>> "2f:fc:00:c0:dd:0d:b3:6d" (a QLogic Fibre Channel switch):
>> --snip--
>> Nov 19 11:10:29 cloverleaf-1a kernel: [ 3386.978104] qla2xxx
>> [0000:05:00.0]-500a:9: LOOP UP detected (4 Gbps).
>> Nov 19 11:10:29 cloverleaf-1a kernel: [ 3386.978120] [0]:
>> q2t_async_event:6406:qla2x00t(9): Loop up occured
>> Nov 19 11:10:29 cloverleaf-1a kernel: [ 3386.978171] [0]:
>> q2t_async_event:6430:qla2x00t(9): Port update async event 0x8014 occured
>> Nov 19 11:10:29 cloverleaf-1a kernel: [ 3386.978188] [0]:
>> q24_handle_els:4893:qla2x00t(9): ELS opcode 3 hndl 0x0 ox_id=0x2a69
>> Nov 19 11:10:29 cloverleaf-1a kernel: [ 3386.978194] [0]:
>> q24_handle_els:4903:ELS PLOGI rcv 0x2ffc00c0dd0db36d portid=fffc04 hndl
>> 0x0
>> Nov 19 11:10:29 cloverleaf-1a kernel: [ 3386.978201] [0]:
>> q2t_sched_sess_work:1795:Scheduling work (type 5, prm ffff880229c923c0)
>> to find session for param ffff8800a2d1c2c0 (size 64, tgt
>> ffff88011dea41b8)
Yes, it may be related. It has to do with receiving a PLOGI from an
initiator with the same port_id, but a different WWPN. Can you send me
your complete driver log?
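
To make the condition concrete, here is a minimal sketch of the check
described above: a PLOGI arriving for a port_id that already has a session
under a different WWPN. It is illustrative only and is not taken from the
qla2x00t source. The SessionTable class, the handle_plogi method, and the
policy of replacing the old entry are all assumptions made for the example.

```python
class SessionTable:
    """Hypothetical session table keyed by 24-bit FC port_id."""

    def __init__(self):
        # Maps port_id -> WWPN of the session currently bound to it.
        self._by_port_id = {}

    def handle_plogi(self, port_id, wwpn):
        """Classify an incoming PLOGI against the existing sessions."""
        existing = self._by_port_id.get(port_id)
        if existing is None:
            # No session for this port_id yet.
            self._by_port_id[port_id] = wwpn
            return "new"
        if existing == wwpn:
            # Same port_id and same WWPN: a plain re-login.
            return "relogin"
        # Same port_id, different WWPN: the old session is stale.
        # Replacing it here is an assumed policy, not the driver's.
        self._by_port_id[port_id] = wwpn
        return "conflict"


table = SessionTable()
print(table.handle_plogi(0xFFFC04, 0x2FFC00C0DD0DB36D))  # first login
print(table.handle_plogi(0xFFFC04, 0x2FFC00C0DD0DB36D))  # re-login
print(table.handle_plogi(0xFFFC04, 0x2FFC00C0DD0DB36E))  # different WWPN
```

The port_id and first WWPN are the values from the log above. The second
WWPN is made up to trigger the conflict path.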
 
>
>I was very interested in seeing your mail Marc, since it confirms
>issues we have been tracking and documenting for some time.
>
>What appears to be happening is that the target driver, regardless of
>pedigree, is establishing sessions for the Fibre-Channel Name Services
>(FCNS) directory servers which are running on each switch.  You can
>find the clue in your following debug line from above:
>
>Nov 19 11:10:29 cloverleaf-1a kernel: [ 3386.978194] [0]:
>q24_handle_els:4903:ELS PLOGI rcv 0x2ffc00c0dd0db36d portid=fffc04 hndl
>0x0
>				^^^^^^^^^^^^^
>
>The first two octets, ff:fc, are the 'well-known' address of the FCNS
>name server on each switch.  The octet '04', in your case, is the
>domain id of the switch on which the nameserver is running.
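
For anyone wanting to pick these entries out of a log, the decoding
described above can be sketched as follows. This is only an illustration:
the function name decode_port_id and the returned keys are hypothetical.
It assumes nothing beyond the layout stated above, where a port ID of the
form ff:fc:xx carries the switch domain id in the low octet. As far as I
know the FC standards call the ff:fc:xx range "domain controller"
addresses, so treat the naming with some caution.

```python
def decode_port_id(port_id_hex):
    """Split a 24-bit FC port ID such as 'fffc04' or 'ff:fc:bf' into octets."""
    value = int(port_id_hex.replace(":", ""), 16)
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError("port ID must fit in 24 bits: %r" % port_id_hex)

    high = (value >> 16) & 0xFF
    mid = (value >> 8) & 0xFF
    low = value & 0xFF

    # ff:fc:xx form: the low octet is taken to be the switch domain id.
    is_switch_service = (high == 0xFF and mid == 0xFC)
    return {
        "octets": (high, mid, low),
        "is_switch_service": is_switch_service,
        "switch_domain": low if is_switch_service else None,
    }


print(decode_port_id("fffc04"))    # port ID from the PLOGI line above
print(decode_port_id("ff:fc:bf"))  # S_ID format used in the qla2xxx log
```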
>
>As I noted in my other e-mail this morning, we are running down issues
>with possible task management function (TMF) regressions in a large
>storage fabric.  We believe that occurrences of a TMF are linked to
>catastrophic VMware/VMFS corruption.
>
>Here is a snippet from our logs when we induce a NEXUS_LOSS_SESS event
>in a multi-datacenter/multi-hop Cisco Nexus FCOE fabric:
>
>Sep  3 13:58:31 fc-rp1-targ10-s kernel: qla2xxx [0000:02:00.3]-f842:9:
>Unable to find initiator with S_ID ff:fc:bf
>			   ^^^^^^^^-> NOTE
>
>Sep  3 13:58:32 fc-rp1-targ10-s kernel: qla2xxx [0000:02:00.3]-f820:9:
>Terminating work cmd ffff880818f484d0 <4>qla2xxx [0000:02:00.3]-f810:9:
>qla_target(0): ABTS: Unknown Exchange Address received
>
>Issuing the following command on one of the core 7010 switches:
>
>show fcns internal info vsan 2020
>
>Yields the following:
>
>Info for vsan 2020
>=================
>Interop mode: 0
>R_A_TOV: 10000; D_S_TOV: 5000
>Local Domain: 0x90(144)
>Remote Domains: 0xe(14) 0x45(69) 0x71(113) 0xbf(191)
>
>Which demonstrates, as you can see, that the 'bf' in our case is the
>identity of one of the switches/domains in the fabric.
>
>I omitted the log entries but we also see similar messages for the
>0x71 and 0x45 switches/domains during each event.  The 0xe switch is
>in a third data-center we have projected converged traffic to since
>these tests were run.  I'm sure we will see the 0xe switch included in
>the logs when we resume our testing.
>
>We need to add additional debug code to confirm this but what we think
>is happening is that the target driver is establishing sessions for
>each of the FCNS servers when the target port establishes its fabric
>login.  When an asynchronous event occurs on the fabric, such as a
>NEXUS_LOSS_SESS, each of the FCNS servers issues the following
>sequence of Extended Link Service (ELS) actions against the target:
>
>	PLOGI	(Port login)
>	PRLI	(Process login)
>	ADISC	(Device discovery)
>
>Which I am not convinced is well handled by any of the target driver
>variants or storage stacks which use this infrastructure.  In our
>environment the firmware/card/stack has to handle 3 x
>NUMBER_OF_SWITCHES actions on each event.
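
The "3 x NUMBER_OF_SWITCHES" figure above can be worked through as a small
sketch. The function name els_actions_per_event is hypothetical, and the
example domains are simply the three (0x45, 0x71, 0xbf) mentioned in this
message as appearing in the logs. The model assumes every switch issues
the full PLOGI/PRLI/ADISC sequence on every event, which is the hypothesis
stated above rather than a confirmed behavior.

```python
# ELS sequence each FCNS server is believed to issue per fabric event.
ELS_SEQUENCE = ("PLOGI", "PRLI", "ADISC")


def els_actions_per_event(switch_domains):
    """List the (domain, ELS) pairs a target would see for one event."""
    return [(domain, els)
            for domain in switch_domains
            for els in ELS_SEQUENCE]


actions = els_actions_per_event([0x45, 0x71, 0xBF])
print(len(actions))  # 3 switches x 3 ELS actions
for domain, els in actions:
    print("domain 0x%02x: %s" % (domain, els))
```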
>
>One of the interesting things we seem to be seeing is that only the
>Nexus 55xx ingress switches seem to be establishing these sessions.
>We have not documented the storage VDC's on the 7009/7010 core
>switches doing this.  The target server from which the log was
>generated is directly attached to a 7009 on the backbone of this
>fabric.  There is an adjacent 7010 which should, in theory, be doing
>this as well, if our analysis on the root cause of this is correct.
>
>As the log above notes we believe that all of this is possibly
>implicated in abort sequence processing issues.  We have tracked the
>issue with the 'unknown exchange address' back into the IOCB coming
>out of the ISP engine on the Qlogic 8362 FCOE cards which is about as
>far as we can take this without knowing what is going on inside the
>card.
>
>We are also suspicious, as you indicate, that this may have
>something to do with the fabric port on the target coming up in
>initiator mode and then transitioning to target mode.  Obviously the FCNS
>directory servers are involved in tracking the state of the ports.
>One of the things we are going to rigidly document is if this FCNS
>target session establishment can be avoided by only allowing the
>target port login to occur after the interface is broadcasting FCP
>target status.
>
>The open question seems to be how to handle the root cause of this
>issue which is the fact that the target code is granting sessions to
>the FCNS directory servers.  Even if it is harmless it would seem to
>be anomalous behavior which, in the storage business, is one of those
>'nothing good will come of this' type issues.
Hi Greg, yes, I think there may be an issue there, but let me look at your
logs.

>
>> Thanks,
>> 
>> Marc
>
>I hope everyone in the business is able to enjoy a pleasant holiday
>weekend, free from SMART anomalies and link drops.
>
>Did I mention the fact that all of this is way too spooky and complex
>to be using three separate variants of the same code....
>
>Greg and Izzy
>
>}-- End of excerpt from Marc Smith
>
>As always,
>Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
>4206 N. 19th Ave.           Specializing in information infra-structure
>Fargo, ND  58102            development.
>PH: 701-281-1686
>FAX: 701-281-3949           EMAIL: greg@xxxxxxxxxxxx
>------------------------------------------------------------------------------
>"Don't talk unless you can improve the silence."
>                                -- George Will
