Re: [Scst-devel] Poisoning of Linux initiators on SCST reboot.

greg@xxxxxxxxxxxx wrote:
Good morning to everyone, hope your respective days are going well.
Sorry for the wide cast on this but I wanted to get what would seem to
be the concerned parties on this issue in the loop.

We have been putting SCST through an extensive round of pre-production
testing.  I wanted to start following up on some of the issues we have
noted.

We will be putting SCST into service to support mirrored storage from
client initiators to two separate data-centers.  The filesystems on
the client initiators access storage at the two data-centers via a
Linux MD RAID1 device.  The SAN architecture is based on Cisco
MDS-9509 switches.

Just as an aside for people considering use of SCST.  The core engine
has been rock solid.  Our testing rounds consist of driving around
1/8 of a petabyte of widely disparate I/O types from multiple
initiators to a pair of targets in the two data-centers.  SCST hasn't
missed a beat so far, so kudos to Vlad and everyone involved in its
development.

Thanks! Your support is very much appreciated and exactly on time, because I'm going to submit SCST patches for review and inclusion into the kernel next week.

As we began forced failure testing one issue has come up that I wanted
to advise people of.  A hard reboot of an SCST target server results
in the 'poisoning' of Linux based initiators.  We verified the issue
as being present on client initiators running the stock RHEL5 kernel
up through 2.6.26.

The targets are using Qlogic 2462 cards using the isp_mod driver.  The
client initiators are using Qlogic 2342 cards with the qla2xxx driver.

Failure mode is as follows:

        1.) Configure SCST based storage for an initiator (vdisk
            based).

        2.) Activate initiator.  Initiator logs into fabric and
            discovers SCST based storage.

        3.) Force SCST target failure by rebooting or pulling power.

        4.) SCST target returns to service and logs into zone.

        5.) Initiator picks up RSCN but re-activates the rport for
            SCST server as an INITIATOR rather than TARGET role.

After this point in time the initiator is effectively 'poisoned'.
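
One way to confirm step 5 on a live initiator is to look at the role
the FC transport class has recorded for the remote port.  The sketch
below is only illustrative: it assumes the rport is exposed under
/sys/class/fc_remote_ports/ with a "roles" attribute that prints
strings such as "FCP Target" or "FCP Initiator", which should be
checked against the kernel actually in use.  The function names are
hypothetical helpers, not part of any driver.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if a sysfs "roles" string lists the FCP target role. */
int roles_has_fcp_target(const char *roles)
{
    return roles != NULL && strstr(roles, "FCP Target") != NULL;
}

/*
 * Read the roles attribute of one remote port, for example
 * /sys/class/fc_remote_ports/rport-2:0-0/roles (assumed path;
 * substitute the rport name from your own logs).
 * Returns 1 for target, 0 for non-target, -1 if unreadable.
 */
int rport_is_target(const char *roles_path)
{
    char buf[128];
    FILE *f = fopen(roles_path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) == NULL)
        buf[0] = '\0';
    fclose(f);
    return roles_has_fcp_target(buf);
}
```

On a poisoned initiator, rport_is_target() would be expected to
return 0 for the SCST server's port even though the target is back
in service.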

Nothing short of unloading and reloading the Qlogic 2xxx driver on the
client initiator will allow the initiator to recognize the SCST server
as a target device.  A driver unload/reload of course is not an option
to restore connectivity since it would take the remaining live side of
the mirror off-line as well.

We finally figured out what seems to be happening by watching the logs
on the client and comparing what was going on there to the FLOGI login
status on the fabric.

When the SCST target server reboots the initiator times out the remote
port and places it into 'unknown' state.  The qla2xxx driver,
according to the source code, maintains the previous rport state in
driver internal data.

The 2462 card in the target on boot logs into the fabric with an
initiator role, I'm assuming in support of BIOS based SAN booting. The
client initiator picks up on this and re-activates the rport as being
in an INITIATOR role.

You should be able to switch off this behavior by disabling SAN booting in the card's BIOS.

Loading the isp_mod driver causes the 2462 card in the target to be
shut down.  The client initiator picks up on this and times out the
rport retaining the last rport state as INITIATOR.

Enabling target mode on the 2462 causes it to log back into the
fabric.  The client initiator picks up on the RSCN but refuses to
transition the rport from INITIATOR to TARGET state.  Without going
into TARGET state the remote port won't have SCSI device discovery
initiated against it and hence the SCST based storage is inaccessible.
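
The sequence above can be replayed as a toy state model.  This is not
the qla2xxx logic; it only encodes the hypothesis from this thread,
namely that a role already recorded for a remote port is not replaced
when the port logs in again.  The struct and function names are
hypothetical, the role bit values are what I recall from
scsi_transport_fc.h, and the model starts after the initial timeout
has left the role unknown.

```c
#include <stdbool.h>

/* Role bits, assumed to match the FC transport class header. */
#define FC_PORT_ROLE_UNKNOWN        0x00
#define FC_PORT_ROLE_FCP_TARGET     0x01
#define FC_PORT_ROLE_FCP_INITIATOR  0x02

/* Hypothetical initiator-side record of one remote port. */
struct cached_rport {
    bool     online;
    unsigned roles;   /* last role recorded for this port */
};

/* Remote port vanished: mark it offline but keep the old role. */
void port_lost(struct cached_rport *rp)
{
    rp->online = false;
}

/*
 * Remote port logged in again (RSCN seen).  Models the suspected
 * behaviour: a role that is already recorded is NOT replaced by
 * the newly advertised one.
 */
void port_seen(struct cached_rport *rp, unsigned advertised)
{
    rp->online = true;
    if (rp->roles == FC_PORT_ROLE_UNKNOWN)
        rp->roles = advertised;
}

/* SCSI device discovery only runs against FCP targets. */
bool discovery_runs(const struct cached_rport *rp)
{
    return rp->online && (rp->roles & FC_PORT_ROLE_FCP_TARGET);
}

/* Replay the reboot; returns true if storage is rediscovered. */
bool replay_target_reboot(void)
{
    struct cached_rport rp = { false, FC_PORT_ROLE_UNKNOWN };

    port_seen(&rp, FC_PORT_ROLE_FCP_INITIATOR); /* card BIOS logs in */
    port_lost(&rp);                             /* isp_mod loads     */
    port_seen(&rp, FC_PORT_ROLE_FCP_TARGET);    /* target mode on    */
    return discovery_runs(&rp);
}
```

Under this model replay_target_reboot() reports that discovery never
runs.  An initiator that simply overwrote the recorded role in
port_seen() would recover, which would be consistent with what the
Windows clients are observed to do.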

Activating a LIP on the client initiates a new fabric login attempt
which completes with the following message:

Jul 24 02:53:59 init-test kernel: rport-2:0-0: blocked FC remote port
time out: no longer a FCP target, removing starget

Which from a review of the source code seems consistent with our
analysis of the problem.
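
For anyone wanting to reproduce the LIP step, a minimal sketch
follows.  It assumes the initiator's HBA appears under
/sys/class/fc_host/ and that the host exposes a writable "issue_lip"
attribute; the host number is an example (the log above suggests
host 2 on our test machine) and the helper names are hypothetical.
Root privileges are needed for the write.

```c
#include <stdio.h>

/* Build the sysfs path for a host's issue_lip attribute. */
int lip_path(unsigned host_no, char *buf, size_t len)
{
    int n = snprintf(buf, len,
                     "/sys/class/fc_host/host%u/issue_lip", host_no);

    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Write "1" to issue_lip.  Returns 0 on success, -1 on error. */
int issue_lip(unsigned host_no)
{
    char path[64];
    FILE *f;
    int rc;

    if (lip_path(host_no, path, sizeof(path)) != 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    rc = (fputs("1\n", f) == EOF) ? -1 : 0;
    if (fclose(f) == EOF)   /* sysfs write errors may show up here */
        rc = -1;
    return rc;
}
```

Calling issue_lip(2) and then watching the kernel log should show
whether the "no longer a FCP target" message appears.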

The culprit is the following code from drivers/scsi/scsi_transport_fc.c:

        if ((rport->port_state == FC_PORTSTATE_ONLINE) &&
            (rport->scsi_target_id != -1) &&
            !(rport->roles & FC_PORT_ROLE_FCP_TARGET)) {
                dev_printk(KERN_ERR, &rport->dev,
                        "blocked FC remote port time out: no longer"
                        " a FCP target, removing starget\n");
                spin_unlock_irqrestore(shost->host_lock, flags);
                scsi_target_unblock(&rport->dev);
                fc_queue_work(shost, &rport->stgt_delete_work);
                return;
        }

The above gets executed in response to the LIP on the initiator.  The
value in rport->roles is being populated with the remote port's earlier
INITIATOR role rather than its current TARGET role.
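
To make the effect of the stale role concrete, the condition in the
quoted kernel code can be lifted into a small user-space predicate.
This is a sketch only: the struct holds just the three fields the
condition reads, the port state enum is a stand-in with illustrative
values, and the role bit values are assumed to match
scsi_transport_fc.h.

```c
#include <stdbool.h>

#define FC_PORT_ROLE_FCP_TARGET     0x01
#define FC_PORT_ROLE_FCP_INITIATOR  0x02

/* Stand-in for enum fc_port_state; values are illustrative. */
enum model_port_state {
    MODEL_PORTSTATE_BLOCKED,
    MODEL_PORTSTATE_ONLINE
};

/* Only the rport fields the quoted condition looks at. */
struct model_rport {
    enum model_port_state port_state;
    int      scsi_target_id;   /* -1 means no SCSI target bound */
    unsigned roles;
};

/* Same test as the quoted if-statement, minus locking and work. */
bool would_remove_starget(const struct model_rport *rp)
{
    return rp->port_state == MODEL_PORTSTATE_ONLINE &&
           rp->scsi_target_id != -1 &&
           !(rp->roles & FC_PORT_ROLE_FCP_TARGET);
}
```

With roles left at FC_PORT_ROLE_FCP_INITIATOR the predicate is true
and the starget is torn down; with FC_PORT_ROLE_FCP_TARGET set it is
false and the target would be kept.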

Windows client initiators running against the SCST targets get the
transition and login sequence correct.  When the SCST target is
re-activated after the cold boot those clients immediately re-discover
their storage while the Linux clients issue error messages about loss
of the remote target.

While all this doesn't seem to be technically a bug with SCST it
certainly is a problematic usage scenario.  It may also explain why
some individuals may have had problems getting SCST clients to access
their storage.

If a test SCST server were plugged into an active zone and turned on,
it would immediately poison any Linux clients.  No amount of proper
configuration on the target would allow the client to access storage
until the client was rebooted or its drivers reloaded.

Any suggestions on how to move forward would be appreciated.  We've
got a pretty extensive test environment and would be happy to test run
any suggested changes or patches.

I've also seen many times how the Linux Qlogic qla2xxx driver "lost" remote ports. But that was from the target side, and I wasn't able to figure out the exact test case for it. Plus, we found a workaround suitable for the target: using the INITIATOR PORT NAME field in the ATIO IOCB for the lost ports.

So the qla2xxx driver definitely has problem(s) in this area. The fact that Windows works well in this scenario only further proves that. But I'm afraid the only way to deal with it is to fix the qla2xxx driver itself. In my experience of contacting Andrew Vasquez, the driver's maintainer, you need something more valuable than problems with some home-brewed target to get him interested. Otherwise your questions will simply be ignored.

Once again a thank you to everyone who has contributed to SCST
development.  Other than this and a few additional glitches I will
follow up with via additional e-mails it is presenting itself as a
very solid platform for storage delivery.

Best wishes for a pleasant weekend to everyone.

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@xxxxxxxxxxxx
------------------------------------------------------------------------------
"Much work remains to be done before we can announce our total failure
 to make any progress."
                                -- Mike Kelly
