Re: [PATCH] scsi: qla2xxx: Keep initiator ports after RSCN

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, May 19, 2020 at 10:46:54AM +0200, Martin Wilck wrote:
> On Mon, 2020-05-18 at 21:31 +0300, Roman Bolshakov wrote:
> > The driver performs SCR (state change registration) in all modes
> > including pure target mode.
> > 
> > For each RSCN, scan_needed flag is set in qla2x00_handle_rscn() for
> > the
> > port mentioned in the RSCN and fabric rescan is scheduled. During the
> > rescan, GNN_FT handler, qla24xx_async_gnnft_done() deletes session of
> > the port that caused the RSCN.
> > 
> > In target mode, the session deletion has an impact on ATIO handler,
> > qlt_24xx_atio_pkt(). Target responds with SAM STATUS BUSY to I/O
> > incoming from the deleted session. qlt_handle_cmd_for_atio() and
> > qlt_handle_task_mgmt() return -EFAULT if they are not able to find
> > session of the command/TMF, and that results in invocation of
> > qlt_send_busy():
> > 
> >   qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 0014
> >   qla_target(0): Unable to send command to target, sending BUSY
> > status
> > 
> > Such response causes command timeout on the initiator. Error handler
> > thread on the initiator will be spawned to abort the commands:
> > 
> >   scsi 23:0:0:0: tag#0 abort scheduled
> >   scsi 23:0:0:0: tag#0 aborting command
> >   qla2xxx [0000:af:00.0]-188c:23: Entered qla24xx_abort_command.
> >   qla2xxx [0000:af:00.0]-801c:23: Abort command issued nexus=23:0:0
> > -- 0 2003.
> > 
> > Command abort is rejected by target and fails (2003), error handler
> > then
> > tries to perform DEVICE RESET and TARGET RESET but they're also
> > doomed
> > to fail because TMFs are ignored for the deleted sessions.
> > 
> > Then initiator makes BUS RESET that resets the link via
> > qla2x00_full_login_lip(). BUS RESET succeeds and brings initiator
> > port
> > up, SAN switch detects that and sends RSCN to the target port and it
> > fails again the same way as described above. It never goes out of the
> > loop.
> > 
> > The change breaks the RSCN loop by keeping initiator sessions
> > mentioned
> > in RSCN payload in all modes, including dual and pure target mode.
> > 
> > Fixes: 2037ce49d30a ("scsi: qla2xxx: Fix stale session")
> > Cc: Quinn Tran <qutran@xxxxxxxxxxx>
> > Cc: Arun Easi <aeasi@xxxxxxxxxxx>
> > Cc: Nilesh Javali <njavali@xxxxxxxxxxx>
> > Cc: Bart Van Assche <bvanassche@xxxxxxx>
> > Cc: Daniel Wagner <dwagner@xxxxxxx>
> > Cc: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx>
> > Cc: Martin Wilck <mwilck@xxxxxxxx>
> > Cc: stable@xxxxxxxxxxxxxxx # v5.4+
> > Signed-off-by: Roman Bolshakov <r.bolshakov@xxxxxxxxx>
> > ---
> >  drivers/scsi/qla2xxx/qla_gs.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > Hi Martin,
> > 
> > Please apply the patch to scsi-fixes/5.7 at your earliest
> > convenience.
> > 
> > qla2xxx in target and, likely, dual mode is unusable in some SAN
> > fabrics
> > due to the bug.
> > 
> > Thanks,
> > Roman
> > 
> > diff --git a/drivers/scsi/qla2xxx/qla_gs.c
> > b/drivers/scsi/qla2xxx/qla_gs.c
> > index 42c3ad27f1cb..b9955af5cffe 100644
> > --- a/drivers/scsi/qla2xxx/qla_gs.c
> > +++ b/drivers/scsi/qla2xxx/qla_gs.c
> > @@ -3495,8 +3495,10 @@ void qla24xx_async_gnnft_done(scsi_qla_host_t
> > *vha, srb_t *sp)
> >  			if ((fcport->flags & FCF_FABRIC_DEVICE) == 0) {
> >  				qla2x00_clear_loop_id(fcport);
> >  				fcport->flags |= FCF_FABRIC_DEVICE;
> > -			} else if (fcport->d_id.b24 != rp->id.b24 ||
> > -				fcport->scan_needed) {
> > +			} else if ((fcport->d_id.b24 != rp->id.b24 ||
> > +				    fcport->scan_needed) &&
> > +				   (fcport->port_type != FCT_INITIATOR
> > &&
> > +				    fcport->port_type !=
> > FCT_NVME_INITIATOR)) {
> >  				qlt_schedule_sess_for_deletion(fcport);
> >  			}
> >  			fcport->d_id.b24 = rp->id.b24;
> 
> Hi Roman,
> 
> what if the session does need to be deleted eventually?

Hi Martin,

I'm sorry for the delay, it took a while to complete the answer :)

In the other reply to the patch I sent a proposal for v2. We might go
with it to handle N_Port_ID changes. N_Port_ID may change in the
switched fabric topology if initiator cable is replugged to another
physical port on the SAN switch (some fabrics assign physical port
number to domain area). Physical reconnection means initiator is going
to relogin anyway and previous session is no longer needed.

Third variant would be to check non-NULL TCM session on the fcport instead of
checking against fc_port type but that will be less reliable if qla2xxx_nvmet
is added to the mainline, i.e.:

     } else if (fcport->d_id.b24 != rp->id.b24 ||
                (fcport->scan_needed &&
                 !fcport->se_sess)) {
             qlt_schedule_sess_for_deletion(fcport);
     }

Fourth variant would be to disable SCR in pure target mode as Firmware
Interface Specification suggests. Although, the issue would still exit
in dual mode. Proposal #2 and #3 fix the issue in target and dual modes.

> E.g. if after the RSCN the connection to the initiator is actually
> lost, either temporarily or for good? Would the session be deleted in
> some other code path, or would it just continue to lurk around?
> 

RSCN per-se shouldn't be a reason to delete initiator IMO, since it's
just a notification that state of the entity in the fabric has changed.
Nx_Port should figure out what changed and then decide if it's worth to
keep session of the port.

There are multiple cases in v5.7 when session deletion happens. I tried
to cover them below in cscope order and inspected the code around.
Pardon me for mistakes if I overlooked something:

1.
2.
3.  GPN_ID request may be issued to find if something has changed with a
    port after an RSCN.

    I don't see if it's called anywhere though. There's an request of
    GPN_ID inside GPN_ID response handling but it's never called
    outside. So it looks a kind of dead code:

    qla24xx_post_gpnid_work() sends QLA_EVT_GPNID ->
    qla2x00_do_work() dispatches the event to qla24xx_async_gpnid() ->
    qla24xx_async_gpnid() sends IOCB with GPN_ID to fimrware and sets up
    call back to qla2x00_async_gpnid_sp_done() ->
    qla2x00_async_gpnid_sp_done() calls qla24xx_post_gpnid_work() if the
    IOCB timed out or RSCN arrived in the middle of the request.


    If GPN_ID request fails, all session on the host are deleted in
    qla24xx_handle_gpnid_event(). It's kind of weird that it deletes
    everything. May be it was expected to look like this to delete only
    the port that was requested in the GPN_ID request (ea->id.b24):

      if (ea->rc) {
              /* cable is disconnected */
              list_for_each_entry_safe(fcport, t, &vha->vp_fcports, list) {
                      if (fcport->d_id.b24 == ea->id.b24) {
                              fcport->scan_state = QLA_FCPORT_SCAN;

         		      qlt_schedule_sess_for_deletion(fcport);
         	     }
              }
      } else {

    Instead of (5.7/scsi-fixes):

      if (ea->rc) {
              /* cable is disconnected */
              list_for_each_entry_safe(fcport, t, &vha->vp_fcports, list) {
                      if (fcport->d_id.b24 == ea->id.b24)
                              fcport->scan_state = QLA_FCPORT_SCAN;

                      qlt_schedule_sess_for_deletion(fcport);
              }
      } else {


    If GPN_ID request suceeds, and there's a session for the port with
    the same N_Port_Name, it deletes all sessions on the host in
    qla24xx_handle_gpnid_event(). It looks like too heavy hammer IMO,
    maybe it was expected to behave like this to replace only session
    session with the same N_Port_Name (including deleted sessions):

      list_for_each_entry_safe(conflict, t, &vha->vp_fcports,
          list) {
              if ((conflict->d_id.b24 == ea->id.b24) &&
                  (fcport != conflict)) {
                      /*
                       * 2 fcports with conflict Nport ID or
                       * an existing fcport is having nport ID
                       * conflict with new fcport.
                       */

                      conflict->scan_state = QLA_FCPORT_SCAN;

         	      qlt_schedule_sess_for_deletion(conflict);
              }
      }

    Instead of (5.7/scsi-fixes):

      list_for_each_entry_safe(conflict, t, &vha->vp_fcports,
          list) {
              if ((conflict->d_id.b24 == ea->id.b24) &&
                  (fcport != conflict))
                      /*
                       * 2 fcports with conflict Nport ID or
                       * an existing fcport is having nport ID
                       * conflict with new fcport.
                       */

                      conflict->scan_state = QLA_FCPORT_SCAN;

              qlt_schedule_sess_for_deletion(conflict);
      }

    If GPN_ID request succeeds, no sessions were were found for the
    N_Port_Name but there's a session with conflicting N_Port_ID, it gets
    deleted in qla24xx_handle_gpnid_event().

4.
5.  Session is deleted during fabric rescan in qla24xx_async_gnnft_done()
    if RSCN had N_Port_ID of the session (**The patch aims to address
    the issue**) or N_Port_ID was changed in GPN_FT reply.

    Sessions missing in GPN_FT response are deleted after fabric
    discovery in qla24xx_async_gnnft_done() if driver is running in pure
    initiator or dual mode. So, if zone definition changed or some ports
    logged out, sessions of the ports that somehow disappeared from the
    zone are going to be deleted.

6.
7.  Session is deleted if ADISC to an N_Port fails
    qla24xx_handle_adisc_event(). It's also going to be deleted if
    GPN_ID was done for the port or RSCN for the port arrived during ADISC.

    I'm not sure but it may cause session loss for the target
    mode and then I/O timeout on the attached initiator.

8.
9.  RSCN or fabric rescan during GNL will trigger session deletion in
    qla24xx_handle_gnl_done_event().

10.
11. GPDB handling during login will delete session in
    qla24xx_handle_gpdb_event() if RSCN has happened or GPN_ID was made
    in the middle of the GPDB for the port. If ports are detected by
    GPDB as logged out or logout is pending, their session is also
    deleted.

12. Session is going to be deleted in P2P mode if a spare N_Port handle
    can't be found in qla_chk_n2n_b4_login().

13. Failed PLOGI in all topologies but P2P is going to trigger session
    deletion in qla24xx_fcport_handle_login().

14. Session is deleted after an unsuccessful PRLI in
    qla24xx_handle_prli_done_event(). I wonder if dual FCP/FC-NVMe
    sessions are impacted by that.

15.
16. RSCN during PLOGI or duplicated N_Port handle will trigger session
    deletion in qla24xx_handle_plogi_done_event()

17. Missing target, switch, name server and NVMe sessions are deleted in
    after rescan of arbitrated loop by qla2x00_configure_local_loop() in
    dual and pure initiator mode.

    This check is similar to 5. Code duplication can be removed.

18. An RSCN during port registration is going to trigger deletion of the
    related session in qla_register_fcport_fn().

19. Missing target, switch, name server and NVMe sessions are deleted
    after sync fabric scan by qla2x00_find_all_fabric_devs() in dual and
    pure initiator mode.

    The code is similar to 5 and 17, perhaps the duplication can be
    removed.

20. Session is deleted in qla2x00_els_dcmd2_sp_done() if duplicate
    N_Port handle was detected by firmware during Login IOCB.

21. Session is deleted in initiator mode if firmware detects port logout
    in qla2x00_async_event()

22. Session is deleted in during initiator I/O in qla2x00_status_entry()
    if firmware returns a kind of logged out state.

23. Conflicting session with the same N_Port_ID is deleted in cases when
    login is performed by firmware in qla24xx_report_id_acquisition().

24. In case of link loss, all session are deleted in
    qla2x00_mark_all_devices_lost().

25. Another session with the same N_Port_ID is deleted during the
    session init in qla24xx_create_new_sess().

26. All target sessions are deleted in qlt_clear_tgt_db() as part of
    target shutdown in qlt_stop_phase1() and in case of LIP or if Port
    configuration of the target port was changed.

27. Session deletion happens in the dead code, qlt_fc_port_deleted().
    Perhaps the function can be safely removed.

28. Session is deleted for a port that issued LOGO, TPRLO and PRLO in
    qlt_xmit_tm_rsp().

29. Session is deleted in qlt_do_ctio_completion() during the I/O in
    target mode if firmware detects initiator logout and responds to the
    driver with a related error.

30.
31.
32. A PLOGI handler is going to delete stale sessions in
    qlt_find_sess_invalidate_other() if S_ID or N_Port handle of
    the PLOGI matches N_Port_ID or N_Port handle of an existing session.

33. PLOGI, PRLI will result in session deletion if the session is
    already established.

    FWIW, NVMe PRLI may destroy FCP session in case of simultaneous
    FCP/NVMe target, current code doesn't handle multiple process logins
    well. It shouldn't be a big deal since qla2xxx_nvmet hasn't been
    merged to mainline yet.

34. TPRLO, PRLO, LOGO result in session deletion in qlt_24xx_handle_els()
    After inspecting the code I think, only LOGO should fully delete the
    session.

    In the case of mixed FC-NVMe/FCP initiator/target, PRLO/TPRLO
    to/from SCSI target may break sessions of FC-NVMe, likewise FC-NVMe
    PRLO/TPRLO may break FCP sessions.

    PRLO and TPRLO (which is like PRLO from all ports) should only
    downgrade session of the requested FC-4 type to a kind of "process
    (i.e. FCP/FC-NVMe) logged out" state rather than delete it. TPRLO
    should log out all sessions of the FC-4 type on the target port,
    while PRLO should do it only for the originator of the ELS request.


> E.g. if after the RSCN the connection to the initiator is actually
> lost, either temporarily or for good? Would the session be deleted in
> some other code path, or would it just continue to lurk around?
> 

So, as a summary, it seems that the existing code mostly keeps sessions
until:
 - a conflict of N_Port_ID, N_Port_Name or N_Port handle detected;
 - the session is in the middle of refresh/rescan and RSCN arrives;
 - there's an explicit process or port logout from the session, i.e.
   PRLO, TPRLO, LOGO
 - there's a new port or process login from the same session, i.e.
   PLOGI, PRLI;
 - target is shut down;
 - target port link is reset;

And if a session of an initiator is deleted when driver works in target
mode, new session won't be established until a PLOGI and PRLI come from
the initiator.

The assumption is used qlt_handle_cmd_for_atio() and
qlt_handle_task_mgmt() to discard commands and TMFs from initiators that
are not logged in.

Thanks,
Roman



[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux