Re: [RFC] qla2xxx: Add dev_loss_tmo kernel module options

Benjamin Block <bblock@xxxxxxxxxxxxx> · Tue, 20 Apr 2021 19:27:00 +0200

On Mon, Apr 19, 2021 at 12:00:14PM +0200, Daniel Wagner wrote:
> Allow to set the default dev_loss_tmo value as kernel module option.
> 
> Cc: Nilesh Javali <njavali@xxxxxxxxxxx>
> Cc: Arun Easi <aeasi@xxxxxxxxxxx>
> Signed-off-by: Daniel Wagner <dwagner@xxxxxxx>
> ---
> Hi,
> 
> During array upgrade tests with NVMe/FC on systems equiped with QLogic
> HBAs we faced the problem with the default setting of dev_loss_tmo.
> 
> When the default timeout hit after 60 seconds the file system went
> into read only mode. The fix was to set the dev_loss_tmo to infinity
> (note this patch can't handle this).
> 
> For lpfc devices we could use the sysfs interface under
> fc_remote_ports which exposed the dev_loss_tmo for SCSI and NVMe
> rports.
> 
> The QLogic only expose the rports via fc_remote_ports if SCSI is used.
> There is the debugfs interface to set the dev_loss_tmo but this has
> two issues. First, it's not watched by udevd hence no rules work. This
> could be somehow worked around by setting it statically, but that is
> really only an option for testing. Even if the debugfs interface is
> used there is a bug in the code. In qla_nvme_register_remote() the
> value 0 is assigned to dev_loss_tmo and the NVMe core will use it's
> default value 60 (this code path is exercised if the rport droppes
> twice).
> 
> Anyway, this patch is just to get the discussion going. Maybe the
> driver could implement the fc_remote_port interface? Hannes was
> pointing out it might make sense to think about an controller sysfs
> API as there is already a host and the NVMe protocol is all about host
> and controller.
> 
> Thanks,
> Daniel
> 
>  drivers/scsi/qla2xxx/qla_attr.c | 4 ++--
>  drivers/scsi/qla2xxx/qla_gbl.h  | 1 +
>  drivers/scsi/qla2xxx/qla_nvme.c | 2 +-
>  drivers/scsi/qla2xxx/qla_os.c   | 5 +++++
>  4 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_attr.c b/drivers/scsi/qla2xxx/qla_attr.c
> index 3aa9869f6fae..0d2386ba65c0 100644
> --- a/drivers/scsi/qla2xxx/qla_attr.c
> +++ b/drivers/scsi/qla2xxx/qla_attr.c
> @@ -3036,7 +3036,7 @@ qla24xx_vport_create(struct fc_vport *fc_vport, bool disable)
>  	}
>  
>  	/* initialize attributes */
> -	fc_host_dev_loss_tmo(vha->host) = ha->port_down_retry_count;
> +	fc_host_dev_loss_tmo(vha->host) = ql2xdev_loss_tmo;
>  	fc_host_node_name(vha->host) = wwn_to_u64(vha->node_name);
>  	fc_host_port_name(vha->host) = wwn_to_u64(vha->port_name);
>  	fc_host_supported_classes(vha->host) =
> @@ -3260,7 +3260,7 @@ qla2x00_init_host_attr(scsi_qla_host_t *vha)
>  	struct qla_hw_data *ha = vha->hw;
>  	u32 speeds = FC_PORTSPEED_UNKNOWN;
>  
> -	fc_host_dev_loss_tmo(vha->host) = ha->port_down_retry_count;
> +	fc_host_dev_loss_tmo(vha->host) = ql2xdev_loss_tmo;
>  	fc_host_node_name(vha->host) = wwn_to_u64(vha->node_name);
>  	fc_host_port_name(vha->host) = wwn_to_u64(vha->port_name);
>  	fc_host_supported_classes(vha->host) = ha->base_qpair->enable_class_2 ?
> diff --git a/drivers/scsi/qla2xxx/qla_gbl.h b/drivers/scsi/qla2xxx/qla_gbl.h
> index fae5cae6f0a8..0b9c24475711 100644
> --- a/drivers/scsi/qla2xxx/qla_gbl.h
> +++ b/drivers/scsi/qla2xxx/qla_gbl.h
> @@ -178,6 +178,7 @@ extern int ql2xdifbundlinginternalbuffers;
>  extern int ql2xfulldump_on_mpifail;
>  extern int ql2xenforce_iocb_limit;
>  extern int ql2xabts_wait_nvme;
> +extern int ql2xdev_loss_tmo;
>  
>  extern int qla2x00_loop_reset(scsi_qla_host_t *);
>  extern void qla2x00_abort_all_cmds(scsi_qla_host_t *, int);
> diff --git a/drivers/scsi/qla2xxx/qla_nvme.c b/drivers/scsi/qla2xxx/qla_nvme.c
> index 0cacb667a88b..cdc5b5075407 100644
> --- a/drivers/scsi/qla2xxx/qla_nvme.c
> +++ b/drivers/scsi/qla2xxx/qla_nvme.c
> @@ -41,7 +41,7 @@ int qla_nvme_register_remote(struct scsi_qla_host *vha, struct fc_port *fcport)
>  	req.port_name = wwn_to_u64(fcport->port_name);
>  	req.node_name = wwn_to_u64(fcport->node_name);
>  	req.port_role = 0;
> -	req.dev_loss_tmo = 0;
> +	req.dev_loss_tmo = ql2xdev_loss_tmo;
>  
>  	if (fcport->nvme_prli_service_param & NVME_PRLI_SP_INITIATOR)
>  		req.port_role = FC_PORT_ROLE_NVME_INITIATOR;
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index d74c32f84ef5..c686522ff64e 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -338,6 +338,11 @@ static void qla2x00_free_device(scsi_qla_host_t *);
>  static int qla2xxx_map_queues(struct Scsi_Host *shost);
>  static void qla2x00_destroy_deferred_work(struct qla_hw_data *);
>  
> +int ql2xdev_loss_tmo = 60;
> +module_param(ql2xdev_loss_tmo, int, 0444);
> +MODULE_PARM_DESC(ql2xdev_loss_tmo,
> +		"Time to wait for device to recover before reporting\n"
> +		"an error. Default is 60 seconds\n");

Wouldn't that be really really confusing, if you set essentially the
same thing with two different knobs for one FC HBA? We already have
a `dev_loss_tmo` kernel parameter - granted, only for scsi_transport_fc;
but doesn't qla implement that as well?

I don't really have any horses in this race here, but that sounds
strange.

-- 
Best Regards, Benjamin Block  / Linux on IBM Z Kernel Development / IBM Systems
IBM Deutschland Research & Development GmbH    /    https://www.ibm.com/privacy
Vorsitz. AufsR.: Gregor Pillen         /        Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294