Re: qla2xx command abort regression

Hannes Reinecke <hare@xxxxxxx> · Fri, 01 Oct 2010 14:14:53 +0200

Hannes Reinecke wrote:
> Hi Andrew,
> 
> there is a regression in the qla2xxx driver, introduced by this commit:
> 
> commit 083a469db4ecf3b286a96b5b722c37fc1affe0be
> Author: Giridhar Malavali <giridhar.malavali@xxxxxxxxxx>
> Date:   Fri May 28 15:08:18 2010 -0700
> 
>     [SCSI] qla2xxx: Correct use-after-free oops seen during EH-abort.
> 
>     Hold a reference to the srb (sp) while aborting an I/O -- as the
>     I/O can/will complete from within the interrupt-context.
> 
>     Signed-off-by: Andrew Vasquez <andrew.vasquez@xxxxxxxxxx>
>     Signed-off-by: Giridhar Malavali <giridhar.malavali@xxxxxxxxxx>
>     Signed-off-by: James Bottomley <James.Bottomley@xxxxxxx>
> 
> With this patch a reference counting is introduced for srb's.
> However, there is this code in qla2xxx_eh_abort():
> 
> 	spin_unlock_irqrestore(&ha->hardware_lock, flags);
> 
> 	/* Wait for the command to be returned. */
> 	if (wait) {
> 		if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) {
> 			qla_printk(KERN_ERR, ha,
> 			    "scsi(%ld:%d:%d): Abort handler timed out -- %lx "
> 			    "%x.\n", vha->host_no, id, lun, serial, ret);
> 			ret = FAILED;
> 		}
> 	}
> 
> 	if (got_ref)
> 		qla2x00_sp_compl(ha, sp);
> 
> 
> where qla2x00_eh_wait_on_command() is waiting for a command to be
> completed by the midlayer. Which will never happen, as the refcount
> is held during that time and only released on the last lines.
> Hence any command abort will be timed out and the error will be
> escalated further.
> 
> I have fixed it by simply moving the last two lines above the
> 'if (wait)' condition. however I fail to see the race condition
> mentioned, and hence the validity of the reference counting in the
> first place.
> So it might be that I'm missing something subtle here, so I would
> ask you to have a look here.
> 
Actually, I found the race condition; it's here
(mentioned commit removed):

		DEBUG2(printk("%s(%ld): aborting sp %p from RISC."
		" pid=%ld.\n", __func__, vha->host_no, sp, serial));

		spin_unlock_irqrestore(&ha->hardware_lock, flags);
		if (ha->isp_ops->abort_command(sp)) {
			DEBUG2(printk("%s(%ld): abort_command "
			"mbx failed.\n", __func__, vha->host_no));
			ret = FAILED;

So the command referenced by 'sp' might've been completed after
the unlock, in which case ->abort_command() will be accessing an
invalid 'sp' point. Ok. But then my fix is indeed valid.

I'll be sending a proper patch.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html