Hannes Reinecke wrote: > Hi Andrew, > > there is a regression in the qla2xxx driver, introduced by this commit: > > commit 083a469db4ecf3b286a96b5b722c37fc1affe0be > Author: Giridhar Malavali <giridhar.malavali@xxxxxxxxxx> > Date: Fri May 28 15:08:18 2010 -0700 > > [SCSI] qla2xxx: Correct use-after-free oops seen during EH-abort. > > Hold a reference to the srb (sp) while aborting an I/O -- as the > I/O can/will complete from within the interrupt-context. > > Signed-off-by: Andrew Vasquez <andrew.vasquez@xxxxxxxxxx> > Signed-off-by: Giridhar Malavali <giridhar.malavali@xxxxxxxxxx> > Signed-off-by: James Bottomley <James.Bottomley@xxxxxxx> > > With this patch a reference counting is introduced for srb's. > However, there is this code in qla2xxx_eh_abort(): > > spin_unlock_irqrestore(&ha->hardware_lock, flags); > > /* Wait for the command to be returned. */ > if (wait) { > if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) { > qla_printk(KERN_ERR, ha, > "scsi(%ld:%d:%d): Abort handler timed out -- %lx " > "%x.\n", vha->host_no, id, lun, serial, ret); > ret = FAILED; > } > } > > if (got_ref) > qla2x00_sp_compl(ha, sp); > > > where qla2x00_eh_wait_on_command() is waiting for a command to be > completed by the midlayer. Which will never happen, as the refcount > is held during that time and only released on the last lines. > Hence any command abort will be timed out and the error will be > escalated further. > > I have fixed it by simply moving the last two lines above the > 'if (wait)' condition. however I fail to see the race condition > mentioned, and hence the validity of the reference counting in the > first place. > So it might be that I'm missing something subtle here, so I would > ask you to have a look here. > Actually, I found the race condition; it's here (mentioned commit removed): DEBUG2(printk("%s(%ld): aborting sp %p from RISC." " pid=%ld.\n", __func__, vha->host_no, sp, serial)); spin_unlock_irqrestore(&ha->hardware_lock, flags); if (ha->isp_ops->abort_command(sp)) { DEBUG2(printk("%s(%ld): abort_command " "mbx failed.\n", __func__, vha->host_no)); ret = FAILED; So the command referenced by 'sp' might've been completed after the unlock, in which case ->abort_command() will be accessing an invalid 'sp' point. Ok. But then my fix is indeed valid. I'll be sending a proper patch. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@xxxxxxx +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html