On 6/18/19, 8:57 AM, "Bart Van Assche" <bvanassche@xxxxxxx> wrote: On 6/17/19 5:01 PM, Quinn Tran wrote: > Attached is the clean-up patch that we held back from the series. > We felt it wasn't ready for wider audience because it needed additional > soak time with our test group. > > We want to ahead and share it with you to let you know that we intent > to cleanup the duplicate atomic [ref_count|kref]. Once it has some > soak time in our test group, we'll submit it in the next RC window. Hi Quinn, Thank you for having shared that patch early. My comments about that patch are as follows: - The patch description is not correct. Today freeing of an SRB does not happen when ref_count reaches zero but it happens when the firmware reports a completion. That is why today the abort code can trigger a use-after-free. ref_count is only useful today for the abort code to detect atomically whether or not the firmware already reported that a request completed. QT: Bart, thanks for the additional eyes. This patch made additional safe guard to prevent use after free. When the original IO arrive into QLA driver, QLA will bind scsi_cmnd & srb together. When the kref reaches 0, we will unbind the 2 structures with NULL pointer. Any attempt on 'use after free' will be block by null pointer. We reserve additional scratch space at the end of scsi_cmnd (srb_private) to facilitate the bind and unbind. 879 qla_release_fcp_cmd_kref(struct kref *kref) 880 { 881 struct srb *sp = container_of(kref, struct srb, cmd_kref); 882 struct scsi_cmnd *cmd = GET_CMD_SP(sp); 884 struct srb_private *priv = (struct srb_private*)(cmd + 1); .. 888 spin_lock_irqsave(&priv->cmd_lock, flags); 889 CMD_SP(cmd) = NULL; <<<< unbind scsi_cmnd from srb. 890 sp->u.scmd.cmd = NULL; 891 spin_unlock_irqrestore(&priv->cmd_lock, flags); } 1359 static int 1360 qla2xxx_eh_abort(struct scsi_cmnd *cmd) 1361 { .. 1370 struct srb_private *priv = (struct srb_private*)(cmd + 1); .. 1382 1383 spin_lock_irqsave(&priv->cmd_lock, flags); 1384 sp = (srb_t *) CMD_SP(cmd); 1385 if (!sp || !sp->qpair || 1386 (priv->cmd_id != sp->cmd_id) || 1387 (sp->fcport && sp->fcport->deleted)) { 1388 spin_unlock_irqrestore(&priv->cmd_lock, flags); 1389 return SUCCESS; 1390 } - Only calling cmd->scsi-done(cmd) when the reference count reaches zero involves a behavior change. If a command completion and a request to abort a command race, this patch will report the command as aborted to the SCSI mid-layer instead of as completed. This change has not been mentioned in the patch description. Is this change perhaps unintentional? QT: I believe the original code has the same behavior. We're trying to preserve precedent. Will revisit to differentiate the 2 status code the described race. - I think that this patch does not address the memory leak that can be triggered by aborting a command. If a command is aborted it will be freed by qla_release_fcp_cmd_kref() calling qla2xxx_rel_qpair_sp() and by qla2xxx_rel_qpair_sp() calling sp->free(sp). However, the implementation of the free function for multiple SRB types is incomplete. QT: As for the other type SRB, I follow what you're pointing at. We do have another patch(is) to address this other corner. Will queue it up to go out also with the next window. It's currently being soak in our test group. - The approach for avoiding that qla2xxx_eh_abort() triggers a use-after-free (the new srb_private structure) is interesting. However, that approach does not look correct to me. The patch attached to the previous e-mail inserts the following code in qla2xxx_eh_abort(): 'sp = CMD_SP(cmd); if (!sp || !sp->qpair || ...) return SUCCESS' As one can see the sp pointer is dereferenced although the memory it points at could already have been freed and could have been reallocated by another driver or another process. So I don't think the new code for avoiding a use-after-free is correct. QT: OK. Will remove old defensive check that may still cause use after free. ---- Thanks, Bart.