Thanks, Comments inline On Jan 17, 2012, at 11:04 PM, Dan Williams wrote: > On Tue, Jan 17, 2012 at 10:29 AM, Mark Salyzyn <mark_salyzyn@xxxxxxxxxxx> wrote: >> pm8001_query_task() and pm8001_abort_task() panic kernel when devices asynchronously disappear, a possible scenario since these functions are generally called when errors are mounting. Some of the panics are a direct result of a failure to NULL check some of the structure variables that are in certain states of teardown. One of the lockups was a direct result of returning an unexpected code to libsas' sas_scsi_find_task() function (creating a tight loop of an unexpected code 138 upstream to the scsi layer queue function). >> >> Signed-off-by: mark_salyzyn@xxxxxxxxxxx >> Cc: jack_wang@xxxxxxxxx >> Cc: JBottomley@xxxxxxxxxxxxx >> Cc: crystal_yu@xxxxxxxxx >> Cc: john_gong@xxxxxxxxx >> Cc: lindar_liu <lindar_liu@xxxxxxxxx> > > While your in the area of libsas error handling, mind weighing in on the pending libsas error handling rework backlog? > > http://git.kernel.org/?p=linux/kernel/git/djbw/isci.git;a=shortlog;h=refs/heads/libsas-eh-reworks-v4 Xyratex has me focussed on 2.6.32 problems, I will look at these, but will be off the reservation (and with no Q/A testing resource to keep me honest) when doing so ;-}. Pure Code Inspection thus trumps experimentation ... > ...more notes below: > >> @@ -986,7 +997,8 @@ int pm8001_abort_task(struct sas_task *task) >> struct pm8001_device *pm8001_dev; >> struct pm8001_tmf_task tmf_task; >> int rc = TMF_RESP_FUNC_FAILED; >> - if (unlikely(!task || !task->lldd_task || !task->dev)) >> + if (unlikely(!task || !task->lldd_task >> + || !task->dev || !task->dev->lldd_dev)) > > Hmm some of these are "never can happen" checks, so if you are indeed > seeing !task or !task->dev something is very wrong in libsas. checking > task->lldd_task and task->dev->lldd_dev should be all that is > required. OK, point taken, these issues surface with careless abandon in the 2.6.32-vintage kernels; and may, or may not, be present in 3.2.1-vintage kernels. What this means is that maybe we should be submitting these stabilization/band-aid patches to the stable trees rather than to the top-of-tree? I'd opt (not that I have the rights) for paranoia and adding these in at top-of-tree, as band-aids against panics (in libsas or pm8001). The checks are not costly IMHO (hopefully less-so as wrapped by unlikely()). My WAG is that this patch should be cancelled here, and placed into the 2.6.32-stable bucket ... Looking forward to yours and Greg's comments to straighten me out. > Also, I think it confuses the situation to return TMF_RESP_FUNC_FAILED > in this case because the truth of the matter, as far as error handling > is concerned, is that the lldd has forgotten the task, but you would > need to check that the lldd does not subsequently call task->task_done > in some other path. In our private copy of the pm8001 driver, we have some debug code that performs tag tracking and memory-leak detection. In that code we ensure that task_done is not called for a tag that has already been released by another path. In 2.6.32-vintage kernels, when we hit these we try to remove them (those patches are in my back-log). > Dan Sincerely -- Mark Salyzyn -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html