On Thu, 15 Sep 2005, James Bottomley wrote: > Well, I think the symptoms are racing scsi_remove_host() calls and the > solution is to enforce the state model on removal (as in if the host is > already in the remove state, don't try to remove it again). You're forgetting something: Devices can be removed at any time, no matter what state the host is in. It's even possible for thread A to be removing a device while thread B is removing the host. The A thread will interfere with the B thread, because the "list_for_each_entry_safe" loops in scsi_forget_host and __scsi_remove_target _aren't_ safe against other threads removing devices. Also remember that the list pointers get changed when a device is _released_, which can occur quite some time after it is _removed_. In short, those iterations must be carried out as in my patch. > > I still do not understand as I asked in a previous comment why we are not > > shutting down the eh_thread in scsi_remove_host and also why simpler state > > model updates could not solve the problem. > > Well, it goes back to whether we wait for the thread or not. To shut > the thread down, we also need to wait for it to complete. > > As far as the state model goes, we either need to wait for the eh thread > before transitioning to cancel or introduce the extra states that > reflect the parallel eh transitions. Agreed, one or the other of these is needed. I think the parallel-states approach offers a little more flexibility. > > I believe I also indicated that we could enhance scsi_error to shutdown > > faster during this state which should only be a performance improvement. > > Yes, we could ... patches? During the DEL_RECOVERY state you could maybe make the error handler faster by removing the settle delays. I'm not sure what else you can do -- all attempts at submitting commands will quickly fail. During CANCEL_RECOVERY you may not want to speed up the error handler. In this state it needs to function normally, since there might be errors during the cache-flush sequence or similar things (for non-forced removal). Finally, having said all that, it turns out there's still a bug in the 2.6.14-rc1 code. I feel pretty stupid about it -- you'll understand why when you read the patch below. Alan Stern Signed-off-by: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> --- a/drivers/scsi/scsi_sysfs.c Mon Sep 12 23:12:09 2005 +++ b/drivers/scsi/scsi_sysfs.c Thu Sep 15 21:32:27 2005 @@ -707,9 +707,11 @@ **/ void scsi_remove_device(struct scsi_device *sdev) { - down(&sdev->host->scan_mutex); + struct Scsi_Host *shost = sdev->host; + + down(&shost->scan_mutex); __scsi_remove_device(sdev); - up(&sdev->host->scan_mutex); + up(&shost->scan_mutex); } EXPORT_SYMBOL(scsi_remove_device); - : send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html