On Thu, 2008-01-10 at 23:31 +0100, Krzysztof Helt wrote: > On Wed, 09 Jan 2008 17:51:43 -0600 > James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote: > > > > diff -urp linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c > > > --- linux-ref/drivers/scsi/sym53c8xx_2/sym_glue.c 2007-12-23 20:39:44.000000000 +0100 > > > +++ linux-new/drivers/scsi/sym53c8xx_2/sym_glue.c 2008-01-09 22:22:30.000000000 +0100 > > > @@ -609,22 +609,22 @@ static int sym_eh_handler(int op, char * > > > */ > > > #define WAIT_FOR_PCI_RECOVERY 35 > > > if (pci_channel_offline(pdev)) { > > > - struct completion *io_reset; > > > int finished_reset = 0; > > > init_completion(&eh_done); > > > spin_lock_irq(shost->host_lock); > > > /* Make sure we didn't race */ > > > if (pci_channel_offline(pdev)) { > > > - if (!sym_data->io_reset) > > > - sym_data->io_reset = &eh_done; > > > - io_reset = sym_data->io_reset; > > > + BUG_ON(!sym_data->io_reset); > > > + sym_data->io_reset = &eh_done; > > > } else { > > > finished_reset = 1; > > > } > > > spin_unlock_irq(shost->host_lock); > > > if (!finished_reset) > > > - finished_reset = wait_for_completion_timeout(io_reset, > > > + finished_reset = wait_for_completion_timeout > > > + (sym_data->io_reset, > > > WAIT_FOR_PCI_RECOVERY*HZ); > > > + sym_data->io_reset = NULL; > > > > This has to be cleared under the host_lock to forestall the (tiny) race > > where the pci recovery code checks the value of sym_data->io_reset, we > > change it to null and then the pci recovery code completes a NULL > > pointer. > > > > It is impossible as the io_reset value is not NULL before and during wait > completion. The case above would happen only if one thread checked the > io_reset value (under lock) and it was NULL and before setting it (inside > locked section) another thread checked the io_reset value (still NULL > and also inside locked section = impossible). Otherwise, the BUG_ON() > kicked in (the value is already not NULL). Er, no. Let me be clearer about the sequence sym2_io_resume() is racing to complete sym_data->io_reset with sym_eh_handler()s timeout. Because you don't set it to NULL under the host_lock, you can get the sequence sym2_io_resume() acquires the host lock and checks the value of sym_data->io_reset and finds it to be not NULL. sym_eh_handler() NULLs sym_data->io_resume *without* acquiring the host lock. probably because the wait__for_completion_timeout() times out. sym2_io_resume() calls complete() on sym_data->io_resume which is now a NULL pointer and boom. We never get multiple threads into sym_eh_handler() because it's single threaded from the error handler thread. James - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html