On Fri, 2009-08-21 at 15:51 +0100, Chris Webb wrote:
> James Bottomley <James.Bottomley@xxxxxxx> writes:
>
> > On Fri, 2009-08-21 at 10:23 +0100, Chris Webb wrote:
> >
> > > Sorry to follow up a third time, but I can now confirm this. I slipped -g into
> > > CFLAGS in the kernel Makefile and rebuilt genhd.o and then the entire vmlinux.
> >
> > I suppose it makes sense: That was the only dereference at offset 16 I
> > could find in the code. The thing which doesn't quite make sense is
> > that disk_part_iter_init() also dereferences the same pointer
> > successfully ... I suppose this could be a race with another thread to
> > null out the gendisk part_tbl ... I'll have to think about it some more.
>
> Thanks! If it helps, I've only ever seen it following an iscsi login to a
> target machine which is heavily loaded (e.g. RAID resync in this case),
> presumably meaning that everything (including disk reads) happens a bit
> slowly. Perhaps this increases the window for a race in some way?
>
> I've spent some time over the past week trying to reproduce it in a VM with
> magic sysrq enabled so I could find out a bit more, but it stubbornly refuses
> to happen except on machines in a busy production cluster.

Actually, for that particular pointer to be NULL'd, I think the race must
be between add_disk and del_gendisk, implying that your iSCSI cluster
somehow shut down the link while it was busy. I think that's an
artifact of the fact that we don't get a reference to the disk in these
operations, and the race window is much longer now we do sd async
scanning.

Can you try this as a partial fix? (It should prevent the oops, but
you'll still lose the disk).

James

---

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b7b9fec..a89c421 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2021,6 +2021,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 
 	sd_printk(KERN_NOTICE, sdkp, "Attached SCSI %sdisk\n",
 		  sdp->removable ? "removable " : "");
+	put_device(&sdkp->dev);
 }
 
 /**
@@ -2106,6 +2107,7 @@ static int sd_probe(struct device *dev)
 
 	get_device(&sdp->sdev_gendev);
 
+	get_device(&sdkp->dev);	/* prevent release before async_schedule */
 	async_schedule(sd_probe_async, sdkp);
 
 	return 0;
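
For anyone following the thread, the get/put pairing in the patch can be illustrated outside the kernel. What follows is only a userspace sketch, not the sd.c code: struct obj, struct work, probe(), async_work() and run_pending() are hypothetical names made up for the illustration, the counter is a plain int rather than an atomic kref, and "release" just sets a flag instead of freeing anything. It assumes a single thread and a one-slot work queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as struct device. */
struct obj {
	int refs;	/* plain counter; the kernel uses an atomic kref */
	bool released;	/* set when the last reference is dropped */
};

static void obj_init(struct obj *o)
{
	o->refs = 1;	/* the creator holds the initial reference */
	o->released = false;
}

static void obj_get(struct obj *o)
{
	assert(!o->released);
	o->refs++;
}

static void obj_put(struct obj *o)
{
	assert(o->refs > 0);
	if (--o->refs == 0)
		o->released = true;	/* real code would free the object here */
}

/* Hypothetical one-slot deferred work queue. */
struct work {
	void (*fn)(struct obj *);
	struct obj *arg;
	bool pending;
};

/* Deferred half: analogous in spirit to sd_probe_async(). */
static void async_work(struct obj *o)
{
	assert(!o->released);	/* safe: probe() took a reference for us */
	obj_put(o);		/* mirrors put_device() at the end of the async part */
}

/* Synchronous half: analogous in spirit to sd_probe(). */
static void probe(struct work *w, struct obj *o)
{
	obj_get(o);		/* mirrors get_device() before async_schedule() */
	w->fn = async_work;
	w->arg = o;
	w->pending = true;
}

/* Run the queued work, if any. */
static void run_pending(struct work *w)
{
	if (w->pending) {
		w->pending = false;
		w->fn(w->arg);
	}
}
```

In this sketch, if the owner calls obj_put() after probe() but before run_pending(), the object is not released until the deferred work has dropped its own reference. Without the obj_get() in probe(), the same sequence would trip the assert in async_work(), which is the userspace analogue of the oops.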