On Wed, Jul 18, 2012 at 1:25 PM, Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, 17 Jul 2012, James Bottomley wrote:
>
>> On Tue, 2012-07-17 at 12:51 -0700, Linus Torvalds wrote:
>> > Wrong people cc'd, it looks like.
>> >
>> > Guys, commit a7a20d103994 ("sd: limit the scope of the async probe
>> > domain") is causing boot problems. It's timing-dependent and
>> > apparently sometimes works, which makes sense with that commit.
>> >
>> > However, it *should* have been fixed by commit 43a8d39d0137 ("fix
>> > async probe regression"), but Artem seems to report the problem even
>> > in -rc7.
>> >
>> > Comments?
>>
>> As far as I can tell, the fix should have worked. However, there are a
>> lot of assumptions in the async stuff that end up not being true in the
>> presence of separate async domains. We should be fixing it all here:
>>
>> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi.git;a=shortlog;h=refs/heads/async
>
> Those commits haven't been merged into 3.5-rc7, right? So Artem isn't
> using them.
>
>> I've got to say, I don't understand the bug report. All of those
>> commits were about probing for devices. However, the screen shot
>>
>> https://bugzilla.kernel.org/attachment.cgi?id=75351
>>
>> shows the devices were found, it's the partition tables that weren't.
>> For us to see the message about sda's capacity, we're already in the
>> async code the commits were trying to synchronise with.
>
> That's understandable, given that those commits aren't present in
> Artem's kernel.

Are they queued for -stable? I did not flag them as such given the
wider entanglements.

>
>> However, there are some missing messages: there's no partition table
>> print and no final
>>
>> sd_printk(KERN_NOTICE, sdkp, "Attached SCSI %sdisk\n",
>> sdp->removable ? "removable " : "");
>>
>> So sd_probe_async() got stuck somewhere after the first
>> sd_revalidate_disk().
>
> No, it didn't get stuck.
> The problem is that the scsi_wait_scan module
> didn't wait for the async scanning to finish.
>
> And the reason for that is one you're already familiar with:
> CONFIG_SCSI_MOD=y.

I had forgotten that additional aspect, but I don't think that is the
root cause. scsi_wait_scan() is a nop in Artem's config given
CONFIG_SCSI_WAIT_SCAN=m.

From what I have seen, with ahci a call to wait_for_device_probe() is
typically all that is needed to guarantee that scanning is complete.
Commit a7a20d103994 ("sd: limit the scope of the async probe domain")
breaks that guarantee, and commit 43a8d39d0137 ("[SCSI] fix async probe
regression") does not restore it, because that fix requires a call to
scsi_complete_async_scans() to close the loop.

> Artem, if you change all your SCSI drivers to be modular rather than
> built-in, that ought to fix the problem. Alternatively, you can simply
> continue to use the "rootwait" option.

I think setting CONFIG_SCSI_WAIT_SCAN=y would also be a workaround, as
would going forward with the pending async rework to make sd probe work
once again visible to async_synchronize_domain_full().

--
Dan