On Sun, Jul 13, 2014 at 11:41 PM, Kyle Auble <kyle.auble@xxxxxxxx> wrote:
> Hello, I wanted to keep this email short, but my questions are all
> interconnected. My GPU is an on-board Nvidia GeForce 8400M GT
> (PCI ID [10de:0426]), and since at least kernel v3.2, the generic x86
> kernel only loads the device 1 time in 10. This is still true as of
> v3.16-rc3. Honestly, it's probably something the BIOS should prevent,
> but I've checked, and there are no relevant options or upgrades for my
> BIOS (on a Sony Vaio VGN-FZ260E).
>
> I've been tracking this problem at launchpad.net on and off for a
> couple of years now,

This is https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1009312,
right?

Please collect the following information from two boots of the newest
kernel you have: the first on battery power, where the GPU works fine,
and the second on AC power, where the GPU driver fails to load:

  - the complete dmesg log
  - "lspci -vvxxx" output for the whole system

In addition, please collect an acpidump (this will be the same either
way, so you only need one copy).

I see that in https://launchpadlibrarian.net/106978987/baddmesg.log you
were using the proprietary nvidia driver. If the problem is that that
driver isn't loading, there isn't much we can do, because it's
closed-source. But if you're seeing a problem before that driver loads,
there might be something we can fix.

Bjorn

> but I don't think it's a common issue, and I have some free time to
> try resolving it myself now. I'm new to system programming, though,
> so I was wondering: does the issue I'm seeing fit a known pattern?
> Can someone help me understand how the symptoms fit together and
> where they come from? Or if I need to do more analysis, what would be
> the best approach?
>
> 1. The key thing I discovered is that whenever the GPU does load, a
> ~6ms gap appears in the dmesg log during the GPU's PCI
> initialization. When the GPU fails to load, though, this gap grows to
> 30ms. I've also pinpointed the delay (with dev_info statements) to
> pcie_aspm_configure_common_clock() in drivers/pci/pcie/aspm.c.
>
> After some googling, I came across presentations from the PCI-SIG
> that mention 24ms as precisely the PCIe-specified timeout for some
> states of link training, and sure enough, this function tells the
> bridge upstream of the GPU to retrain the link. However, even when
> the GPU fails to load and 30ms is spent in the function, the dev_err
> towards the end of the function doesn't print.
>
> 2. The first reason I'm fairly certain this isn't an unrecoverable
> hardware issue is that there's a workaround. If I make sure my
> computer is running off the battery, without AC power, for that first
> second of kernel initialization, the GPU always loads. I've tried
> this dozens of times. I don't clearly understand why, but I've read
> that the power-saving link states do correspond to distinct states in
> the link-training state machine.
>
> 3. The next fact (which I have no explanation for) is that the
> situation reverses almost exactly on the amd64 kernel. The 64-bit
> kernel boots the GPU fine 9 times out of 10, but there is still the
> occasional session where the 30ms gap appears and the GPU never
> loads.
>
> 4. To keep things simple, I also tried inserting dev_info statements
> within the different branches of pcie_aspm_configure_common_clock(),
> but this made the problem disappear (and there was only a 6ms gap). I
> tried once more with fewer statements to reduce overhead, which did
> increase the time gap to 11ms but still allowed the GPU to load. The
> idea that more overhead in the function affects timing makes sense to
> me, but that it decreases the time spent in the function is
> counter-intuitive.
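On points 1 and 4: for reference, the retrain-and-wait tail of
pcie_aspm_configure_common_clock() looks roughly like this in kernels
of that era (paraphrased from the v3.16-era aspm.c and trimmed; not a
verbatim quote, so details may differ):

	/* Retrain link */
	reg16 |= PCI_EXP_LNKCTL_RL;
	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);

	/* Wait for link training to end; give up after a timeout */
	start_jiffies = jiffies;
	for (;;) {
		pcie_capability_read_word(parent, PCI_EXP_LNKSTA, &reg16);
		if (!(reg16 & PCI_EXP_LNKSTA_LT))
			break;
		if (time_after(jiffies,
			       start_jiffies + LINK_RETRAIN_TIMEOUT))
			break;
		msleep(1);
	}
	if (!(reg16 & PCI_EXP_LNKSTA_LT))
		return;	/* training finished: the success path */

	/* Training failed; restore the old clock configuration */
	dev_err(&parent->dev, "ASPM: Could not configure common clock\n");

LINK_RETRAIN_TIMEOUT is HZ, i.e. a full second, so spending 30ms in the
function without the dev_err printing would simply mean the Link
Training bit took ~30ms to clear and the loop still exited through the
success path. Note also that the loop polls in 1ms msleep() steps, so
anything that shifts scheduling during early boot (extra printk
traffic included) could plausibly move the measured gap around.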
> 5. Finally, before I started looking through the code, I tried some
> git bisections, because there was a brief period in the summer of
> 2013 when the problem went away. The commit that resolved it turned
> out to be d34883d4e35c0a994e91dd847a82b4c9e0c31d83 by Xiao Guangrong.
> After the problem returned, I tried another bisection, but I wound up
> bisecting manually instead of using git bisect (I honestly don't
> remember why). The commit I found that reintroduced the problem was
> ee8209fd026b074bb8eb75bece516a338a281b1b by Andy Shevchenko.
>
> What stumps me is that neither of these commits appears directly
> related to the PCI subsystem. Because it wasn't a normal bisection
> that returned Andy's commit, and I didn't test that build as much, I
> still wonder if it's a false positive. However, I've tested a kernel
> built at Xiao's commit many times, so I'm confident it resolved the
> issue, though my hypothesis is that it did so purely through a subtle
> side effect of how the raw assembly is laid out in memory at startup.
>
> Again, I apologize for the length, but I'd be grateful for any
> advice. I'm not subscribed to the mailing list, so I would appreciate
> being CC'ed on any replies. I don't plan on becoming a regular kernel
> hacker anytime soon; I just want to do my tiny part to help.
>
> Sincerely,
> Kyle Auble
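One more suggestion on point 4: dev_info() goes through the console,
and early-boot console output is slow enough to perturb exactly the
kind of timing you're chasing. A lighter-weight approach is to measure
the window with ktime and report it via trace_printk(), which writes
to the ftrace ring buffer instead of the console. A minimal sketch of
such a debug patch (hypothetical, not in the tree):

	/*
	 * Hypothetical debug instrumentation for aspm.c: time the
	 * retrain window with ktime and log the result through
	 * trace_printk(), which avoids the console entirely. Read
	 * the output from /sys/kernel/debug/tracing/trace after
	 * boot. You'd need <linux/ktime.h> for ktime_get() if it
	 * isn't already included.
	 */
	ktime_t start = ktime_get();

	/* ... existing retrain-and-wait loop from above ... */

	trace_printk("%s: retrain window took %lld us\n",
		     pci_name(parent),
		     ktime_to_us(ktime_sub(ktime_get(), start)));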