On 09/01/2019 16:30, David Gibson wrote:
> On Wed, Jan 09, 2019 at 04:09:02PM +1100, Benjamin Herrenschmidt wrote:
>> On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
>>>
>>>> In a very cryptic way that requires manual parsing using non-public
>>>> docs sadly but yes. From the look of it, it's a completion timeout.
>>>>
>>>> Looks to me like we don't get a response to a config space access
>>>> during the change of D state. I don't know if it's the write of the D3
>>>> state itself or the read back though (it's probably detected on the
>>>> read back or a subsequent read, but that doesn't tell me which specific
>>>> one failed).
>>>
>>> If it is just one card doing it (again, check you have latest
>>> firmware) I wonder if it is a sketchy PCI-E electrical link that is
>>> causing a long re-training cycle? Can you tell if the PCI-E link is
>>> permanently gone or does it eventually return?
>>
>> No, it's 100% reproducible on systems with that specific card model,
>> not card instance, and maybe different systems/cards as well, I'll let
>> David & Alexey comment further on that.
>
> Well, it's 100% reproducible on a particular model of system
> (garrison) with a particular model of card. I've had some suggestions
> that it fails with some other system or card models, but nothing
> confirmed - the one other system model I've been able to try, which
> also had a newer card model, didn't reproduce the problem.

I have just moved the "Mellanox Technologies MT27700 Family
[ConnectX-4]" from the garrison to a firestone machine, and there it
does not produce an EEH with the same kernel and skiboot (both
upstream + my debug). Hm. I cannot really blame the card, but I cannot
see what could cause the difference in skiboot either. I even tried
disabling the NPU so garrison would look like firestone; it is still
EEH'ing.

>>> Does the card work in Gen 3 when it starts? Is there any indication of
>>> PCI-E link errors?
>>
>> Nope.
>>
>>> Every time or sometimes?
>>>
>>> POWER 8 firmware is good? If the link does eventually come back, is
>>> the POWER8's D3 resumption timeout long enough?
>>>
>>> If this doesn't lead to an obvious conclusion you'll probably need to
>>> connect to IBM's Mellanox support team to get more information from
>>> the card side.
>>
>> We are IBM :-) So far, it seems to be that the card is doing something
>> not quite right, but we don't know what. We might need to engage
>> Mellanox themselves.
>
> Possibly. On the other hand, I've had it reported that this is a
> software regression at least with downstream Red Hat kernels. I
> haven't yet been able to eliminate factors that might be confusing
> that, or try to find a working version upstream.

Do you have tarballs handy? I'd diff...


-- 
Alexey