Hi,

while you are chasing a problem with i2c-i801, I would like to mention
that I never got an answer on the thread
https://lkml.org/lkml/2013/1/23/405 about a kmemleak reported by the
kernel. Maybe this could give you a hint?

If the two issues do not overlap, I would still be glad to receive an
answer via the original thread I started.

Thank you,
Martin

Jean Delvare wrote:
> Hi Robert,
>
> On Thu, 16 May 2013 13:44:55 +1000, Robert Norris wrote:
>> On Wed, May 15, 2013 at 09:49:23PM +0200, Jean Delvare wrote:
>>>>         Interrupt: pin B routed to IRQ 0
>>>
>>> Hmm, this "IRQ 0" is quite odd. I'm wondering if this could be the
>>> reason for this hang. Was it with the i2c-i801 driver loaded, or
>>> blacklisted? Please check if it makes a difference.
>>
>> That was without the driver loaded (blacklisted). After loading (with
>> interrupts enabled) we get:
>>
>>         Interrupt: pin B routed to IRQ 20
>
> For the record, I also see the IRQ value change after loading the
> i2c-i801 driver on my system (with an ICH10 south bridge.) From 14 to
> 22 in my case. So it's a bit different (no IRQ 0) but still
> somewhat similar, so I'm still not sure if this has anything to do with
> your issue.
>
>>
>>> Do you see the same (and more generally, this issue) on one, some or
>>> all of your x3550 servers?
>>
>> The issue has occurred on at least three x3550s (we have 11). I haven't
>> tested more, because knowingly crashing production machines sucks.
>
> Yes of course, I understand, I did not expect you to do that ;)
>
>> This appears to be the case on other machines. With the module
>> blacklisted (never loaded), lspci shows IRQ 0. After load, IRQ 20.
>> (tested on 3.4 and 3.9).
>
> OK.
>
>>> Are you using IPMI on these machines?
>>
>> Yes, but only for monitoring/sensors, if that makes a difference.
>
> IPMI is still likely to access the SMBus controller. If there's a BMC
> in the machine, it can also access the SMBus slave with its own
> controller. It would be good to rule this out by disabling IPMI
> completely, removing the BMC from the machine if it has one, and
> checking if it makes the issue go away or not.
>
>>> I would appreciate if you could test the following:
>>> * Blacklist i2c-i801 and ics932s401 so that none of them get
>>>   auto-loaded.
>>
>> Done.
>>
>>> * Manually load i2c-i801 with interrupts enabled, and see what
>>>   happens.
>>
>> Returned immediately:
>>
>> [   60.527140] i801_smbus 0000:00:1f.3: SMBus using PCI Interrupt
>
> This confirms that the i2c-i801 driver loading itself isn't the problem.
>
>>> * If no hang happens, load i2c-dev, find the i801 bus number with
>>>   i2cdetect -l (from the i2c-tools package - it should be 4 according
>>>   to what you reported so far but there is no guarantee that it won't
>>>   change across reboots.)
>>
>> $ i2cdetect -l
>> i2c-0   i2c     Radeon i2c bit bus DVI_DDC      I2C adapter
>> i2c-1   i2c     Radeon i2c bit bus VGA_DDC      I2C adapter
>> i2c-2   i2c     Radeon i2c bit bus MONID        I2C adapter
>> i2c-3   i2c     Radeon i2c bit bus CRT2_DDC     I2C adapter
>> i2c-4   smbus   SMBus I801 adapter at 0440      SMBus adapter
>>
>>>   Then do a simple read from a random address
>>>   with:
>>>   # i2cget 4 0x50 0x00
>>>   (Adjust the bus number as needed.)
>>>   I am curious if this will hang as well or only when accessing the
>>>   clock chip at address 0x69.
>>
>> Yep, that one hangs. The hung task handler picked it up after a few
>> minutes.
>
> OK, this means that any transaction request to the SMBus controller
> causes the hang.
>
> The i2c-i801 driver is optimistically using wait_event() when waiting
> for an interrupt to arrive. I suppose that the interrupt is never
> delivered in your case (all 0 in /proc/interrupts.)
>
> Daniel, shouldn't we use wait_event_timeout() instead to catch issues
> like this and fail cleanly? Maybe even fall back to polling
> automatically?
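
For illustration only, the "bounded wait, then fall back to polling"
idea from the last quoted paragraph can be sketched in user space. This
is not the i2c-i801 code and not a proposed patch: a pipe stands in for
the interrupt, poll() with a timeout plays the role that
wait_event_timeout() would play in the kernel, and struct fake_ctrl,
wait_irq_timeout(), poll_status(), do_transaction() and the 50 ms / 100
tries limits are all hypothetical names and values chosen for the
example.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the controller; not the i2c-i801 structures. */
struct fake_ctrl {
    int  irq_fd;   /* read end of a pipe; one byte models one interrupt */
    bool hw_done;  /* models the status bit a polling loop would read */
};

/* Bounded wait for the "interrupt": 0 if it arrived, negative errno if not. */
static int wait_irq_timeout(const struct fake_ctrl *c, int timeout_ms)
{
    struct pollfd pfd = { .fd = c->irq_fd, .events = POLLIN };
    char byte;
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -errno;
    if (rc == 0)
        return -ETIMEDOUT;  /* nothing delivered: report it, do not hang */
    return read(c->irq_fd, &byte, 1) == 1 ? 0 : -EIO;  /* consume the event */
}

/* Fallback: read the status flag a bounded number of times. */
static int poll_status(const struct fake_ctrl *c, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (c->hw_done)
            return 0;
        poll(NULL, 0, 1);  /* sleep about 1 ms between reads */
    }
    return -ETIMEDOUT;
}

/* Try the interrupt path first; if it fails, fall back to polling. */
static int do_transaction(const struct fake_ctrl *c, bool *used_polling)
{
    assert(c != NULL && used_polling != NULL);

    *used_polling = false;
    if (wait_irq_timeout(c, 50) == 0)
        return 0;

    *used_polling = true;
    return poll_status(c, 100);
}
```

With an unbounded wait, the case where the interrupt never arrives
blocks forever, which matches the hang Robert describes. In this sketch
the same case either completes through the polling path or returns
-ETIMEDOUT to the caller.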