On Fri, 2011-11-11 at 13:42 +0400, Vasily Averin wrote:
> Aacraid controller can hang on some nodes if kernel uses non-default
> (powersave) ASPM policy.
> Controller hangs shortly after successful load and hardware detection.
> Scsi error handler detects this hang and tries to restart hardware but
> it does not help.
>
> Initially it was noticed on RHEL6-based openVZ kernel after
> backporting aacraid driver from mainline (RHEL6 kernel with original
> driver works well)
> http://bugzilla.openvz.org/show_bug.cgi?id=2043
>
> This issue happens because default ASPM policy was changed in Red Hat
> kernels. Therefore guys from Red Hat have noticed this problem long
> time ago:
> on Fedora 12
> https://bugzilla.redhat.com/show_bug.cgi?id=540478
> on Fedora 14
> https://bugzilla.redhat.com/show_bug.cgi?id=679385
>
> In RHEL6 kernel this issue was fixed, ASPM was disabled in aacraid
> driver. In kernel changelog I've found that seems it was done by
> Matthew Garrett:
> - [scsi] aacraid: Disable ASPM by default (Matthew Garrett) [599735]
>
> However seems this patch was not submitted to mainline. I've
> reproduced this issue on vanilla 3.1.0 kernel booted with
> "pcie_aspm.policy=powersave" option,
> So I believe it makes sense to do it now.
>
> I've reviewed similar issues and found that similar troubles happen
> with another hardware too. For example similar patch can be found in
> e1000 driver.

Do you have a comprehensive list? If it's just a couple of drivers in
each subsystem, then adding what amount to device quirks in the driver
seems to be appropriate. If it's a huge number, we might have to rethink
how this feature is implemented.

The next question is: is the driver the correct place? This sounds like
a PCIe Link Power Management blacklist set ... which might need updating
on the fly ... might we need a user knob for this (like we have for the
SCSI black/white list)?

James