On 02/10/2015 08:49 AM, Bjorn Helgaas wrote:
We need to work out what's going wrong here before we rush into a band-aid.
What changed between v3.4 and v3.4.1 that exposed this problem? "git
log --oneline v3.4..v3.4.1" doesn't show any likely culprits. Paul,
are those the versions you tested? Your dmesg logs at
https://bugzilla.kernel.org/show_bug.cgi?id=92351 show
"3.4.0-030400-generic" and "3.4.1-030401-generic" but I don't know
whether those are precisely v3.4 and v3.4.1.
I assume this system works fine with Windows, and I doubt Windows has
a hack like "never move LSI devices." So it would be useful to know
if we're doing something stupid in Linux that makes us trip over this.
Paul, if you happen to have Windows on this machine as well, a
complete AIDA64 report (free trial version at http://www.aida64.com)
would show what Windows did.
The resource allocation we're doing is related to SR-IOV, and
unfortunately we don't print enough information in dmesg to figure
everything out. Paul, can you attach the complete "lspci -vv" output
to the bugzilla?
Bjorn
The system I have had this problem on is in production, though it should
be replaced by a real server. Because it is in use, I have used a
separate boot disk to test kernels, and I have limited opportunities to
take the machine down. The system runs Ubuntu Server, though I have used
an Ubuntu desktop install to test kernels. There is no Windows install
on the machine. This is only a guess, but LSI likely provides the
Windows driver, and that driver may well work around a problem that
looks to be specific to a firmware/BIOS version on this card.
Someone found another of these cards here, so I tried it last night in
an unused machine. It worked on the Ubuntu 3.13 kernel without realloc.
The card that has been the problem has these versions of firmware:
[ 9.004647] mpt2sas0: LSISAS2008: FWVersion(17.00.01.00), ChipRevision(0x03), BiosVersion(07.33.00.00)
and the card that works has a newer version:
[ 15.725011] mpt2sas0: LSISAS2008: FWVersion(18.00.00.00), ChipRevision(0x03), BiosVersion(07.35.00.00)
Now, the cards are in very different machines so the difference could be
due to the machines and not the firmware, but I would tend to go with
the firmware difference. LSI firmware is now beyond both these firmware
versions, but if I can find a copy of the older firmware, I'll try it on
the card with the newer firmware.
Just a suggestion, but from the Linux end, if you could trap the older
firmware version and put a message out about the realloc flag and
firmware version, that would help someone else who might fall into the
same hole I found myself in.
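
To illustrate the kind of check I mean, here is a rough user-space sketch
in Python that reads an mpt2sas dmesg line like the two above and warns
when the firmware is older than the version seen working. The function
names and the 18.00.00.00 threshold are my own assumptions for
illustration. Nothing in this thread has yet established that the
firmware is actually the cause, and this is not how the driver itself
would do it.

```python
import re

# Assumption: 18.00.00.00 is simply the version seen working in this
# thread, not a confirmed fix boundary.
KNOWN_GOOD_FW = (18, 0, 0, 0)

# Matches the FWVersion(...) field in an mpt2sas dmesg line.
FW_RE = re.compile(r"FWVersion\((\d+)\.(\d+)\.(\d+)\.(\d+)\)")


def parse_fw_version(line):
    """Return the firmware version as a tuple of ints, or None."""
    match = FW_RE.search(line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def firmware_warning(line, known_good=KNOWN_GOOD_FW):
    """Return a warning string if the firmware is older than known_good."""
    version = parse_fw_version(line)
    if version is None or version >= known_good:
        return None
    found = ".".join("%02d" % part for part in version)
    wanted = ".".join("%02d" % part for part in known_good)
    return ("LSISAS2008 firmware %s is older than %s; if the card "
            "fails to initialize, try the realloc flag or a firmware "
            "update" % (found, wanted))


if __name__ == "__main__":
    sample = ("[ 9.004647] mpt2sas0: LSISAS2008: FWVersion(17.00.01.00), "
              "ChipRevision(0x03), BiosVersion(07.33.00.00)")
    print(firmware_warning(sample))
```

Versions are compared as integer tuples so that each field is compared
numerically rather than as text.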
Paul