BIOS Corruption (was : new abituguru driver in mm kernel)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi All,

I was on a long weekend / short vacation hence the slow response.

Jean Delvare wrote:
> Hi Sunil, Hans,
> 
>> I found an issue with the driver. It seems to have put something in the
>> uguru part of BIOS, which makes my BIOS hang when I enter Abit uGuru
>> temperature monitor screen in the BIOS. No keys work and only way for me at
>> that point is to give it the three finger salute. The peculiar thing I
>> noticed was that somehow it modified the shutdown and beep CPU temperatures
>> to 245C and 235C while I had them set at 65C and 75C.
>>
>> As soon as the temp monitor displays 245C in CPU row, it just hangs. The
>> next values to be displayed are the enable bit for these temps, I think.
>>
>> I never used the driver to write anything to BIOS, so how did it end up
>> updating those values?
> 
> Err, this is bad :(
> 
> Hans, see why I was reluctant to having your driver in the kernel tree?
> Nothing personal, but that kind of problem is to be expected when
> writing a driver without a datasheet :(
> 
This would / could have happened with a datasheet too, its a combination
of a buggy chip (the uguru) which is a pain to work with (it would be
even with a datasheet) and a bug in my driver.

> What should we do now? Mark the driver broken in the kernel tree?
> 2.6.18 is just around the corner, and we sure don't want people to
> corrupt their BIOS.
> 

Things are not as bad as they look, the driver did not corrupt the BIOS,
nor the CMOS ram. For some strange reason (design probably) the Abit
uGuru motherboards BIOS saves the uguru settings to CMOS on successfull
(ACPI) poweroff. So all the driver did was write the uGuru register it
was designed to write, its not writing over random memory.

Also notice that the BIOS is also being buggy in this case, because it
hangs because of a single invalid byte in the uguru's settings, whereas
it should just be checking the relevant bits and ignoring the irrelevant
ones.

The problem has been very correctly analysed by Sergey Vlasov
<vsu at altlinux.ru>, there is indeed a bug in the driver where it fails to
restore the original settings when any read fails during the sensor type
detect code. Attached is a version which fixes this bug and which will
always restore the settings (if changed) no matter how that function is
left. To be really sure it tries the restore up to 3 times in case of a
writing error.

Sergey also correctly points out that this code is writing the uguru
todo the probing and that this is not ideal, unfortunatly, I cannot
think of another way.

I do not believe this bug will reoccur, not with sunil's new and
improved timeout handling, which make the chance of actually getting a
timeout error pretty small, combine this with the always restoring the
settings on an error, instead of the current buggy behaviour and
retrying this restore in case of a write failure (due to those same
timeouts, which are pretty rare now), the total chance of this happening
again is very small. And even if it happens a CMOS reset is enough, most
likely load setup defaults is enough, that fixed a similar issue for me
when I completly trashed the uguru during driver development (I was
probing random uguru bank addresses, which the uguru did not like).

To make really really sure we never get bitten by this I've added code
to the sensors detect code, to mask out any invalid bits from the
original settings byte before restoring it to the uguru.

Sunil,

Could you try the attached abituguru.c, after loading it once, it should
have fixed those invalid settings and you should be able to reenter the
relevant screen in your BIOS' setup just fine.


Here is a complete list of all the changes in this version compared to
the one in the kernel. I've also attached a unified diff between the
latest in kernel version and the attached one, so that it can get an
initial review. If this version fixes Sunil's problems as expected
without him needing to reset his BIOS, I'll then send an offical PATCH
with these changes for merging:

-Much improved timeout / wait for status handling. Many thanks to Sunil
 Kumar, for all his testing, ideas and patches! The code now first busy
 waits, polling the uguru for the expected status as this usual succeeds and
 it succeeds pretty quickly (within 90 reads). To avoid unnescesarry CPU
burn
 in timeout conditions, the amount of busy waiting has been halved from
 previous versions (120 tries instead of 250). This is not a problem,
because
 this version goes to sleep after 120 attemps for 1ms and then tries
again,
 it does this sleep and try again 5 times before finally giving up. This
 (almost?) completly removes the timeout errors some people have seen
 regulary. Appearantly some older uguru versions sometimes are distracted
 for a (relativly) long time. This solves this.
-These timeout errors not only occur in the sending address part of reading
 the uguru but also in the wait for read state, so errors in this state are
 now handled as retryable just like send address state errors and are only
 logged and reported to userspace if 3 executive tries fail.
-Add rudimentary suspend / resume support, this protects the uguru and
 the driver against suspend / resume cycles, so there is no reason to unload
 the driver in your suspend / resume scripts.
-Fix a very nasty bug in the bank1 sensor type detection code, where it
 would not restore the original settings in any of the error paths!
-Since not successfully restoring the original settings can seriously
 confuse the system BIOS (hang when entering the relevant setup menu), we
 now try restoring them 3 times before giving up.
-On successfull dectection of bank1 sensor type mask out any bits invalid
 for this type from the sensors original settings before restoring them.
 This should restore any damage done by the bug mentioned above, and should
 also repair any damage done in the highly unlikely scenario of the original
 sensor settings restoring failing.

Regards,

Hans

-------------- next part --------------
A non-text attachment was scrubbed...
Name: abituguru.c
Type: text/x-csrc
Size: 53013 bytes
Desc: not available
Url : http://lists.lm-sensors.org/pipermail/lm-sensors/attachments/20060809/0a222b95/attachment.bin 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diff
Url: http://lists.lm-sensors.org/pipermail/lm-sensors/attachments/20060809/0a222b95/attachment.pl 


[Index of Archives]     [Linux Kernel]     [Linux Hardware Monitoring]     [Linux USB Devel]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [Yosemite Backpacking]

  Powered by Linux