Re: Recent kernel hosing partition

For Junk Mail <junk_mail@xxxxxxxxxxxxxxxxxx> · Tue, 11 Dec 2007 10:19:04 +0000

On Tue, 2007-12-11 at 10:47 +0900, Tejun Heo wrote:
> Hello,
> 
[snip]
> 
> AFAIK, there currently isn't any known problem specific to VIA - Seagate
> combination.  sata_via surely has some issues on error conditions tho.

From previous incarnations of the VIA chipset I've had DMA errors and
drive 'ringing' (where access/copying to hdb wakes up hda which says
"What's going on?" and confuses everything) with Seagate drives. One M/B
sat down and refused to work with 2 hard disks on the same ribbon. Maybe
I'm just one disenchanted luser but I had the logs to prove it in the
crashtesting days and they were examined by Mandrake's guys.
> 
> >>> Another issue here is that the old ide driver could get through the
> >>> mess, whereas the newer one cannot. I get "Drive reset: success" and the
> >>> old ide driver recovers, whereas the new one goes out to lunch. The log
> >>> snippets show a 60 seconds gap between errors. That's a 60 second freeze.
> >> Hmmm...
> >>
> >> 1. So, the IDE driver suffers from error conditions too?  Do you have
> >> logs around?
> >>
> 
> I meant the old driver/ide/* drivers.
> 
/checks every distro
YES! I have logs of errors with the old ide driver. When Fedora 7 went
out to lunch, I was embarrassed for a kernel for my (previous) Fedora 5,
and ended up using e2fsck from a uClibc based experimental distro from

http://kevux.org/

It has e2fsck-1.40.2, and some weird alternative log system. I'll send
the appropriate log privately as well as Fedora's log. Logs are dated.
The last errors in Kevux will correspond to a time shortly
after /usr/lib/firefox went missing in Fedora 7, as I went from one to
the other to sort the disk out. Do you understand me? 

I should be very clear. These errors occurred using the old driver on
hda3(sda3) while dealing with errors _caused_ by what you are trying to
investigate. Fedora 7 also had /dev/sda5 mounted as /home, and /dev/sda1
as /boot and not one error occurred on either of those. I checked the
whole disk with e2fsck at some points, and everything was fine.
Filesystems were modified, but nothing landed in lost+found and nothing
was corrupted, to my knowledge, except on sda3.

What upset me personally, btw, is that nobody in RedHat/Fedora gave an
<expletive deleted>. When you're finished, Slackware is going in
there :-D

> >> 2. Do you have logs of libata driver goes out to lunch?
> >>
> > Catch 22. Did you see the film? I've only one hard disk. Reset to get
> > out of trouble, so how does it log the disk going out to lunch?. Where
> > would I log it to?
> 
> Ah.. Catch 22 is name of a film.  I knew what it meant but never knew
> where the expression came from.  Anyways, in such cases, log is usually
> collected via serial or net console, usb or other storage if you have
> quasi working userland or digital cameras as a last resort.

Have you a doc on setting up such a log somewhere? I'll set one up, as
long as it doesn't queue in the IDE cache. BTW, Catch-22 was also a
book, which I read. It was full of army tales. You didn't miss much,
imho. Knowing what it means is enough.
> 
[snip]
> > Typically, in an 'out to lunch' period, the line beginning 'exception
> > Emask' down as far as 'DPO or FUA' would repeat on stdout. Some disk
> > error would precede it, e.g. '/usr/lib/something.so: no such file or
> > directory'. That file would probably migrate to lost+found on the next
> > e2fsck pass and when I went to check it 2 reboots later it was indeed
> > missing. Then we got to the stage where the
> > entire /usr/lib/firefox<version>/  directory migrated and we departed
> > from reality at that point.
> 
> Ah... I'd really like to see the log.

Sadly, there wasn't one. The box froze in X. I hit Ctrl-Alt-F1 and saw

/usr/lib/firefox-2.0.0.9/firefox-bin: No such file or directory

followed by the error (Emask ... --> DPO or FUA).
e2fsck found illegal inodes, loose inodes, inodes claimed by 2 programs,
counts all over the place. It restarted itself after stage 2, and I
nearly blew a gasket because stage1 had the badblocks option set :-(. I
saw A, B, & C to some of these 5 stages that I had never seen before. I'll
privately send you the /var/log/messages in its entirety, which is all
the data Fedora 7 recorded. I know linux-ide will bounce it. The _last_
set of errors in the file will be from the time
when /usr/lib/firefox-2.0.0.9/ went AWOL.

Subsequent to that outage I compiled binutils, uClibc, installed linux
headers, and finally crashed out on a repeatable error in compiling gcc
using somebody's scripts in Fedora 7. But I couldn't run X, because
gnome and every X program was borked by this error. I'd get X (the grey
screen) and then things went sadly wrong in gnome.

> 
> > If we can provoke the error, I feel the way to trap it is
> > 1. make intelligent recoverable changes to ide partition /dev/sda3 on
> > firefox files.
> > 2. Directly or indirectly, Mount my 1 gig usb disk on /var/log :-D.
> > Would that get around the Catch-22? I can stick in another (old) disk if
> > needed, but I only have ide, and we freeze, so that will hardly be much
> > good.
> 
> Usually the best way is serial or net console.

Have you a reference, or a doc on doing that? I'll set it up.
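
For reference, the kernel tree documents net console in
Documentation/networking/netconsole.txt. The receiving end only needs to
listen for UDP datagrams on a second machine. What follows is a minimal
sketch of such a listener, not a tested recipe for this box. It assumes
the sender targets UDP port 6666 (the netconsole default as I understand
it); the function names, the netconsole.log file name and the
max_packets argument are all hypothetical.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket. Port 6666 is assumed to be the netconsole target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, out, max_packets=None):
    """Copy each received datagram to out, flushing after every write.

    max_packets is a hypothetical convenience for testing; None means
    keep going until interrupted.
    """
    seen = 0
    while max_packets is None or seen < max_packets:
        data, _sender = sock.recvfrom(65535)
        out.write(data.decode("utf-8", errors="replace"))
        out.flush()  # get it to disk on the logging machine straight away
        seen += 1
    return seen


if __name__ == "__main__":
    # Run this on the second machine, not on the one that freezes.
    with open_listener() as listener, open("netconsole.log", "a") as log:
        capture(listener, log)
```

The point of logging on another machine is that nothing has to reach the
local disk, which is what gets around the Catch-22.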

> 
> There are other reports of sata_via freezing up after transport errors
> and sadly there isn't too much to do about it.  The controller hangs
> while holding the PCI bus and no software can recover from that.  I'm
> currently not sure whether the controller locks up on transmission
> errors or as a response to libata's error handling sequence.  If latter,
> we may be able to avoid it by changing EH sequence but unfortunately I
> don't have access to affected hardware or time at the moment.

Here VIA is one step up (or down) from everybody else because PCI and IDE
are split in the Southbridge, and the 2 are not linked. I have the
datasheet to prove it. So it's freezing further back. I've worked in
electronic hardware and I see 2 problems:

1. The error condition reading the filesystem for whatever reason (In my
case, linked to some X program). 
2. The soft reset libata provides doesn't sort things out. The drive
reset provided by the old ide driver seemed to sort it out. 
> 
> What worries me is that your case actually resulted in data corruption.
>  libata's EH is safe.  Another possibility is that your filesystem got
> corrupted while going through several lockup - reboot sequences in which
> case data sure is lost.  But still journaling and barrier should be able
> to avoid filesystem corruption.  You have barrier enabled, right?

I really don't know if barrier is enabled. If you tell me how, I can
check it. Journalling is on the same partition, but as we froze, and
apparently did more damage as things went on, I was quick to reset. That
effectively reduces it to ext2. But I was also quick to check the whole
partition (because I couldn't boot otherwise).
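
One possible way to check, offered as an assumption rather than anything
confirmed in this thread: the mount options listed in /proc/mounts may
include a barrier setting (e.g. barrier=1 on ext3) when it has been
turned on. The sketch below just pulls out any such option per mount
point. The function name is hypothetical, and the absence of an option
does not prove barriers are off, since defaults vary by filesystem and
kernel version.

```python
def barrier_options(mounts_text):
    """Map each mount point to the barrier-related options listed for it.

    mounts_text is text in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        found[mount_point] = [
            opt for opt in options
            if opt.startswith(("barrier", "nobarrier"))
        ]
    return found


if __name__ == "__main__":
    with open("/proc/mounts") as mounts:
        for mount_point, opts in barrier_options(mounts.read()).items():
            print(mount_point, opts or "no barrier option listed")
```

An empty list for / would only mean nothing was listed there, so the
filesystem's default would apply.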

-- 
For Junk Mail <junk_mail@xxxxxxxxxxxxxxxxxx>
