Nasty ext3 errors 2.4.18

"Glen Cumming" <glen@cumming66.freeserve.co.uk> · Sat, 14 Dec 2002 14:24:00 -0000

Hi

I’ve got serious trouble. I posted a while back about ext3 errors under
2.4.18. At the time I put them down to hard disk failure, but the errors
are occurring more and more often. Not all of our systems are affected,
but 3 have now shown the problem.

The hardware is essentially the same on all of them; the only difference
is the disk manufacturer. We’ve now seen the problem on several brands
of disk (Maxtor, Seagate, IBM, etc.), so I can’t simply put this down to
disk failure (motherboard failure possibly, but 3 different
motherboards?).

I’ve included some of the kernel debug output below. There is a lot of
it, so I’ve only included one example of each type of error reported
(with the times when the problems started).

Thu 12/12/02 15:50:37.315  [KMSG:<2>EXT3-fs error (device ide0(3,10)): ext3_free_blocks: Freeing blocks in system zones - Block = 128, count = 1]

Fri 13/12/02 23:55:46.383  [KMSG:<4> <6>attempt to access beyond end of device]

[KMSG:<6>16:05: rw=0, want=137058900, limit=39230698]

[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_branches: Read failure, inode=3567502, block=-1576348012]

[KMSG:<6>attempt to access beyond end of device]

[KMSG:<6>16:05: rw=0, want=1724901664, limit=39230698]

[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_branches: Read failure, inode=3567502, block=-642516409]

[KMSG:<6>attempt to access beyond end of device]

<snip>

Fri 13/12/02 23:55:46.411 [KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_branches: Read failure, inode=3567502, block=1329885327]

[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing blocks not in datazone - block = 1874129395, count = 1]

[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing blocks not in datazone - block = 203477977, count = 1]

[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing blocks not in datazone - block = 2877124100, count = 1]

[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing blocks not in datazone - block = 103093662, count = 1]

[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing blocks not in datazone - block = 3719271906, count = 1]

[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing blocks not in datazone - block = 4274192639, count = 1]

<snip>

[KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: bit already cleared for block 9126180]

<snip>

Fri 13/12/02 23:56:00.242 [KMSG:<2>EXT3-fs error (device ide1(22,5)): ext3_free_blocks: Freeing blocks not in datazone - block = 3305048842, count = 1]

[KMSG:<0>Assertion failure in do_get_write_access() at transaction.c:589: "handle->h_buffer_credits > 0"]

[KMSG:<4>invalid operand: 0000]

[KMSG:<4>CPU:    0]

[KMSG:<4>EIP:    0010:[<c0156d09>]    Not tainted]

[KMSG:<4>EFLAGS: 00010286]

[KMSG:<4>eax: 00000063   ebx: cefb4ac0   ecx: c55ac3c0   edx: fffffffe]

[KMSG:<4>esi: c7083f00   edi: 00000002   ebp: c7083f00   esp: c0605c20]

[KMSG:<4>ds: 0018   es: 0018   ss: 0018]

[KMSG:<4>Process videoexe (pid: 3412, stackpage=c0605000)]

[KMSG:<4>Stack: c0232720 c02328e6 c0232700 0000024d c0232921 cce0f000 ccd24a60 cefb4ac0 ]

[KMSG:<4>       cce0f094 cce0f094 00000000 00000000 cce0f000 cc434760 c01570d8 ccd24a60 ]

[KMSG:<4>       cefb4ac0 00000000 00000000 c7083f00 ccd24a60 c39cf460 c0150798 ccd24a60 ]

[KMSG:<4>Call Trace: [<c01570d8>] [<c0150798>] [<c01570e0>] [<c01508fc>] [<c0150b98>] ]

[KMSG:<4>   [<c0150a68>] [<c0150a68>] [<c0150a68>] [<c0150c79>] [<c0150f0b>] [<c015763c>] ]

[KMSG:<4>   [<c01516c2>] [<c0151727>] [<c0121b5f>] [<c011feae>] [<c011ff0d>] [<c0140567>] ]

[KMSG:<4>   [<c01518d1>] [<c01406bc>] [<c012cc3d>] [<c013796a>] [<c012dae7>] [<c012de3a>] ]

[KMSG:<4>   [<c0106b87>] ]

[KMSG:<4>]

[KMSG:<4>Code: 0f 0b 83 c4 14 8b 54 24 28 8b 42 04 48 8b 4c 24 28 89 41 04 ]

^^^

That’s the final nail in the coffin: the process then locks solid (but
still has threads running, which then run out of memory, and total chaos
ensues) and the box has to be power-cycled. The disks fsck when the
machine comes back up, with no reports of any hardware IO errors.

The profile of the machine is that it’s doing lots of disk IO, as it’s
capturing video to disk. There are 3 partitions in use, each roughly
40 gig in size, and the problem is only occurring on one of them.
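
As a sanity check on the numbers in the log above, here is a minimal sketch. It assumes that the "16:05" prefix on the "beyond end of device" lines is the device's major:minor in hex, and that want/limit are counted in 1 KiB units; both are my reading of the 2.4 block layer rather than anything confirmed here. The helper names are made up for illustration.

```python
def decode_dev(code):
    """Split a hex 'major:minor' string such as '16:05' into decimal numbers.

    Assumption: the kernel prints the device as two hex bytes.
    """
    major, minor = (int(part, 16) for part in code.split(":"))
    return major, minor


def as_unsigned32(block):
    """Reinterpret a block number that was printed as signed 32-bit."""
    return block & 0xFFFFFFFF


def kib_to_gib(kib):
    """Convert a size in 1 KiB units to GiB (assumes limit= is in KiB)."""
    return kib / (1024 * 1024)


LIMIT_KIB = 39230698  # limit= value from the log

print(decode_dev("16:05"))               # major/minor of the failing device
print(round(kib_to_gib(LIMIT_KIB), 1))   # partition size in GiB

# The negative block numbers from ext3_free_branches, viewed as unsigned.
for block in (-1576348012, -642516409):
    print(block, "->", as_unsigned32(block))
```

If those assumptions hold, "16:05" decodes to (22, 5), the same device as ide1(22,5), and the limit works out to about 37.4 GiB, which matches a roughly 40 gig partition. The block numbers being freed are far larger than anything a partition of that size could contain, so they look like garbage read from indirect blocks rather than plausible block addresses.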

The only thing to note is that this partition and another filled up and
I had to make space (just by deleting files; the maximum used space is
now around 76%). Once I’d cleaned up the file system, the system ran
fine until the first error was reported at 15:50 on Thursday (as shown
above), and then on Friday (last night) it just went haywire.

The only other thing to note is that there was a panic in kswapd a
number of hours earlier, but I’ve seen these on other systems running
2.4.18 and they don’t seem to cause any problems (I think).

As I’ve mentioned, I’ve seen the same behavior before on other systems.
The specs for all of them are:

Abit ST6 motherboard with 1.2 GHz Celeron

2 x disks (varying sizes and makes)

128 MB RAM

AGP Graphics Card

Ethernet

Bt848 capture cards (2-3 depending on customer)

I’m really pulling my hair out. I don’t know why they are doing this.
These are all on customer sites (they never go wrong in the office; each
one that has gone bad has been in a different environment, i.e. warm,
cold, with no power spikes or anything similar reported), and at the
moment, as you can imagine, we are not flavor of the month, so I really
need to come up with a bullet-proof plan. One customer is on his second
box, which did the same as the first after 2 days, yet it ran in our
office for 2 weeks with no problems!

I know I’ve probably not given enough info (sorry, I can’t get a better
trace of the panic), but any help that anyone can give will be really,
really appreciated.

Thanks,

Glen