Hi,

I have the machine back online. /dev/sda4 (the partition that crashed) recovered in about 2 seconds, with "recovering journal" as the only e2fsck output, so I am running with it. Here are more details on the machine:

[*ROOT* mofo /home/mgh 23 ] cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) XP 1800+
stepping        : 2
cpu MHz         : 1534.037
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3060.53

[*ROOT* mofo /home/mgh 24 ] uname -a
Linux mofo 2.4.20 #14 Wed Mar 19 16:48:34 CST 2003 i686 unknown

[*ROOT* mofo /home/mgh 25 ] df
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda1              8064272   4530996   3123624  60% /
/dev/hda3             29387900   1485288  26409772   6% /home
none                    127884         0    127884   0% /dev/shm
/dev/sda3            151195204 138014604   5500328  97% /mnt/sda3
/dev/sda4            193010776  75844724 107361584  42% /mnt/sda4
/dev/sda1             33032196  27801288   3552924  89% /mnt/sda1

[*ROOT* mofo /home/mgh 26 ] mount
/dev/hda1 on / type ext3 (rw)
none on /proc type proc (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/hda3 on /home type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/sda3 on /mnt/sda3 type ext2 (rw)
/dev/sda4 on /mnt/sda4 type ext3 (rw)
/dev/sda1 on /mnt/sda1 type ext2 (rw)

[*ROOT* mofo /home/mgh 27 ] cat /proc/meminfo
        total:     used:      free:  shared: buffers:  cached:
Mem:   261910528 249167872   12742656       0 25636864 86761472
Swap: 1052827648  15437824 1037389824
MemTotal:       255772 kB
MemFree:         12444 kB
MemShared:           0 kB
Buffers:         25036 kB
Cached:          80408 kB
SwapCached:       4320 kB
Active:         133188 kB
Inactive:        91704 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       255772 kB
LowFree:         12444 kB
SwapTotal:     1028152 kB
SwapFree:      1013076 kB

[*ROOT* mofo /home/mgh 30 ] lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8367 [KT266]
00:01.0 PCI bridge: VIA Technologies, Inc. VT8367 [KT266 AGP]
00:09.0 Communication controller: Cyclades Corporation PC300 TE 2 (rev 01)
00:0b.0 SCSI storage controller: Adaptec AIC-7881U
00:0d.0 Ethernet controller: Bridgecom, Inc: Unknown device 0985 (rev 11)
00:0f.0 Ethernet controller: Bridgecom, Inc: Unknown device 0985 (rev 11)
00:10.0 VGA compatible controller: Silicon Integrated Systems [SiS] 82C204 (rev 21)
00:11.0 ISA bridge: VIA Technologies, Inc.: Unknown device 3147
00:11.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
00:11.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 23)
00:11.3 USB Controller: VIA Technologies, Inc. UHCI USB (rev 23)

sda is an external Belkin RAID on an Adaptec 2940:

Apr 17 23:56:47 mofo kernel: SCSI subsystem driver Revision: 1.00
Apr 17 23:56:47 mofo kernel: scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.8
Apr 17 23:56:47 mofo kernel: <Adaptec 2940 Ultra SCSI adapter>
Apr 17 23:56:47 mofo kernel: aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
Apr 17 23:56:47 mofo kernel:
Apr 17 23:56:47 mofo kernel: Vendor: BellStor Model: Rev:
Apr 17 23:56:47 mofo kernel: Type: Direct-Access ANSI SCSI revision: 02
Apr 17 23:56:47 mofo kernel: (scsi0:A:3): 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
Apr 17 23:56:47 mofo kernel: scsi0:A:3:0: Tagged Queuing enabled. Depth 253
Apr 17 23:56:47 mofo kernel: Attached scsi disk sda at scsi0, channel 0, id 3, lun 0
Apr 17 23:56:47 mofo kernel: SCSI device sda: 1073723392 512-byte hdwr sectors (549746 MB)
Apr 17 23:56:48 mofo kernel: sda: sda1 sda2 sda3 sda4

[*ROOT* mofo /usr/src 201 ] fdisk /dev/sda

The number of cylinders for this disk is set to 66836.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sda: 255 heads, 63 sectors, 66836 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sda1             1      4178  33559753+  83  Linux
/dev/sda2          4179     23301 153605497+  83  Linux
/dev/sda3         23302     42424 153605497+  83  Linux
/dev/sda4         42425     66836 196089390   83  Linux

Command (m for help): q

[*ROOT* mofo /usr/src 202 ] cat /etc/fstab
LABEL=/       /            ext3     defaults                 1 1
none          /dev/pts     devpts   gid=5,mode=620           0 0
LABEL=/home   /home        ext3     defaults                 1 2
none          /proc        proc     defaults                 0 0
none          /dev/shm     tmpfs    defaults                 0 0
/dev/hda2     swap         swap     defaults                 0 0
/dev/fd0      /mnt/floppy  auto     noauto,owner,kudzu       0 0
/dev/sda1     /mnt/sda1    ext2     noauto                   0 0
/dev/sda2     /mnt/sda2    ext2     noauto                   0 0
/dev/sda3     /mnt/sda3    ext2     noauto                   0 0
/dev/sda4     /mnt/sda4    ext3     noauto                   0 0
/dev/cdrom    /mnt/cdrom   iso9660  noauto,owner,kudzu,ro    0 0

The kernel is a minimal 2.4.20 with the freeswan 1.99 patch applied. Not that it could not be related, but I have been running freeswan since 1999 on 40+ machines under various kernels without any problem. I also applied the pc300-3.4.7 patch to support the Cyclades PC300 T1 card. Otherwise the kernel is as stripped down as I could make it. The CPU option is "(Athlon/Duron/K7) Processor family"; module support is disabled and everything is compiled in statically.

There are no SCSI or other hardware errors surrounding the kjournald crash (or ever). After kjournald crashed I could run df without it hanging, but an ls on /mnt/sda4 hung, as did all other processes hitting it (remote NT machines using Samba). killall -9 smbd never worked, and umount /mnt/sda4 reported busy. The load average jumped to about 30 during all this. umount /mnt/sda1 worked, but fsck showed it as uncleanly unmounted, though it didn't find any errors.
I could not umount /dev/sda3 because it was busy, but finally did a umount -km /mnt/sda3, which killed my shell, and I was unable to log in thereafter.

Without thinking too much about it, I deleted the file being moved to sda4 when it crashed. Only maillog.MYD had copied over and it showed a size of about 270 MB.

As for the error itself, it looks like /usr/src/linux/fs/jbd/transaction.c:1384 is:

J_ASSERT (journal_current_handle() == handle)

in function journal_stop(), though anyone on this board is 90 steps ahead of me as to what this asserts.

Also, rereading the ext3 FAQ at http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html, it looks like 2.4.16 and up should not require the patch.

If there is any other information I can provide, please ask, and thanks for your help.

Mike

> Hi, If this is a redundant post I apologize. I am running 2.4.20 on what has been
> a very stable Athlon machine for months, tried to move a 2 GB file from an ext2
> partition to an ext3 and kjournald crashed. Here are the last remnants of my
> shell scrollback:
>
> [*ROOT* mofo /mnt/sda1/mysql/fd 641 ] ll oldmail/
> total 2363288
> -rw-rw----    1 mysql    mysql    2147483647 Jan 23 18:04 maillog.MYD
> -rw-rw----    1 mysql    mysql     270138368 Jan 23 18:06 maillog.MYI
> -rw-rw----    1 mysql    mysql          8910 Mar 22  2002 maillog.frm
> [*ROOT* mofo /mnt/sda1/mysql/fd 642 ] df
> Filesystem           1k-blocks      Used Available Use% Mounted on
> /dev/hda1              8064272   4529888   3124732  60% /
> /dev/hda3             29387900   1488316  26406744   6% /home
> none                    127884         0    127884   0% /dev/shm
> /dev/sda1             33032196  30162240   1191972  97% /mnt/sda1
> /dev/sda3            151195204 138014604   5500328  97% /mnt/sda3
> /dev/sda4            193010776  75750204 107456104  42% /mnt/sda4
> [*ROOT* mofo /mnt/sda1/mysql/fd 643 ] mv oldmail/* /mnt/sda4/mgh/oldmysqllogs/
> Segmentation fault
> [*ROOT* mofo /mnt/sda1/mysql/fd 644 ]
> Message from syslogd@mofo at Thu Apr 17 21:40:13 2003 ...
> mofo kernel: Assertion failure in journal_stop() at transaction.c:1384: "journal_current_handle() == handle"
>
> [*ROOT* mofo /mnt/sda1/mysql/fd 644 ]
> [*ROOT* mofo /mnt/sda1/mysql/fd 644 ] fg
>
> Anything accessing /mnt/sda4 hung at this point (smbd among others) and I could
> not cleanly shut down the machine. Finally a umount -km /mnt/sda3 (not sda4) killed lots
> of procs, among them sshd, and it is game over until a guy gets onsite to hit the reset button.
>
> I can't access the machine at the moment but this looks like a hot list so I am
> posting what I can. It is an Athlon XP 2000+ with 256 MB DDR (not certain on speed,
> definitely an Athlon XP) running straight 2.4.20 from the bz2 at ftp.kernel.org
> w/o module support, compiled for Athlon, ext3 compiled in statically, and again this
> has been acting as a mysql server for months without a hitch. It is a Red Hat 7.2 dist
> with all the updates as of about one month ago installed, less the custom kernel.
> The file I was moving, as you can see, is a 2 GB file, i.e. right at the limit of
> ext2 capacity, and I am wondering if this is the culprit.
>
> Here is what was logged before I lost the machine:
>
> Apr 17 21:40:13 mofo kernel: kernel BUG at transaction.c:1384!
> Apr 17 21:40:13 mofo kernel: invalid operand: 0000
> Apr 17 21:40:13 mofo kernel: CPU: 0
> Apr 17 21:40:13 mofo kernel: EIP: 0010:[journal_stop+108/560] Not tainted
> Apr 17 21:40:13 mofo kernel: EIP: 0010:[<c0158eec>] Not tainted
> Apr 17 21:40:13 mofo kernel: EFLAGS: 00010282
> Apr 17 21:40:13 mofo kernel: eax: 00000063 ebx: 00000001 ecx: 00000009 edx: c831bf44
> Apr 17 21:40:13 mofo kernel: esi: cdcc7a40 edi: c3739e80 ebp: ccd18ec0 esp: c69e9a00
> Apr 17 21:40:13 mofo kernel: ds: 0018 es: 0018 ss: 0018
> Apr 17 21:40:13 mofo kernel: Process mv (pid: 8133, stackpage=c69e9000)
> Apr 17 21:40:13 mofo kernel: Stack: c03250a0 c0320f67 c0320d18 00000568 c0327540 00000000 00000000 c3739e80
> Apr 17 21:40:13 mofo kernel: cda5e900 c3739e80 c0152617 c3739e80 00000000 c0158935 cbc83930 00000000
> Apr 17 21:40:13 mofo kernel: c313bc90 cdcc7a40 ca39fec0 ccd18ec0 cda5e900 cc283600 00000007 c013e3ce
> Apr 17 21:40:13 mofo kernel: Call Trace: [ext3_dirty_inode+199/256] [journal_get_undo_access+245/288] [__mark_inode_dirty+46/144] [ext3_new_block+112/1936] [journal_cancel_revoke+251/368]
> Apr 17 21:40:13 mofo kernel: Call Trace: [<c0152617>] [<c0158935>] [<c013e3ce>] [<c014d370>] [<c015ca9b>]
> Apr 17 21:40:13 mofo kernel: [do_get_write_access+1183/1216] [journal_dirty_metadata+398/432] [ext3_do_update_inode+759/896] [ext3_do_update_inode+852/896] [ip_nat_fn+467/480] [ipt_hook+28/32]
> Apr 17 21:40:13 mofo kernel: [<c015861f>] [<c0158c8e>] [<c0152117>] [<c0152174>] [<c02cfe53>] [<c02cfb2c>]
> Apr 17 21:40:13 mofo kernel: [journal_cancel_revoke+251/368] [do_get_write_access+1183/1216] [tcp_packet+309/336] [journal_get_write_access+55/80] [journal_cancel_revoke+251/368] [do_get_write_access+1183/1216]
> Apr 17 21:40:13 mofo kernel: [<c015ca9b>] [<c015861f>] [<c02cbf85>] [<c0158677>] [<c015ca9b>] [<c015861f>]
> Apr 17 21:40:13 mofo kernel: [ext3_alloc_block+25/32] [ext3_alloc_branch+85/720] [getblk+40/96] [getblk+57/96] [bread+22/112] [ext3_do_update_inode+759/896]
> Apr 17 21:40:13 mofo kernel: [<c014f649>] [<c014f965>] [<c012e778>] [<c012e789>] [<c012e9c6>] [<c0152117>]
> Apr 17 21:40:13 mofo kernel: [ext3_do_update_inode+852/896] [do_get_write_access+1183/1216] [ext3_get_branch+83/208] [ext3_get_block_handle+437/688] [do_get_write_access+1183/1216] [create_buffers+97/240]
> Apr 17 21:40:13 mofo kernel: [<c0152174>] [<c015861f>] [<c014f7d3>] [<c0150035>] [<c015861f>] [<c012ebd1>]
> Apr 17 21:40:13 mofo kernel: [ext3_get_block+89/96] [__block_prepare_write+230/768] [__jbd_kmalloc+39/160] [block_prepare_write+29/64] [ext3_get_block+0/96] [ext3_prepare_write+124/288]
> Apr 17 21:40:13 mofo kernel: [<c0150189>] [<c012f126>] [<c015e757>] [<c012f9ad>] [<c0150130>] [<c01505dc>]
> Apr 17 21:40:13 mofo kernel: [ext3_get_block+0/96] [generic_file_write+1185/1760] [ext3_file_write+31/176] [sys_write+149/240] [schedule+786/832] [system_call+51/56]
> Apr 17 21:40:13 mofo kernel: [<c0150130>] [<c0122b91>] [<c014e13f>] [<c012ce25>] [<c0110222>] [<c0106d83>]
> Apr 17 21:40:13 mofo kernel:
> Apr 17 21:40:13 mofo kernel: Code: 0f 0b 68 05 18 0d 32 c0 83 c4 14 f6 47 18 04 ba 01 00 00 00
>
> Looking at http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html where I found the link to
> this list, it says to use ext3-0.0.7a.tar.bz2, which looks like a kernel patch, which I have not
> done. The kernel was compiled from the 2.4.20 dist with no ext3 patches. I did install
> e2fsprogs-1.32 but no kernel patches. If this is the issue, please just tell me I am an
> idiot and I will be gone. I am 99% sure this is not a hardware issue.
>
> My first priority is getting the machine on its feet along with that partition, whose integrity
> I now question. Can I substitute ext2 for ext3 in fstab and mount it as ext2, after ext2 fscking
> it?
>
> If you have a moment to spare, any insight on this late good Thursday would do me a great favor,
> and maybe I have found a legitimate bug here.
> I should have the machine online in 30 minutes
> if there is more info I can provide.
>
> Thanks,
> Mike
>
> _______________________________________________
> Ext3-users@redhat.com
> https://listman.redhat.com/mailman/listinfo/ext3-users
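Two of the numbers in the report above can be cross-checked with a few lines of Python. This is only a sketch under stated assumptions, not a diagnosis. It assumes the oops "Code:" bytes follow the i386 2.4 BUG() layout, where the ud2 opcode (0f 0b) is followed by a 16-bit little-endian source line number, and it checks whether the listed size of maillog.MYD is the largest value a signed 32-bit file offset can hold. The function and constant names are made up for illustration.

```python
# Cross-check two figures quoted in the report (illustrative helper names).

MAILLOG_MYD_SIZE = 2147483647  # size of maillog.MYD from the `ll` listing
OOPS_CODE = "0f 0b 68 05 18 0d 32 c0 83 c4 14 f6 47 18 04 ba 01 00 00 00"


def is_signed_32bit_max(size: int) -> bool:
    """True if size equals 2**31 - 1, the largest signed 32-bit value."""
    return size == 2**31 - 1


def decode_bug_line(code_bytes: str) -> int:
    """Return the line number stored right after a ud2 opcode.

    Assumption: the bytes are laid out as 0f 0b (ud2) followed by a
    16-bit little-endian line number, as in the i386 2.4 BUG() macro.
    """
    raw = bytes.fromhex(code_bytes)  # fromhex ignores the spaces
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("code does not start with ud2 (0f 0b)")
    return int.from_bytes(raw[2:4], "little")


if __name__ == "__main__":
    print(is_signed_32bit_max(MAILLOG_MYD_SIZE))  # True
    print(decode_bug_line(OOPS_CODE))             # 1384
```

The decoded value agrees with the "kernel BUG at transaction.c:1384" line in the log. The file size check only shows that maillog.MYD sits exactly at 2^31 - 1 bytes; whether that has anything to do with the assertion failure is not something this sketch can establish.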