Hi,

Thanks for the reply. I received another response directly and want to include my reply here:

> > Did you use the ext3 patches for the vanilla 2.4.20?
> >
> > http://www.zip.com.au/~akpm/linux/ext3/
> > ...
>
> I have not applied those patches, just bare
> 2.4.20 with the cyclades and ipsec patches. It will be difficult to do
> since the machine is in a colocation cabinet and will be down at least
> one hour should something go wrong. Do you recommend I stick with ext2
> for now? Is the switchover really as easy as it sounds? I love the quick
> recovery times in ext3 but since it is a production machine with millions
> of users' files I can't manage applying kernel updates regularly, as much
> of a cop-out as that may sound... In this case do you recommend I just
> stick with ext2? Any comments?

> > mofo kernel: Assertion failure in journal_stop() at transaction.c:1384: "journal_current_handle() == handle"
>
> Odd. That is one assert failure I have _never_ seen reported. Handle
> mismatches in journal_start have happened from time to time when there
> has been illegal recursion in the VM, but not in journal_stop.
>
> The most likely cause would seem to me to be a stack overflow --- the
> per-process field which holds the journal handle is right at the end of
> the task struct, so it's one of the first fields to be clobbered in the
> event of a stack overflow.
>
> If that has happened, it's not due to ext3 --- the stack here isn't
> close to being that sort of size --- but it's entirely possible that
> there were IRQ routines operating during the function which overflowed
> the stack.
>
> In particular, we've seen that happen before with heavy network
> activity, especially with multiple NICs, because the random sampling
> that occurs for /dev/random during NIC activity was a heavy stack user.
>
> There's a patch to address that in the very latest Marcelo kernel
> trees. It reduces the stack usage of the random sampling by several
> hundred bytes. The fix is in the 2.4.21-pre7 kernel.

No surprise you have not seen it. I found nothing about it on Google Groups
searching with various strings. The whole issue is possibly moot since I do
not have the patch mentioned above applied to the kernel.

I think my machine matches your profile for heavy network activity, with two
Fast Ethernet interfaces and a T1 link. There is a lot piled on the machine
at the moment as we are in the process of migrating colocations and are short
a few machines. At the moment this fine server is acting as a firewall, a
zebra/BGP router on two interfaces, a MySQL server, and a file server to NT
clients (smbd), and it runs some heavy CPU processes for document conversions.
Granted, that is a lot of stuff, but at the time of the crash the machine was
not doing much other than copying the file. Its network activity will peak at
about 2 Mb/sec on one Ethernet interface and 1.5 Mb/sec on the T1 and run all
day like this. Though there could have been a burst in activity, overall it
was very quiet at the time from both network and processor load standpoints.

I am not totally clear on the network load theory. Is the susceptibility to
stack overflow something particular to kjournald, or is it that the network
load could cause a crash in any kernel process? Could it crash a non-kernel
process, like mysqld? This really is the first time the machine has crashed,
and I have seen it run smoothly every afternoon at load averages of 5 or 6
during peak periods while saturating its T1. Relatively speaking, the Fast
Ethernet loads are not very high.
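To make sure I follow the stack overflow explanation, here is a toy C program
showing how I picture the 2.4 arrangement, with the task struct and the kernel
stack sharing one 8 KB block so that an overflow eats the struct's last fields
first. This is purely illustrative on my part; the struct, field names, and
sizes below are made up by me, not the real kernel definitions:

#include <stdio.h>
#include <string.h>

/*
 * Toy illustration only (not kernel code).  My possibly-wrong mental
 * model of 2.4: the task struct and the roughly 8 KB kernel stack share
 * one allocation, the stack grows down toward the struct, so an
 * overflow scribbles on the struct's last-declared fields first.
 */
struct fake_task {
    int pid;                /* early fields are clobbered last */
    /* ... imagine many more fields here ... */
    void *journal_handle;   /* late field: first casualty of an overflow */
};

union fake_task_union {
    struct fake_task task;
    unsigned char stack[8192];  /* the same 8 KB block holds struct + stack */
};

int main(void)
{
    union fake_task_union u;

    memset(&u, 0, sizeof(u));
    u.task.journal_handle = (void *)0x1234;

    /* Simulate a stack that has grown down from stack[8192] just far
     * enough to dip below its reserved area, into the end of the struct. */
    size_t depth = sizeof(u.stack) - sizeof(struct fake_task) + sizeof(void *);
    memset(u.stack + sizeof(u.stack) - depth, 0xff, depth);

    printf("pid            = %d\n", u.task.pid);            /* survives */
    printf("journal_handle = %p\n", u.task.journal_handle); /* trashed  */
    return 0;
}

If that picture is roughly right, I can see how any code path that goes deep
enough on the kernel stack, interrupt handlers included, could trash the
handle no matter which process happens to be current.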
I am a little skeptical of this since we have been running a dozen servers
with similar setups (less the T1 interface; i.e. MySQL, Samba, external SCSI
RAID, firewalling on two interfaces) for several years with ext2 and have
never seen a crash that was not due to hardware failure. Granted, we are not
running kjournald to crash :) It's just that I have never seen network loading
cause a crash in much more heavily loaded servers. It could be some external
interference that caused the stack overflow, but the network activity was
really low. Maybe the SCSI activity? I might also suspect the pc300 driver, as
I have not used the card before, but then again 1.544 Mbps with 1500-byte
packets doesn't create many ISRs.

One note on the SCSI activity... With the machine online one night two weeks
back, I copied about one million files consuming about 60 GB from one of the
ext2 SCSI partitions to sda4 (ext3) without a problem.

> > The file I was moving as you can see is a 2 GB file, ie. right at the limit of
> > ext2 capacity, and I am wondering if this is the culprit.
>
> No, ext2/3 can both operate beyond 2GB quite safely.

ext2 definitely can only handle files less than 2GB in size. If you script
something to write past this limit to a file (or, ahem, forget to truncate a
large table and mysqld does it for you) you will see the file get to
2147483647 bytes, and any new writes will block or fail. This is the size of
the file I was copying.

So getting back to my own interests here... what would you recommend I do? It
sounds like you believe using ext2 will not improve things, i.e. that the
network ISR activity is a likely culprit. Should I try the patches here
http://www.zip.com.au/~akpm/linux/ext3/ or try 2.4.21-pre7? In the latter,
should I still apply the 2.4.20 patches?

Also, do you have a ballpark figure on how time-consuming it would be to
convert my ext2 partitions to ext3, with them unmounted? One is 150 GB and the
other 190 GB; each partition has between 500k and 3 million files across maybe
15 directories, if that's a factor. Are we talking 20 minutes or 5+ hours?
ext2 fscks on the 150 GB partition can take 4 hours.

I may opt for using ext2 for now and switching back to ext3 when I can
physically mess with the server to do the kernel updates, as much as I hate to
do that. The uptime benefits of ext3 are too good to ignore.

I really appreciate your help.

Mike
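P.S. In case it is useful, below is the sort of quick test I was thinking of
running on a scratch ext2 partition (not the production box) to double-check
the 2 GB behavior I described above. It is only a sketch of my own; the path
is a placeholder, and I am assuming the old 32-bit off_t rules apply, i.e.
the program is built plainly, without -D_FILE_OFFSET_BITS=64 or O_LARGEFILE:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder path on a scratch partition, not the real box. */
    const char *path = "/mnt/scratch/bigfile";

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Seek so the next byte written lands at offset 2147483646, making
     * the file exactly 2147483647 bytes (2^31 - 1).  The file is sparse,
     * so this does not actually eat 2 GB of disk. */
    if (lseek(fd, 2147483646L, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        return 1;
    }

    if (write(fd, "x", 1) == 1)
        printf("write up to 2147483647 bytes: ok\n");
    else
        printf("write up to 2147483647 bytes: %s\n", strerror(errno));

    /* This write would push the file past 2^31 - 1 bytes. */
    if (write(fd, "x", 1) == 1)
        printf("write past 2147483647 bytes: ok (no 2 GB ceiling hit)\n");
    else
        printf("write past 2147483647 bytes: %s\n", strerror(errno));

    close(fd);
    unlink(path);
    return 0;
}

My thinking, if I have the LFS details right, is to run it once as-is and once
with O_LARGEFILE added to the open() flags (with _GNU_SOURCE defined): if the
second run gets past the boundary while the first stops at 2147483647 bytes,
the ceiling I keep hitting is in the 32-bit offset handling of the programs
doing the writing rather than in ext2 itself.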