> From: David Chow <davidchow@xxxxxxxxxxxxxxxx>
> Organization: Shaolin Microsystems Ltd.
> Subject: Problem on aic7899 with smp kernel
>
> Dear all,
>
> My machine consistently locks up after the system has powered up for
> around 20 minutes. Each time it locks up the console dumps the following
> messages. I've read some posting in this lists and saying it is a
> hardware problem. However, the problem only exists when booting smp
> kernel. To me, it seems it is a problem come from a badly written driver
> rather than hardware. My kernel is 2.4.20-18smp and is running on a Tyan
> mother board, dual athlon with onboard aic7899, running md with RAID-1
> mirroring, Seagate Cheetah 36GB SCSI. The system can only survive with
> non-smp kernels. Any help is appreciated. Thanks.
>
> regards,
> David Chow

I've been running RedHat 8.0 in SMP mode on an HP/Compaq ProLiant ML370
machine. The machine has a pair of Xeons with hyperthreading switched on
(to make 4 CPU contexts). It also has an aic7899 SCSI controller with
4 drives and a 4-way mirror for the / partition (which contains pretty
much all of the RedHat system).

Install of RH8.0 was no problem. For some reason RH9.0 will not install
because it will not read the CDROM, but that's another issue.

The main problem that I have found is that the / filesystem is steadily
getting corrupted, slowly but surely. After about a week of activity it
is bad enough to need a reinstall, and this is on a 4-way mirror.

Kernel is standard RH8.0:

  Linux version 2.4.18-14smp (bhcompile@xxxxxxxxxxxxxxxxxxxxxxxxxx)
  (gcc version 3.2 20020903 (Red Hat Linux 8.0 3.2-7))
  #1 SMP Wed Sep 4 12:34:47 EDT 2002

Looking at the Compaq web page, they say it all works fine and the only
important thing is to use their SP22002.exe erase program and set the
BIOS to "linux" mode. No problems here.
They also tell you to install a heap of their useless agents so they can
hack up your box with proprietary crap, but none of that includes any
device drivers so it has nothing to do with this problem.

Finally I went to the SuSE site and they do have some interesting
information, which is that you should use the aic7xxx_old.o module for
running the SCSI (I like this module better because it boots faster and
gives more useful /proc/scsi diagnostics). Switching SCSI modules under
RedHat 8.0 is painful because you have to mess with the initrd file in
/boot (maybe someone knows a quick way to do this?).

Anyhow, I did go into the initrd and put in aic7xxx_old.o, got it to use
that when it boots, and got everything reinstalled so that my files are
OK. So far things are looking good. At this point it is a matter of time
whether my files start to corrupt again...

Anyone who thinks they are having a similar problem can easily check
with "rpm --verify -a" and they will see a growing list of "5" flags,
each day a few more of them. After a while executable files stop
executing and then it's game over. Weirdest thing is that even files
that never get written to (i.e. /usr/bin and /bin) start to corrupt,
which doesn't make sense unless some write blocks are going to
completely the wrong address.

By the way, if anyone is curious about the speed of the ProLiant ML370,
the Xeons clock at 2.8G which gives each thread nearly 5600 bogomips.
RAM bandwidth for linear access (i.e. not hopping around) is just under
5G bytes per second while you stay inside the L2 cache (which is 512k)
and then drops to just under 4G bytes per second when you hit main RAM.
RAM bandwidth for a forward-hopping pattern (e.g. linked list, skip list
or similar) is much worse: about 2G bytes per second inside L2 cache and
about 128M bytes per second for main RAM. Obviously there is a big
penalty in access setup times, and I think it speculatively grabs chunks
into cache.
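
As a rough illustration of the two access patterns described above, here
is a minimal Python sketch. It is not the benchmark behind the figures
in this post (that code is not shown), and every name in it
(build_forward_chain, linear_walk, hopping_walk, max_hop) is
hypothetical. In pure Python the interpreter overhead dominates, so it
shows the shape of the two walks and will not reproduce the bandwidth
numbers; a real measurement would need C or similar.

```python
import random
from array import array


def build_forward_chain(n, max_hop, seed=0):
    """Return an array where chain[i] is the index of the node after i.

    Every hop moves forward by 1..max_hop slots (an assumed hop range);
    a stored value of n marks the end of the chain. Slots that get
    hopped over are never read by hopping_walk().
    """
    rng = random.Random(seed)
    chain = array("l", [n] * n)
    i = 0
    while i < n:
        nxt = min(i + rng.randint(1, max_hop), n)
        chain[i] = nxt
        i = nxt
    return chain


def linear_walk(buf):
    """Read every slot in address order (the 'linear access' case)."""
    total = 0
    for value in buf:
        total += value
    return total


def hopping_walk(chain):
    """Follow the chain from slot 0 and count the hops taken.

    Each read depends on the result of the previous one, as when
    walking a linked list or skip list.
    """
    n = len(chain)
    i = 0
    hops = 0
    while i < n:
        i = chain[i]
        hops += 1
    return hops
```

Timing each walk with time.perf_counter() and dividing by the number of
slots actually touched gives a rough per-access cost for each pattern.
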
Since my main application uses mostly chunk access and the chunks are
averaging about 256 bytes, the Xeon seems pretty good. If your app does
a lot of random access in small regions then you will probably be
unimpressed.

Across the SCSI discs I managed 140M bytes per second sustained write
speed, but not with the 4-way mirror; I used some raid-0 for that one.
Also, the 140M is for a linear file write. It's much slower when there
is a bit of seeking involved. Not much you can do about head movement.

Hope this helps, if anyone else is playing with similar gear...

	- Tel

--
Psyche-list mailing list
Psyche-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/psyche-list