Well, after carefully reading Kenneth's answer, I finally decided to fight EMI, ESD, RFI and all that kind of trouble.

My disks were mounted on rubber cylinders and thus had NO ground connection to the chassis. I installed a "ground wire" on every disk (a wire connected on one side to the disk casing and on the other side to the chassis). I got fewer errors.

The 2 machines were connected to each other through a small gigabit switch, and I noticed that this switch had NO grounding either (just a small AC/DC converter as power input). So I installed a ground wire from the switch to one of the chassis (I could also have connected this ground wire to the ground of an AC outlet...).

Remember that the 2 machines exchange a continuous 200 Mbps flow for hours to copy all the data from one to the other!

Now I have NO MORE ERRORS.

I can say that it's very likely that all the troubles I had (disk errors, bad blocks, file system corruption, etc.) came from the ungrounded switch (and possibly the disks).

So now I "ground" everything I can!

Many thanks to Kenneth for this clue!

hth

At 12:44 13/08/2004 -0400, you wrote:
>Two cents worth, being oblivious to previous discussions in
>this thread.
>See below in-line.
>
>> -----Original Message-----
>> From: redhat-list-bounces@xxxxxxxxxx
>> [mailto:redhat-list-bounces@xxxxxxxxxx]On Behalf Of Thierry ITTY
>> Sent: Friday, August 13, 2004 11:33 AM
>> To: redhat-list@xxxxxxxxxx
>> Subject: bad blocks... random death
>>
>>
>> This continues the discussions about bad disk blocks that are not really
>> bad and Red Hat 9 dying randomly.
>>
>> There are now a few of us on this list experiencing various symptoms
>> (DMA errors, bad blocks on disks, system freeze or death) that look like
>> hardware problems.
>> After talking together we can now say that those problems
>> are pure OS problems.
>
>If all are SMP systems, then perhaps there is a spinlock conflict
>(multi-CPU contention) problem with the disk driver.
>But I doubt that the disk drivers in the kernel have changed
>in years.
>I am running RH9 on several heavily used SCSI-based Compaq
>multi-CPU machines with no problems.
>So based on my experience, I don't believe this is a software
>issue.
>
>>
>> The disks with bad blocks actually work fine elsewhere (in my
>> case I ran the manufacturer's low-level diags and no disk had any problem.
>> And isn't it very strange that 10 disks get the same problems at the same
>> time?!)
>
>Not if you have an EMI (electromagnetic interference)
>shielding issue. The drives are fine.
>They might be cross-polluting each other, the cables and/or
>the controllers with EMI.
>That will corrupt the bit stream between the drives and the
>controller and give you errors.
>
>The more heavily you use the drives, the more the
>magnetic coils that move the heads are used. Those coils
>put out an EMI field.
>The more you use the drives, the more consistent that EMI
>field is, and without good grounding
>it "leaks" into whatever copper ground path is available,
>including your drive cables, power cables, etc.
>Normally EMI is drained off through the drive's ground to
>the chassis, and through the chassis to the
>ground line on the power supply and on to earth.
>
>Check the following, if you haven't already, as it applies to
>your system:
>
>1) Get an electrical outlet tester at your local Home
>Depot/Lowe's et al.
>
>2) Check the outlets your systems are plugged into. (If you
>use outlets other than NEMA 5-15R/5-20R (household type),
>then get a tester or an electrical testing service in to check
>your grounds.)
>
>3) Make sure you have a good reliable earth ground at the
>outlet. If you don't, get it fixed.
>You would be surprised at how many outlets don't have valid
>earth grounds.
>If you are in a commercial building, your data center
>outlets should have been installed with
>ISOLATED grounds, that is, a separate ground wire between
>the power panel and the receptacle.
>Most commercial electrical installations use the metal jacket as a ground
>path, and that tends to come apart over time
>(i.e. NO MORE GROUND).
>
>4) Check the power supply. Make sure you are not
>overloading it past its rated maximum output. Make sure
>that it is grounded to the chassis and to the earth ground.
>Normally it grounds the chassis through its case,
>but some have separate ground connections; look for ground
>screw connections.
>
>5) If your drives have ground screws or tabs on them,
>connect them to a reliable chassis ground point.
>Don't assume they have a good ground through the drive
>mounting screws.
>
>6) Use round shielded cables and watch the grounds on them.
>If the shields are grounded at a single end,
>make sure that the connected end is connected to a valid
>ground source.
>
>7) Grounds are normally connected at a single end to prevent
>ground fault loops; that is, you don't want more than one
>ground path here if you can help it. Multiple ground paths
>won't help and can hurt under the wrong circumstances. Drives
>with ground tabs don't generally ground through the mounting
>screws, but check the drive specs. A cable with the shield
>connected at both ends is also expected to ground the
>drive; the cable
>should connect to a ground pin on the drive's
>interface.
>
>8) If you have these drives "dense packed" in your chassis,
>you might want to consider putting
>grounded shields between them if all else fails, grounded
>copper plates for example.
>
>9) Make sure that you route the power cables away from the
>drive controller cables within the chassis.
>
>10) Look for ways that EMI could be crossing.
>
>11) You might just have one really EMI-noisy drive. There
>are EMI meters that can be used
>to measure EMI levels.
>
>12) You can also be subject to a different wavelength of
>radiation known as RFI, or Radio Frequency
>Interference.
>
>
>> The problem happens on various machines (Gigabyte, Asus,
>> Athlon, Pentium, Maxtor, Western...).
>>
>> It seems to be related to high-load periods (in my case a
>> heavily used file server).
>>
>> We've been advised to change the disks' DMA settings. I tried
>> various things (no DMA at all, forcing mdma0 or udma2). The system behaves
>> differently (either no errors, or other errors such as DMA timeouts), but it's
>> not working very well
>> (for example, deactivating DMA on the disks lowers the average network
>> throughput from 50 MB/s to 1.5 MB/s, more than 30 times slower!).
>>
>> We really need help investigating this problem, which causes
>> I/O errors and fs corruption!
>>
>> tia
>>
>>
>> --
>> redhat-list mailing list
>> unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
>> https://www.redhat.com/mailman/listinfo/redhat-list
>>

- * - * - * - * - * - * -
Of course I'm a perfectionist!
But couldn't I be better at it?
        Thierry ITTY
eMail : Thierry.Itty@xxxxxxxxxxxx               FRANCE
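
As a rough sanity check on the throughput figures quoted in this thread, here is a minimal Python sketch. It is only back-of-the-envelope arithmetic: the function names are made up for illustration, and it assumes 8 bits per byte and decimal units (1 MB = 1,000,000 bytes).

```python
def mbps_to_mb_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8.0


def slowdown_factor(before_mb_s: float, after_mb_s: float) -> float:
    """Return how many times slower 'after' is compared with 'before'."""
    if after_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return before_mb_s / after_mb_s


if __name__ == "__main__":
    # The sustained 200 Mbps copy between the two machines, in MB/s.
    print(f"200 Mbps = {mbps_to_mb_per_s(200):.1f} MB/s")
    # 50 MB/s with DMA enabled versus 1.5 MB/s with DMA disabled.
    print(f"slowdown without DMA: {slowdown_factor(50, 1.5):.1f}x")
```

Under those assumptions, the 200 Mbps copy works out to 25 MB/s, and dropping from 50 MB/s to 1.5 MB/s is a slowdown of roughly 33 times.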