On Wed, 2008-10-29 at 17:59 -0700, MHR wrote:
> On Wed, Oct 29, 2008 at 5:25 PM, Jim Perrin <jperrin@xxxxxxxxx> wrote:
> >
> > The only issue I've ever seen has been with the on-board fakeraid stuff
> > more and more vendors seem to be adding. I've been using SATA disks
> > with centos since the early 4.x days without issue, so you have me at
> > a bit of a loss here. I'd say if anything it's due to controller
> > support, and much of that can be chalked up to what hardware vendors
> > are pawning off as 'controllers' these days.
> >
>
> The one problem I've seen and posted here was w.r.t. smartd error
> reports showing 2^32 - 1 errors on one of the disks (probably my
> system disk) every few minutes. I thought this was more than just a
> bit suspicious, since there are only 4,687,500,000 sectors on a 300GB
> disk, and the likelihood of having errors on 4,294,967,295 (~92%) of
> them is rather slim unless the whole system is crashing a lot (it's
> not). It's a Seagate 300GB, so I ran Seagate's SeaTools on it in
> lightweight mode, and no problems were reported, which is good because
> the disk is only about a year and a half old and has my CentOS root,
> swap, boot and home partitions on it.
>
> I'll dig deeper on this one - sounds fishy to me, too, now....

With my usual jaundiced eye, my first thought is that the fault is not
the obvious one. So I suggest temporarily abandoning "The Usual
Suspects" (TM) - what a *great* movie.

Is it a consistent or sporadic issue? Is the controller on-board or
after-market? If on-board, is the BIOS the latest?

Have you checked the power/data cable connections? The number you
mention makes me think of a bad cable (or connections).

Any pattern if it's recurring? Temperature steady in the area? If you
had a temporary rise/fall in temperature it could have exposed weak
connections, micro-fractures in various cables, poor seating of memory,
add-in cards, etc.

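
As an aside on that number: 2^32 - 1 is the largest value an unsigned
32-bit counter can hold, and it is also what -1 looks like when read as
unsigned, so it reads more like an overflowed or "unknown" value than a
real count of bad sectors. A rough Python sketch of that sanity check
follows. The 512-byte sector size, the decimal-GB convention, and the
helper names are assumptions for illustration only; none of them come
from smartd or from the report above.

```python
# Sanity check on the smartd error count quoted above.
# Assumptions (not stated in the thread): 512-byte sectors and
# decimal gigabytes (1 GB = 10**9 bytes).

SECTOR_BYTES = 512           # assumed sector size
DISK_BYTES = 300 * 10**9     # "300GB" read as decimal gigabytes
UINT32_MAX = 2**32 - 1       # all-ones 32-bit pattern


def sector_count(disk_bytes=DISK_BYTES, sector_bytes=SECTOR_BYTES):
    """Number of whole sectors on a disk of the given size."""
    return disk_bytes // sector_bytes


def looks_like_sentinel(value):
    """True if value is the all-ones 32-bit pattern (-1 read as unsigned)."""
    return value == UINT32_MAX


reported = 4294967295  # the figure from the smartd reports

print("sectors on disk (assumed geometry):", sector_count())
print("reported errors:", reported)
print("more errors than sectors:", reported > sector_count())
print("all-ones 32-bit value:", looks_like_sentinel(reported))
```

Under those assumptions the disk has fewer sectors than the reported
error count, which points away from a genuine count and toward a
counter or reporting problem somewhere between the disk and smartd.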
Any other messages, that might be related, in the log file when it
happens? I'm wondering if some spurious interrupt might be involved.

Have you memtested recently? ISTM that a memory error could "fool" the
system. Re-seated the memory?

How about the kernel version? On the latest kernel, 2.6.18-92.1.13.el5,
I recently got this.

----------------------------------------------------------
Oct 29 07:09:41 centos501 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Oct 29 07:09:41 centos501 kernel: Do you have a strange power saving mode enabled?
Oct 29 07:09:41 centos501 kernel: Dazed and confused, but trying to continue
-----------------------------------------------------------

Never seen before. Only once, so far. I've not yet investigated this.
No recent changes to the system since 5.0 but normal yum updates to
current 5.2 status. The case cover is off right now though, so it could
be some EMI (heh, or an EMP from the recent trash on this list) :-)

That's all I can think of ATM but for power from the utility company or
marginal power supply in the unit.

>
> mhr
>
<snip sig stuff>

HTH
--
Bill