Sagar Borikar wrote:
>> Can you please post the log which leads to the softreset?  The above
>> only shows EH's response to the problem not the cause itself.  Also, if
>> possible, please try more recent kernel.  Recent ones have better error
>> reporting and will help debugging the problem even if it can't be used
>> for production.
>
> Unfortunately, we can't upgrade kernel as this is a production system.
> And these are the only logs which I get on the console and dmesg
> output, nothing else comes from sata controller. So I am clueless from
> where the soft reset is triggered.

Eh... That shouldn't happen but ISTR fixing paths where diagnostics
weren't reported correctly. :-(  You can make it more verbose by editing
drivers/ata/libata-eh.c::ata_eh_report().  There are several conditions
which make it bail out without reporting.  Remove them so that it always
reports.

>> SATA IO errors are much more common than PATA ones probably due to its
>> high signal rate.  Even on an otherwise perfectly healthy system, SATA IO
>> errors occur occasionally (say, once in several months).  However, if
>> such problems are frequent and regular, it does indicate a problem.  One
>> of not-so-rare causes for such problems is power or interference
>> problem.  Using different power supply and hooking up the harddrive to a
>> separate power supply is the easiest way to rule this out.
>
> Understood but the point is we don't see soft reset on other platforms
> with different sata controller. So I would guess that it could be a
> combination of power supply and the sata controller. I'll update you
> with the results of separate power supply to drive and system test.

Hmmm.. Combination of PSU + SATA controller isn't very likely. :-(
Well, please rule it out anyway as the other possibility (silent data
corruption) is scarier and much more difficult to track down.

>> This momentary power loss (short voltage drop will do the job) is not so
>> rare.  A few months ago, I tracked down a fs corruption problem on a
>> server from a major vendor to this problem and it wasn't a single
>> machine.  The whole line or production batch was problematic.
>
>> You can often hear the head doing an emergency unload and then spinning
>> back up shortly after.  This also increments emergency unload and/or
>> start stop count in the smart output, so if those counters increase
>> after such IO errors, it's likely that you're experiencing this problem.
>>
>> Hope it helped.
>
> Thanks a ton for this detailed information.

Can you please report full boot log + smartctl -a output before and
after an error?

--
tejun
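
For comparing the two smartctl -a captures requested above, something
like the following might help.  It is only a rough sketch: it assumes the
usual ten-column ATA attribute table that smartctl prints, and the
attribute IDs watched here (4, 12, 192) are my guess at the counters
relevant to a momentary power loss.  Attribute names and even which IDs
a drive reports vary by vendor, so adjust WATCHED_IDS to what your drive
actually shows.  The function names are made up for illustration.

```python
# Assumed IDs: 4 = Start_Stop_Count, 12 = Power_Cycle_Count,
# 192 = Power-Off_Retract_Count (called an emergency retract/unload
# count by some vendors).  These are assumptions, not a fixed standard.
WATCHED_IDS = {4, 12, 192}


def parse_smart_attributes(text):
    """Return {id: (name, raw_value)} from 'smartctl -a' text output.

    Only rows that look like attribute table entries are considered:
    ID, NAME, FLAG (0x....), VALUE, WORST, THRESH, TYPE, UPDATED,
    WHEN_FAILED, RAW_VALUE.  Rows whose raw value does not start with
    a plain integer are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue
        if not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        raw = fields[9]
        if not raw.isdigit():
            continue
        attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def changed_counters(before_text, after_text, watched=WATCHED_IDS):
    """Return {name: (old, new)} for watched counters that changed."""
    before = parse_smart_attributes(before_text)
    after = parse_smart_attributes(after_text)
    changes = {}
    for attr_id in sorted(watched):
        if attr_id in before and attr_id in after:
            name, old = before[attr_id]
            _, new = after[attr_id]
            if new != old:
                changes[name] = (old, new)
    return changes
```

Save the smartctl -a output to a file before and after an error, read
both files, and pass their contents to changed_counters().  If the
start/stop or retract counters went up across an IO error, that points
toward the power problem described above.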