On Tue, Nov 17, 2009 at 02:47:24PM +0900, Tejun Heo wrote:
> Hello,
>
> Can you please cc linux-ide@xxxxxxxxxxxxxxx?

Absolutely, I didn't know it was the right list for PMP too. Done.

> > Nov 2 17:03:17 gargamel kernel: ata6.15: exception Emask 0x100 SAct 0x0 SErr 0x200000 action 0x6 frozen
> > Nov 2 17:03:17 gargamel kernel: ata6.15: irq_stat 0x02060002, PMP DMA CS errata
>
> Command execution error reported.
>
> Sil3124/32 has an errata which worsens PMP error handling quite a bit.
> It's DMA context gets corrupt if a failure occurs when commands are in
> flight to 3 or more commands, so the driver has to abort all commands
> immediately.

Gotcha.

> This is the actual failure. Your 6.02 drive reported media error
> which combined with the controller errata caused port wide failure.

Ah, I see, so that drive is the one for me to focus on. If it hadn't
had an error, nothing else would have gone down the toilet afterwards,
right?

scsi 6:2:0:0: Direct-Access ATA Hitachi HDS72101 GKAO PQ: 0 ANSI: 5
sd 6:2:0:0: [sdj] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)

If it's a media error, shouldn't it show up in the SMART counters?

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K1000
Device Model:     Hitachi HDS721010KLA330
Serial Number:    GTJ000PAG2JLKC
Firmware Version: GKAOA70F
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Tue Nov 17 09:32:47 2009 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       150
  3 Spin_Up_Time            0x0007   105   105   024    Pre-fail  Always       -       662 (Average 662)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       179
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       33
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       18566
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       92
192 Power-Off_Retract_Count 0x0032   061   061   000    Old_age   Always       -       47436
193 Load_Cycle_Count        0x0012   061   061   000    Old_age   Always       -       47436
194 Temperature_Celsius     0x0002   125   125   000    Old_age   Always       -       48 (Lifetime Min/Max 20/63)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       359

> The device gets kicked out of the system so the errors follow. I have
> no idea why ata6.00 decided to stop responding. It might be a
> firmware bug or the PMP is malfunctioning. If this happens again, you
> can verify that by detaching the offending drive from the PMP without
> disconnecting power (the drive stays powered up) and then connect it
> in a different port and see whether it works. If it doesn't, it means
> the firmware on the drive is firmly hung and will require power cycle
> to get working again. Earlier SATA drives and few of recent ones
> sometimes do this after certain failures.

I can't really move it to another PMP port, but I have indeed had
failures that required not just a reboot of my server but an actual
power cycle of the drive.

> Anyways, if my guess is right, the sequence of the event is first the
> drive with bad sector led to EH kicking in abruptly due to controller
> errata, which in turn caused another drive to lock up due to its
> firmware problem.

Ok, so this all sounds like it's a bit fragile due to hardware issues :)

I now have to figure out whether /dev/sdj has a bad sector or not.
Last time this happened, though, I did run
  dd if=/dev/drive of=/dev/null bs=1M
on each of my 5 drives, and it ran clean.

If I had a bad sector, shouldn't it show up in Current_Pending_Sector,
and shouldn't reading the entire drive with dd fail?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems & security ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
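
For anyone wanting to screen the SMART counters quoted earlier in this message without reading the table by eye, here is a minimal sketch, assuming Python 3. The rows in SAMPLE are copied from the smartctl output in this message. The watch list (IDs 5, 196, 197, 198, 199) and the function name nonzero_watched are illustrative choices of mine, not anything smartctl or the kernel defines.

```python
import re

# Rows copied from the smartctl attribute table quoted in this message.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
194 Temperature_Celsius     0x0002   125   125   000    Old_age   Always       -       48 (Lifetime Min/Max 20/63)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       359
"""

# Hypothetical watch list: attribute IDs related to sector and link errors.
WATCHED = {5, 196, 197, 198, 199}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_watched(text):
    """Return {attribute_name: raw_value} for watched attributes with raw != 0."""
    hits = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip headers and anything that is not an attribute row
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(3))
        if attr_id in WATCHED and raw != 0:
            hits[name] = raw
    return hits


if __name__ == "__main__":
    for name, raw in nonzero_watched(SAMPLE).items():
        print(f"{name}: raw value {raw}")
```

On those rows it reports the single reallocation (IDs 5 and 196) and the 359 UDMA CRC errors, while Current_Pending_Sector and Offline_Uncorrectable stay at zero.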