Re: Infrequent soft reset of ata for silicon image 3512 cards

Tejun Heo <tj@xxxxxxxxxx> · Fri, 01 Aug 2008 13:14:40 +0900

Sagar Borikar wrote:
> I hope this is the right list for the following questions; if not,
> please direct me to the correct one.
> 
> Currently I am working with a NAS box which has the following
> configuration:
> 
> MIPS arch
> 2.6.18 kernel - comparatively older but box is in production

Ah... it's a bit too old at this point.

> 128 MB RAM
> sil 3512 SATA controller
> xfs file system
> 
> When performing the iozone stress test of the box over CIFS and NFS
> simultaneously, I find that the ata port gets soft reset once every
> 5-8 hours, and because of this the continuous write activity on the
> drives gets stalled. All the smbd processes which are writing data
> to the disk go into uninterruptible sleep state continuously and the
> test doesn't complete.
> 
> Following is the log that I get :
> 
> ata1: soft resetting port
> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> ata1.00: configured for UDMA/100
> ata1: EH complete
> SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
> sda: Write Protect is off
> SCSI device sda: drive cache: write back

These only report the actions taken by EH to recover from an error
condition.  Is there any message before this?
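
If the full kernel log has been saved to a file, the lines leading up
to each reset can be pulled out mechanically.  A minimal sketch in
Python follows; the function name, the context size and the
"placeholder" sample lines are illustrative assumptions, not real
kernel output.  Only the "soft resetting port" and "SATA link up"
lines come from the report above.

```python
from collections import deque


def lines_before_reset(log_lines, marker="soft resetting port", context=5):
    """Yield (reset_line, preceding_lines) for each line containing marker.

    preceding_lines holds up to `context` lines seen just before the match.
    """
    recent = deque(maxlen=context)
    for line in log_lines:
        line = line.rstrip("\n")
        if marker in line:
            yield line, list(recent)
        recent.append(line)


if __name__ == "__main__":
    # Placeholder lines stand in for whatever the real log contains.
    sample = [
        "placeholder: earlier kernel message 1",
        "placeholder: earlier kernel message 2",
        "ata1: soft resetting port",
        "ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
    ]
    for reset, before in lines_before_reset(sample, context=2):
        print(reset)
        for entry in before:
            print("  before:", entry)
```

To run it against a real log, pass an open file object (for example
the saved output of dmesg) in place of the sample list.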

> After this, I start getting errors from file system :
> 
> can't seek in filesystem at bb 10686861057857128
> can't read btree block 1630685585/1000141
> can't seek in filesystem at bb 8951363201349912
> can't read btree block 1365869628/911139
> can't seek in filesystem at bb 5768064121399776
> can't read btree block 880136736/1043772
> 
> It looks like the filesystem is trying to read blocks which are not
> present in the partition, and because of this the device driver
> complains that it is trying to access data beyond the end of the
> device.
> 
> So I guess there is filesystem corruption too which can be solved
> independently but ata1 getting soft reset under load is something
> strange. Has anyone observed this before with silicon image 3512
> cards?

Yeah, it looks like fs corruption.  There have been a few reports of
data corruption on 3512 when combined with certain chipsets but they
didn't involve time outs or any other error conditions.
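
As a quick sanity check on the numbers in the report, the reported
block addresses can be compared with the size of the device.  This is
only a sketch: it assumes the "bb" values are in 512-byte units and
compares them with the whole disk rather than the partition (which can
only be smaller).  The constant and function names are made up for
illustration.

```python
SECTOR_BYTES = 512

# From "SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)".
DEVICE_SECTORS = 488397168

# "can't seek in filesystem at bb ..." values from the report.
REPORTED_BB = [
    10686861057857128,
    8951363201349912,
    5768064121399776,
]


def beyond_device(bb, device_sectors=DEVICE_SECTORS):
    """Return True if block address bb lies past the end of the device.

    Assumes bb is expressed in the same 512-byte units as device_sectors.
    """
    return bb >= device_sectors


if __name__ == "__main__":
    size_mb = DEVICE_SECTORS * SECTOR_BYTES // 10**6
    print("device size: %d MB" % size_mb)
    for bb in REPORTED_BB:
        status = "beyond end of device" if beyond_device(bb) else "in range"
        print("bb %d: %s" % (bb, status))
```

Under those assumptions every reported address is many orders of
magnitude past the end of the disk, which fits the picture of garbage
being read back as metadata rather than a slightly off pointer.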

One common way to trigger data corruption is to briefly disconnect power
and reapply it.  All the data in the cache will get lost and the driver
has no way to know whether it lost any data or not, so all hell breaks
loose.  Similar situations do occur on running systems if the power
supply can't maintain voltage for whatever reason.  Things like this
usually occur when a harddrive is plugged in (as the new one draws
power to spin up, the existing ones suffer a voltage drop) but I've seen
it happen without such an event under heavy IO load.

Ruling it out is easy.  Just prepare a separate power supply, connect
the harddrive (only the harddrive) to it and see whether the problem
disappears.  You can power up an ATX PSU without a motherboard easily.

  http://modtown.co.uk/mt/article2.php?id=psumod

-- 
tejun