Re: MD/RAID time out writing superblock

Mark Lord <liml@xxxxxx> · Mon, 14 Sep 2009 10:25:16 -0400

Tejun Heo wrote:
> Mark Lord wrote:
>> Tejun Heo wrote:
>>> ..
>>> Oooh, another possibility is the above continuous IDENTIFY tries.
>>> Doing things like that generally isn't a good idea because vendors
>>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>>> firmwares aren't tested against that.  Even smart commands sometimes
>>> cause problems.  So, finding out the thing which is obsessed with the
>>> identity of the drive and stopping it might help.
>>> ..
>>
>> Bullpucky.  That sort of thing, specifically with IDENTIFY,
>> has never been an issue.
>
> With SMART it has.  I wouldn't be too surprised if some new firmware
> chokes on repeated IDENTIFY mixed with stream of NCQ commands.  It's
> just not something people (including vendors) do regularly.
> ..

Yeah, some drives really don't like SMART commands (hddtemp & smartctl).
That's a strange one, too, because the whole idea of SMART
is that it gets used to periodically monitor drive health.

IDENTIFY is much safer -- usually no media access after initial spin-up,
and lots of things exercise it quite regularly.

Pretty much any hdparm command triggers an IDENTIFY beforehand now,
and hddtemp and smartctl both use it too.

I suspect we're missing some info from this specific failure.
Looking back at Chris's earlier posting, the whole thing started
with a FLUSH_CACHE_EXT failure.  Once that happens, all bets are
off on anything that follows.

Everything will be running fine when suddenly:

  ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
  ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
          res 40/00:00:80:17:91/00:00:37:00:00/40 Emask 0x4 (timeout)
  ata1.00: status: { DRDY }
  ata1: hard resetting link
  ata1: softreset failed (device not ready)
  ata1: hard resetting link
  ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  ata1.00: configured for UDMA/133
  ata1: EH complete
  end_request: I/O error, dev sda, sector 1465147272
  md: super_written gets error=-5, uptodate=0
  raid10: Disk failure on sda3, disabling device.
  raid10: Operation continuing on 5 devices.
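
For anyone reading the log above without the ATA opcode table in their head, here is a rough Python sketch that picks apart the "cmd"/"res" lines. It assumes the usual libata layout, where the first byte of "cmd" is the command opcode and the first byte of "res" is the status register. The helper names (parse_taskfile, opcode_name, status_flags) are made up for illustration, and the opcode table is deliberately partial.

```python
import errno
import re

# Partial opcode table: only the commands mentioned in this thread.
ATA_OPCODES = {
    0xB0: "SMART",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
    0xEC: "IDENTIFY DEVICE",
}

# ATA status register bits, highest bit first.
ATA_STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x08, "DRQ"),
    (0x01, "ERR"),
]

# Matches "cmd xx/xx:xx:xx:xx:xx/xx:xx:xx:xx:xx/xx" and the same shape for "res".
TASKFILE_RE = re.compile(
    r"\b(cmd|res) ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)


def parse_taskfile(line):
    """Split one libata cmd/res line into integer fields (hypothetical helper)."""
    m = TASKFILE_RE.search(line)
    if m is None:
        raise ValueError("no cmd/res taskfile found in line")
    kind, first, low, high, device = m.groups()
    return {
        "kind": kind,                # "cmd" or "res"
        "first": int(first, 16),     # opcode for cmd, status register for res
        "low": [int(b, 16) for b in low.split(":")],
        "high": [int(b, 16) for b in high.split(":")],
        "device": int(device, 16),
    }


def opcode_name(tf):
    """Name the opcode of a parsed "cmd" line, if it is in the partial table."""
    return ATA_OPCODES.get(tf["first"], "unknown (0x%02x)" % tf["first"])


def status_flags(status):
    """List the status bits that are set in a status register value."""
    return [name for bit, name in ATA_STATUS_BITS if status & bit]


if __name__ == "__main__":
    cmd = parse_taskfile("ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0")
    res = parse_taskfile("res 40/00:00:80:17:91/00:00:37:00:00/40 Emask 0x4 (timeout)")
    print(opcode_name(cmd))            # FLUSH CACHE EXT
    print(status_flags(res["first"]))  # ['DRDY']
    print(errno.errorcode[5])          # EIO
```

The decoded status agrees with the "status: { DRDY }" line in the log, and the error=-5 that md reports from super_written is -EIO.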
