disk crash leads to I/O errors when accessing partition

SUMMARY
After a power outage, an encrypted partition is inaccessible, as is a
regular one, and perhaps the disk as a whole.  Is there a way to
recover, or at least diagnose?  The disk is in an external USB dock.

I'm hoping a reboot might help, but I'd like advice before I do anything
that might cause further damage.

As I investigated, I discovered that the problem extends beyond the
encrypted volume, so this query may be a bit off topic.  On topic: is
it reasonable to expect the encrypted partition to be recoverable in
these circumstances?  I'd appreciate off-topic advice as well :)

Please cc me directly so that I see replies, since my mail system is
broken by this very problem.

DETAILS
The physical disk is a Western Digital WD-20EARS 2 TB SATA 3 Gbps
(5400 RPM) drive mounted in a Unitek Y-1063 SATA HDD docking station
with hub.  It is connected via USB to a Pentium 4 system running
Linux kernel 2.6.26-2-686, Debian GNU/Linux, mostly lenny.

The disk has a GPT with 2 partitions.  The first partition is a spare;
the 2nd, larger one is part of an LVM volume group that includes other
disks.  One logical volume (LV) serves as the raw partition for a LUKS
encrypted device which backs the mail spool.  Another LV serves directly
as a spare backup area.
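
For clarity, the stack from the disk up to the mail spool looks
roughly like this (the raw LV's name is a placeholder; only the
mapper name cyrspool3 appears in the transcripts below):

  /dev/sdc2 (GPT partition, LVM PV)
    -> volume group daisy (also spans other disks)
       -> LV <spool-raw> -> LUKS -> /dev/mapper/cyrspool3 -> reiserfs on /var/spool/cyrus
       -> LV /dev/daisy/bacula-backup -> ext2/3 spare backup area

Read-only commands that would confirm the exact names:

corn:/# pvs; vgs; lvs                  # overview of PV/VG/LV layout
corn:/# cryptsetup status cyrspool3    # which LV backs the LUKS mapping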

The docking station has surge suppression only; the main computer rode
through the brief power failure without shutting down.  Since then I get
I/O errors whenever I attempt to access the encrypted partition.  The
apcupsd timeline:
Wed Feb 16 04:57:47 PST 2011  Power failure.
Wed Feb 16 04:57:53 PST 2011  Running on UPS batteries.
Wed Feb 16 04:58:14 PST 2011  Mains returned. No longer on UPS batteries.
Wed Feb 16 04:58:14 PST 2011  Power is back. UPS running on mains.

This led to the following kernel messages:
Feb 16 04:57:46 corn kernel: [59153.907186] sd 2:0:0:0: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Feb 16 04:57:46 corn kernel: [59153.907186] end_request: I/O error, dev sdc, sector 345805680
Feb 16 04:57:46 corn kernel: [59153.907186] ReiserFS: dm-15: warning: zam-7001: io error in reiserfs_find_entry
Feb 16 04:57:46 corn kernel: [59153.907186] usb 5-3.1: USB disconnect, address 7
Feb 16 04:57:46 corn kernel: [59153.907186] ReiserFS: dm-15: warning: zam-7001: io error in reiserfs_find_entry
[last message repeats a lot.]
Feb 16 04:57:46 corn kernel: [59153.919189] Buffer I/O error on device dm-15, logical block 1591728
Feb 16 04:57:46 corn kernel: [59154.009858] hub 5-3:1.0: hub_port_status failed (err = -71)
Feb 16 04:57:46 corn kernel: [59154.009865] hub 5-3:1.0: connect-debounce failed, port 1 disabled
Feb 16 04:57:46 corn kernel: [59154.010093] hub 5-3:1.0: cannot disable port 1 (err = -71)
Feb 16 04:57:46 corn kernel: [59154.010108] usb 5-3: USB disconnect, address 4
Feb 16 04:57:46 corn kernel: [59154.010111] usb 5-3.2: USB disconnect, address 8
Feb 16 04:57:46 corn kernel: [59154.010308] usblp0: removed
Feb 16 04:57:46 corn cyrus/master[1966]: process 12320 exited, signaled to death by 7
Feb 16 04:57:47 corn kernel: [59154.227754] usb 5-4: USB disconnect, address 5
Feb 16 04:57:47 corn apcupsd[9495]: Power failure.
Feb 16 04:57:47 corn chipcardd[9196]: devicemanager.c: 3373: Changes in hardware list
Feb 16 04:57:49 corn kernel: [59157.066031] Buffer I/O error on device dm-15, logical block 7955
Feb 16 04:57:49 corn kernel: [59157.066031] lost page write due to I/O error on dm-15
Feb 16 04:57:49 corn kernel: [59157.066031] Buffer I/O error on device dm-15, logical block 7956
Feb 16 04:57:49 corn kernel: [59157.066031] lost page write due to I/O error on dm-15
Feb 16 04:57:49 corn kernel: [59157.067067] Buffer I/O error on device dm-15, logical block 7957
Feb 16 04:57:49 corn kernel: [59157.067072] lost page write due to I/O error on dm-15
Feb 16 04:57:49 corn kernel: [59157.067086] Buffer I/O error on device dm-15, logical block 7958
Feb 16 04:57:49 corn kernel: [59157.067090] lost page write due to I/O error on dm-15
Feb 16 04:57:49 corn kernel: [59157.067098] Buffer I/O error on device dm-15, logical block 7959
Feb 16 04:57:49 corn kernel: [59157.067103] lost page write due to I/O error on dm-15
Feb 16 04:57:49 corn kernel: [59157.067116] Buffer I/O error on device dm-15, logical block 7960
Feb 16 04:57:49 corn kernel: [59157.067120] lost page write due to I/O error on dm-15
Feb 16 04:57:49 corn kernel: [59157.067128] Buffer I/O error on device dm-15, logical block 7961
Feb 16 04:57:49 corn kernel: [59157.067132] lost page write due to I/O error on dm-15
Feb 16 04:57:49 corn kernel: [59157.067140] Buffer I/O error on device dm-15, logical block 7962
Feb 16 04:57:49 corn kernel: [59157.067144] lost page write due to I/O error on dm-15
Feb 16 04:57:49 corn kernel: [59157.073972] REISERFS: abort (device dm-15): Journal write error in flush_commit_list
Feb 16 04:57:49 corn kernel: [59157.073972] REISERFS: Aborting journal for filesystem on dm-15
Feb 16 04:57:53 corn apcupsd[9495]: Running on UPS batteries.

Similar errors repeat frequently throughout the day; they did not appear
before the power failure.
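
Given the USB disconnect messages above, my guess is that the dock
dropped off the bus and re-enumerated under a new device name.  A
read-only way to check what the kernel currently exposes (sdd is my
guess for the new name; see pvscan below):

corn:/# grep 'sd[cd]' /proc/partitions   # is sdc still there, or only sdd?
corn:/# dmesg | tail -50                 # look for the dock re-attaching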

Diagnostic attempts:
corn:/# date; /etc/init.d/cyrus2.2 stop   # uses the encrypted partition
Wed Feb 16 20:55:53 PST 2011
Stopping Cyrus IMAPd: cyrmaster.
corn:/# umount /var/spool/cyrus/
corn:/# # note it is mounted on top of crypto
corn:/# fsck.reiserfs --check /dev/mapper/cyrspool3 
reiserfsck 3.6.19 (2003 www.namesys.com)

Will read-only check consistency of the filesystem
on /dev/mapper/cyrspool3
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you
do):Yes

The problem has occurred looks like a hardware problem. If you have
bad blocks, we advise you to get a new hard drive, because once you
get one bad block  that the disk  drive internals  cannot hide from
your sight,the chances of getting more are generally said to become
much higher  (precise statistics are unknown to us), and  this disk
drive is probably not expensive enough  for you to you to risk your
time and  data on it.  If you don't want to follow that follow that
advice then  if you have just a few bad blocks,  try writing to the
bad blocks  and see if the drive remaps  the bad blocks (that means
it takes a block  it has  in reserve  and allocates  it for use for
of that block number).  If it cannot remap the block,  use badblock
option (-B) with  reiserfs utils to handle this block correctly.

bread: Cannot read the block (2): (Input/output error).

Aborted
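
Before trying anything riskier, a lower-level read-only check would be
whether the LUKS header on the raw LV is still readable (the LV path
here is a placeholder, as above):

corn:/# cryptsetup isLuks /dev/daisy/<spool-raw> && echo header readable
corn:/# cryptsetup luksDump /dev/daisy/<spool-raw>

If luksDump works, the header, and with it the chance of recovery once
the I/O problem is solved, is presumably intact.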

# next device is an LVM logical volume, unencrypted
corn:/# umount /dev/daisy/bacula-backup
corn:/# date; e2fsck /dev/daisy/bacula-backup 
Wed Feb 16 21:55:05 PST 2011
e2fsck 1.41.3 (12-Oct-2008)
e2fsck: Attempt to read block from filesystem resulted in short read
while trying to open /dev/daisy/bacula-backup
Could this be a zero-length partition?
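
To separate filesystem damage from the device being plainly unreadable,
a small direct read of the LV should tell (read-only, so it shouldn't
cause further damage):

corn:/# dd if=/dev/daisy/bacula-backup of=/dev/null bs=4096 count=16

If this also fails with an I/O error, the problem is below the
filesystem layer.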

# finally, try the whole disk
# The volume group that includes it is still active
# although I've dismounted the 2 LVs based on the PV.
corn:/# fdisk /dev/sdc

Unable to open /dev/sdc

pvscan does not list an sdc, but it does show an sdd, which can only be
the external drive.
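
If sdd really is the dock after a USB re-enumeration, its PV UUID
should match the one the volume group expects.  Read-only ways to
confirm (assuming the PV is on the second partition, i.e. sdd2):

corn:/# pvdisplay /dev/sdd2    # PV UUID and VG name
corn:/# blkid /dev/sdd2        # should report an LVM2 member
corn:/# vgdisplay daisy        # does the VG think a PV is missing?

If they match, stale sdc-based device-mapper tables would explain the
persistent I/O errors, which is why I'm hoping a reboot (or
deactivating and reactivating the VG) might help; hence my request for
advice before doing either.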

BACKGROUND
The cyrus spool was on sdb originally; it developed hardware problems.
I'm out of space in the case, and plugs on the UPS, and so I'm migrating
to sdc which is mounted externally w/o UPS.  A spare copy of my backups
are also on sdc, though my main backups are not.  I believe I could
recover the mail spool as of c 5 hours before the power failure if
necessary.
