Re: disk crash leads to I/O errors when accessing partition [FIXED]

Ross Boylan <ross@xxxxxxxxxxxxxxxx> · Thu, 17 Feb 2011 00:05:44 -0800

I'm happy to report that rebooting seemed to clear things up.  There
were some incomplete transactions in both partitions, but fsck is clean
on both now.

I'm still curious about one issue: if a disk crashes, is it reasonable
to expect to be able to recover an encrypted device that had that disk
(or part of the disk) underneath it?

I'm assuming one isn't, e.g., writing to LUKS headers at the time of the
crash.
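
My understanding is that dm-crypt encrypts each sector independently, so the LUKS header is the one piece whose loss would take everything with it. Here is a minimal sketch of a read-only check that the header still starts with the LUKS magic, assuming the LUKS1 on-disk layout (6-byte magic, then a 16-bit big-endian version). The function names are my own, and a matching magic says nothing about whether the key slots are intact.

```python
import struct

LUKS_MAGIC = b"LUKS\xba\xbe"  # 6-byte magic at offset 0 of a LUKS header


def parse_luks_magic(first_bytes):
    """Return the LUKS version if the magic matches, else None."""
    if len(first_bytes) < 8 or first_bytes[:6] != LUKS_MAGIC:
        return None
    # The 16-bit version field follows the magic, stored big-endian.
    (version,) = struct.unpack(">H", first_bytes[6:8])
    return version


def check_device(path):
    """Read-only peek at the first 8 bytes of a device or image file."""
    with open(path, "rb") as dev:
        return parse_luks_magic(dev.read(8))
```

One would call check_device() on the LV underneath the mapping, not on /dev/mapper/cyrspool3 itself. Where the installed cryptsetup provides luksHeaderBackup, keeping a copy of the header elsewhere seems like the more robust precaution.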

Ross

On Wed, 2011-02-16 at 22:35 -0800, Ross Boylan wrote:
> SUMMARY
> After a power outage an encrypted partition is inaccessible, as is a
> regular one, and perhaps the disk as a whole.  Is there a way to
> recover, or at least diagnose?  The disk is in an external USB dock.
> 
> I'm hoping a reboot might help, but I'd like advice before I do anything
> that might cause further damage.
> 
> As I investigated, I discovered that the problem extends beyond the
> encrypted volume, so this query may be a bit off topic.  On topic: is it
> reasonable to expect the encrypted partition to be recoverable in these
> circumstances?  I'd appreciate off-topic advice as well :)
> 
> Please cc me directly; that should help me see replies even though my
> mail system is broken because of this problem.
> 
> DETAILS
> The physical disk is a Western Digital WD-20EARS 2TB SATA 3 Gb/s (5400
> RPM) mounted in a Unitek SATA HDD Docking Station with Hub Y-1063.  It is
> connected via USB to a Pentium 4 system running Linux kernel
> 2.6.26-2-686, Debian GNU/Linux, mostly lenny.
> 
> The disk has 2 partitions with a GPT.  The first partition is a spare;
> the 2nd, larger, one is part of an LVM volume group that includes other
> disks.  One logical volume (LV) serves as the raw partition for a
> LUKS-encrypted device which backs the mail spool.  Another LV serves
> directly as a spare backup area.
> 
> The docking station has surge suppression only; the main computer went
> through the brief power failure without shutting down.  Since then I get
> I/O errors when I attempt to access the encrypted partition:
> Wed Feb 16 04:57:47 PST 2011  Power failure.
> Wed Feb 16 04:57:53 PST 2011  Running on UPS batteries.
> Wed Feb 16 04:58:14 PST 2011  Mains returned. No longer on UPS batteries.
> Wed Feb 16 04:58:14 PST 2011  Power is back. UPS running on mains.
> 
> led to
> Feb 16 04:57:46 corn kernel: [59153.907186] sd 2:0:0:0: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> Feb 16 04:57:46 corn kernel: [59153.907186] end_request: I/O error, dev sdc, sector 345805680
> Feb 16 04:57:46 corn kernel: [59153.907186] ReiserFS: dm-15: warning: zam-7001: io error in reiserfs_find_entry
> Feb 16 04:57:46 corn kernel: [59153.907186] usb 5-3.1: USB disconnect, address 7
> Feb 16 04:57:46 corn kernel: [59153.907186] ReiserFS: dm-15: warning: zam-7001: io error in reiserfs_find_entry
> [last message repeats a lot.]
> Feb 16 04:57:46 corn kernel: [59153.919189] Buffer I/O error on device dm-15, logical block 1591728
> Feb 16 04:57:46 corn kernel: [59154.009858] hub 5-3:1.0: hub_port_status failed (err = -71)
> Feb 16 04:57:46 corn kernel: [59154.009865] hub 5-3:1.0: connect-debounce failed, port 1 disabled
> Feb 16 04:57:46 corn kernel: [59154.010093] hub 5-3:1.0: cannot disable port 1 (err = -71)
> Feb 16 04:57:46 corn kernel: [59154.010108] usb 5-3: USB disconnect, address 4
> Feb 16 04:57:46 corn kernel: [59154.010111] usb 5-3.2: USB disconnect, address 8
> Feb 16 04:57:46 corn kernel: [59154.010308] usblp0: removed
> Feb 16 04:57:46 corn cyrus/master[1966]: process 12320 exited, signaled to death by 7
> Feb 16 04:57:47 corn kernel: [59154.227754] usb 5-4: USB disconnect, address 5
> Feb 16 04:57:47 corn apcupsd[9495]: Power failure.
> Feb 16 04:57:47 corn chipcardd[9196]: devicemanager.c: 3373: Changes in hardware list
> Feb 16 04:57:49 corn kernel: [59157.066031] Buffer I/O error on device dm-15, logical block 7955
> Feb 16 04:57:49 corn kernel: [59157.066031] lost page write due to I/O error on dm-15
> Feb 16 04:57:49 corn kernel: [59157.066031] Buffer I/O error on device dm-15, logical block 7956
> Feb 16 04:57:49 corn kernel: [59157.066031] lost page write due to I/O error on dm-15
> Feb 16 04:57:49 corn kernel: [59157.067067] Buffer I/O error on device dm-15, logical block 7957
> Feb 16 04:57:49 corn kernel: [59157.067072] lost page write due to I/O error on dm-15
> Feb 16 04:57:49 corn kernel: [59157.067086] Buffer I/O error on device dm-15, logical block 7958
> Feb 16 04:57:49 corn kernel: [59157.067090] lost page write due to I/O error on dm-15
> Feb 16 04:57:49 corn kernel: [59157.067098] Buffer I/O error on device dm-15, logical block 7959
> Feb 16 04:57:49 corn kernel: [59157.067103] lost page write due to I/O error on dm-15
> Feb 16 04:57:49 corn kernel: [59157.067116] Buffer I/O error on device dm-15, logical block 7960
> Feb 16 04:57:49 corn kernel: [59157.067120] lost page write due to I/O error on dm-15
> Feb 16 04:57:49 corn kernel: [59157.067128] Buffer I/O error on device dm-15, logical block 7961
> Feb 16 04:57:49 corn kernel: [59157.067132] lost page write due to I/O error on dm-15
> Feb 16 04:57:49 corn kernel: [59157.067140] Buffer I/O error on device dm-15, logical block 7962
> Feb 16 04:57:49 corn kernel: [59157.067144] lost page write due to I/O error on dm-15
> Feb 16 04:57:49 corn kernel: [59157.073972] REISERFS: abort (device dm-15): Journal write error in flush_commit_list
> Feb 16 04:57:49 corn kernel: [59157.073972] REISERFS: Aborting journal for filesystem on dm-15
> Feb 16 04:57:53 corn apcupsd[9495]: Running on UPS batteries.
> 
> Similar errors repeat frequently throughout the day; they did not appear
> before the power failure.
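
To compare how often these errors occur per device over the day, a rough tally can help. This is only a sketch: the two patterns cover just the "Buffer I/O error" and "end_request" message shapes seen in the log above, and the function name is my own.

```python
import re
from collections import Counter

# The two kernel message shapes that name a failing device in the log.
BUFFER_ERR = re.compile(r"Buffer I/O error on device ([\w-]+), logical block \d+")
REQUEST_ERR = re.compile(r"end_request: I/O error, dev ([\w-]+), sector \d+")


def count_io_errors(lines):
    """Count I/O error messages per device name."""
    counts = Counter()
    for line in lines:
        for pattern in (BUFFER_ERR, REQUEST_ERR):
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # count each log line at most once
    return counts
```

Feeding it an open log file (on Debian I would expect /var/log/kern.log, but that path is an assumption) gives one count for sdc and one for dm-15, which shows whether the errors stop at the device-mapper layer or reach the disk itself.
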
> 
> Diagnostic attempts:
> corn:/# date; /etc/init.d/cyrus2.2 stop   # uses the encrypted partition
> Wed Feb 16 20:55:53 PST 2011
> Stopping Cyrus IMAPd: cyrmaster.
> corn:/# umount /var/spool/cyrus/
> corn:/# # note it is mounted on top of crypto
> corn:/# fsck.reiserfs --check /dev/mapper/cyrspool3 
> reiserfsck 3.6.19 (2003 www.namesys.com)
> 
> Will read-only check consistency of the filesystem
> on /dev/mapper/cyrspool3
> Will put log info to 'stdout'
> 
> Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
> 
> The problem has occurred looks like a hardware problem. If you have
> bad blocks, we advise you to get a new hard drive, because once you
> get one bad block  that the disk  drive internals  cannot hide from
> your sight,the chances of getting more are generally said to become
> much higher  (precise statistics are unknown to us), and  this disk
> drive is probably not expensive enough  for you to you to risk your
> time and  data on it.  If you don't want to follow that follow that
> advice then  if you have just a few bad blocks,  try writing to the
> bad blocks  and see if the drive remaps  the bad blocks (that means
> it takes a block  it has  in reserve  and allocates  it for use for
> of that block number).  If it cannot remap the block,  use badblock
> option (-B) with  reiserfs utils to handle this block correctly.
> 
> bread: Cannot read the block (2): (Input/output error).
> 
> Aborted
> 
> # next device is an LVM logical volume, unencrypted
> corn:/# umount /dev/daisy/bacula-backup
> corn:/# date; e2fsck /dev/daisy/bacula-backup 
> Wed Feb 16 21:55:05 PST 2011
> e2fsck 1.41.3 (12-Oct-2008)
> e2fsck: Attempt to read block from filesystem resulted in short read while trying to open /dev/daisy/bacula-backup
> Could this be a zero-length partition?
> 
> # finally, try the whole disk
> # The volume group that includes it is still active
> # although I've unmounted the 2 LVs based on the PV.
> corn:/# fdisk /dev/sdc
> 
> Unable to open /dev/sdc
> 
> pvscan does not list an sdc, but does show an sdd which can only be the
> external drive.
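
Since the kernel name apparently moved from sdc to sdd when the dock reconnected, the persistent symlinks that udev keeps under /dev/disk/by-id are a steadier way to find the drive. A small sketch that lists them follows. The function name is mine, and the directory is a parameter only so that the code can be tried against a scratch directory.

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each persistent symlink name to the device node it points at."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if os.path.islink(link):
            # realpath follows the symlink to the current kernel name.
            mapping[name] = os.path.realpath(link)
    return mapping
```

Printing the result before and after a reconnect shows which kernel name the external drive currently has, without relying on enumeration order.
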
> 
> BACKGROUND
> The cyrus spool was originally on sdb, which developed hardware problems.
> I'm out of space in the case and out of plugs on the UPS, so I'm migrating
> to sdc, which is mounted externally without UPS protection.  A spare copy
> of my backups is also on sdc, though my main backups are not.  I believe I
> could recover the mail spool as of about 5 hours before the power failure
> if necessary.
> 

_______________________________________________
dm-crypt mailing list
dm-crypt@xxxxxxxx
http://www.saout.de/mailman/listinfo/dm-crypt