Bad Blocks

Dyweni - Ceph-Devel <YS3fpFE2ykfB@xxxxxxxxxx> · Wed, 20 Mar 2013 13:55:49 -0500

Hi All,

I would like to understand how Ceph handles and recovers from bad 
blocks.  Would someone mind explaining this to me?  It wasn't very 
apparent from the docs.

My ultimate goal is to be able to get some extra life out of my disks 
after I detect that they may be failing.  (I'm talking about disks 
that have a small number of bad blocks but otherwise seem fine and 
still perform well.)

Here's what I've put together:

1. BBR Hardware
    - All hard disks come with a set number of blocks that are reserved 
for remapping of failed blocks.  This is handled transparently by the 
hard disk.  The hard disk may not begin reporting failed blocks until 
all the reserved blocks are used up.

2. BBR Device Mapper Target
    - Back in the EVMS days, IBM wrote a kernel module (dm-bbr) and an 
EVMS plugin to manage that kernel module.  I have updated that kernel 
module to work with the 3.6.11 kernel.  I have also rewritten some 
portions of the EVMS plugin as a standalone bash script that lets me 
initialize the BBR layer and start the BBR device mapper target on that 
layer.  (So far it seems to run fine, but it requires more testing.)

3. BTRFS
    - I've read that BTRFS can perform data scrubbing and repair 
damaged files from redundant copies.

4. CEPH
    - I've read that CEPH can perform a deep scrub to find damaged 
copies.  I assume that, given the distributed nature of CEPH, it can 
repair the damaged copy from the copies held by other OSDs.
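
To make the repair idea in item 4 concrete, here is a minimal Python 
sketch of scrub-and-repair across replicas.  It only illustrates the 
general technique (checksum every copy, then overwrite the copies that 
disagree).  It is not Ceph's implementation: the majority vote is an 
assumption made for this example, and the names scrub and replicas are 
hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def digest(data: bytes) -> str:
    """Checksum used to compare copies of the same object."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: Dict[str, bytes]) -> Tuple[bytes, List[str]]:
    """Repair minority copies in place; return (good copy, repaired names).

    Assumption for this sketch: the copy held by a strict majority of
    replicas is treated as the correct one.
    """
    counts = Counter(digest(d) for d in replicas.values())
    good_digest, votes = counts.most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise ValueError("no majority copy; cannot repair automatically")

    good = next(d for d in replicas.values() if digest(d) == good_digest)
    damaged = [name for name, d in replicas.items()
               if digest(d) != good_digest]
    for name in damaged:
        replicas[name] = good  # overwrite the damaged copy
    return good, damaged


# Three hypothetical OSDs, one of which holds a corrupted copy.
replicas = {"osd.0": b"hello", "osd.1": b"hellp", "osd.2": b"hello"}
good, damaged = scrub(replicas)
print(damaged)  # ['osd.1']
```

Note that a repair like this rewrites the object but says nothing about 
where on the disk the new copy lands, which is exactly the open question 
below.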

One thing I am not clear on: when BTRFS or CEPH finds damaged data, 
what does it do to prevent new data from being written to the same area?

Also, I'm wondering if any parts of my layered approach are redundant / 
unnecessary...  For instance, if BTRFS marks the block bad internally, 
then perhaps the BBR DM Target isn't needed...

In my testing recently, I had the following setup:
  Disk -> DM-Crypt -> DM-BBR -> BTRFS -> OSD

When the OSD hit a bad block, the DM-BBR target successfully remapped 
it to one of its own reserved blocks, BTRFS then reported data 
corruption, and the OSD daemon crashed.
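
One possible reading of that sequence is that relocating a block does 
not bring back the data that was stored on it.  The toy Python model 
below sketches that idea.  It is not dm-bbr's code or on-disk format. 
The class RemapLayer, the zero-filled spare blocks, and the remap-on-read 
behaviour are all assumptions made for illustration.

```python
import hashlib


class RemapLayer:
    """Toy block device with a pool of spare blocks (hypothetical model)."""

    def __init__(self, blocks, spares):
        total = blocks + spares
        self.data = {i: b"\0" for i in range(total)}   # physical blocks
        self.free_spares = list(range(blocks, total))  # unused spares
        self.remap = {}   # logical block -> spare block
        self.bad = set()  # physical blocks that fail on access

    def _physical(self, lba):
        return self.remap.get(lba, lba)

    def write(self, lba, payload):
        self.data[self._physical(lba)] = payload

    def read(self, lba):
        phys = self._physical(lba)
        if phys in self.bad:
            # Assumption: the old contents are unrecoverable, so the
            # caller gets whatever the fresh spare block holds.
            spare = self.free_spares.pop(0)
            self.remap[lba] = spare
            return self.data[spare]
        return self.data[phys]


dev = RemapLayer(blocks=8, spares=2)
payload = b"osd object data"
dev.write(3, payload)
stored_sum = hashlib.sha256(payload).digest()  # kept by the filesystem

dev.bad.add(3)       # the medium under block 3 fails
after = dev.read(3)  # triggers the remap to a spare block

remapped_ok = dev.remap == {3: 8}
checksum_ok = hashlib.sha256(after).digest() == stored_sum
print(remapped_ok, checksum_ok)  # True False
```

In this model the remap itself succeeds, yet a checksumming layer above 
it still sees a mismatch, so the lost data has to come from a redundant 
copy elsewhere.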

--
Thanks,
Dyweni