Re: CEPH over SW-RAID

On 23/11/2015 21:01, Jose Tavares wrote:

> My new question regarding Ceph is whether it isolates these bad sectors where it found bad data when scrubbing, or will there always be a replica of something sitting on a known bad block?
Ceph OSDs don't know about bad sectors; they delegate IO to the filesystems beneath them. Some filesystems can recover from corrupted data on one drive (ZFS or Btrfs when using redundancy at their level), and the same filesystems will refuse to hand data to the Ceph OSD when they detect corruption on a non-redundant setup. Ceph notices this (usually during scrubs), and a manual Ceph repair will then rewrite the corrupted data (at that point, if the underlying drive detected a bad sector, it will not reuse it).
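
(Side note, not specific to your setup: on recent releases, acting on what scrubbing flagged looks roughly like the following, with 2.5 standing in for the affected PG id.)

    # list the PGs that scrubbing marked inconsistent
    ceph health detail | grep inconsistent

    # inspect which objects/shards are affected (newer releases only)
    rados list-inconsistent-obj 2.5 --format=json-pretty

    # tell the primary OSD to rewrite the bad copy from a good replica
    ceph pg repair 2.5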

> Just forget about the hardware bad block remap list. It got filled as soon as we started to use the drive .. :)

Then you can move this drive to the trash pile/ask for a replacement. It is basically unusable.

> Why?
> 1 (or more) out of the 8 drives I see has the remap list full ...
> If you isolate the rest using software, you can continue to use the drive .. There are no performance issues, etc ..


Ceph currently uses filesystems to store its data. As there is no supported filesystem or software layer that handles bad blocks dynamically, you *will* get OSD filesystems remounted read-only and OSD failures as soon as you hit one misbehaving sector (and if the drives have already emptied their reserve, you are almost guaranteed to get new defective sectors later, see below). If your bad drives are distributed over your whole cluster, the odds of simultaneous failures and of degraded or inactive PGs (which freeze any IO to them) rise sharply. You will then have to manually bring these OSDs back online to recover (unfreeze IO). If you don't succeed because the drives have failed to the point where the OSD content can't be recovered, you will simply lose data.
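
(Rough illustration of that recovery dance, assuming systemd-managed OSDs; osd.12 is a placeholder, not anything from your cluster.)

    # spot down OSDs and degraded/inactive PGs
    ceph status
    ceph osd tree | grep down

    # check whether the OSD's filesystem was remounted read-only after an IO error
    dmesg | grep -iE 'remounting.*read-only|I/O error'

    # if the filesystem still looks usable, try to bring the OSD back
    systemctl restart ceph-osd@12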

From what I can read here, the main filesystems for Ceph are XFS, Btrfs and ext4, with some people using ZFS. Of those four, only ext4 supports manually setting bad blocks on an unmounted filesystem. If you don't have the precise offset of each one, you'll have to scan the whole device (e2fsck -c) before e2fsck can *try* to put your filesystem back in shape after any bad block is detected. You will have to be very careful to remove any file using a bad block before restarting the OSD, to avoid corrupting data (hopefully e2fsck will have moved them to lost+found for you). You might still not be able to restart the OSD, depending on which files are actually missing.
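
(A minimal sketch of that ext4 procedure, with placeholder OSD id and device; I wouldn't rely on it in production.)

    # stop the OSD and unmount its data filesystem
    systemctl stop ceph-osd@12
    umount /var/lib/ceph/osd/ceph-12

    # read-only scan of the whole device, recording hits in the bad block inode
    e2fsck -f -c /dev/sdb1        # -cc would run a non-destructive read-write test instead

    # after it finishes, check what ended up in lost+found before restarting the OSD
    mount /dev/sdb1 /var/lib/ceph/osd/ceph-12
    ls /var/lib/ceph/osd/ceph-12/lost+found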

Finally, at least XFS and Btrfs have no support for bad blocks at all AFAIK, so you simply can't use such drives with these two filesystems without the filesystems failing and fsck being unable to help. MD RAID won't save you either: its bad block handling isn't designed to keep a drive that keeps growing new defective sectors in service.

The fact that bad block support is almost non-existent is easy to understand from history. Only old filesystems, designed when drives had no internal reserve to handle bad blocks transparently and bad sectors were a normal occurrence, still keep tabs on them (ext4 got it from ext2, vfat/fat32 has it too, ...). Today, a disk drive that starts reporting bad sectors on reads has emptied its reserve, so it already has a long history of bad sectors. It isn't failing one sector, it is in the process of failing thousands of them, and there is no reason to expect it to behave correctly anymore. The application layers above (md, lvm, filesystems, ...) simply don't try to fight a battle that can't be won and that would only add complexity and hurt performance on a healthy drive.
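
(If you want to see how far gone a drive already is, something like this works; /dev/sdb is a placeholder and attribute names vary a bit between vendors.)

    # reserve already consumed vs. sectors currently failing reads
    smartctl -A /dev/sdb | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

    # run a long surface scan in the background, read the result later with 'smartctl -l selftest'
    smartctl -t long /dev/sdb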

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
