Re: CEPH over SW-RAID


 



On 23/11/2015 19:58, Jose Tavares wrote:


On Mon, Nov 23, 2015 at 4:15 PM, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
Hi,

On 23/11/2015 18:37, Jose Tavares wrote:
> Yes, but with SW-RAID, when a block is read and does not match its checksum, the device falls out of the array

I don't think so. Under normal circumstances a device only falls out of an md array if it doesn't answer IO queries after a timeout (md arrays only read from the smallest subset of devices needed to get the data; they don't verify redundancy on the fly, for performance reasons). This may not be the case when you explicitly ask an array to perform a check, though (I don't have any first-hand check failure coming to mind).
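
For reference, such an explicit check can be requested and inspected through the md sysfs interface. A minimal sketch, assuming an array exposed as md0, root privileges, and the usual /sys/block/<md>/md/ layout (verify the paths on your kernel):

#!/usr/bin/env python3
# Sketch: trigger an explicit md consistency check and read the mismatch counter.
# Assumes an array exposed as md0 and root privileges; the paths follow the
# usual /sys/block/<md>/md/ sysfs layout, but double-check them on your kernel.

from pathlib import Path

MD = Path("/sys/block/md0/md")

def start_check():
    # "check" makes md read every copy and count mismatches without rewriting
    # anything; "repair" would also rewrite the redundancy.
    (MD / "sync_action").write_text("check\n")

def status():
    action = (MD / "sync_action").read_text().strip()    # e.g. "check" or "idle"
    mismatches = int((MD / "mismatch_cnt").read_text())   # sectors found to differ
    return action, mismatches

if __name__ == "__main__":
    start_check()
    print("sync_action=%s mismatch_cnt=%d" % status())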

>, and the data is read again from the other devices in the array. The problem is that in SW-RAID1 we don't have the bad blocks isolated. The disks can be synchronized again, as the write operation is not tested. The problem (the device falling out of the array) will happen again if we try to read any other data written over the bad block.

With consumer-level SATA drives, bad blocks are handled internally nowadays: the drives remap bad sectors to a reserve area by trying to copy their content there (this might fail, and md might not have the opportunity to correct the error: it doesn't use checksums, so it can't tell which drive has unaltered data, only which one doesn't answer IO queries).
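
A quick way to see whether a drive has already started eating into that reserve is to look at its SMART counters. A minimal sketch using smartctl; it assumes smartmontools is installed and /dev/sda is the drive of interest, and the attribute names and raw-value formats vary by vendor, so treat the parsing as an assumption:

#!/usr/bin/env python3
# Sketch: report the SMART attributes related to sector remapping for one drive.
# Assumes smartmontools is installed and /dev/sda is the drive of interest;
# attribute names and raw-value formats differ between vendors.

import subprocess

ATTRS_OF_INTEREST = ("Reallocated_Sector_Ct",
                     "Current_Pending_Sector",
                     "Offline_Uncorrectable")

def remap_counters(device="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    counters = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in ATTRS_OF_INTEREST:
            counters[fields[1]] = int(fields[-1])   # RAW_VALUE is the last column
    return counters

if __name__ == "__main__":
    print(remap_counters())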


Hmm, suppose the drive is unable to remap bad blocks internally. When you write data to the drive, it will also write the data checksum in hardware.

One weak data checksum which is not available to the kernel, yes. Filesystems and applications on top of them may use stronger checksums and handle read problems that the drives can't detect themselves.

When you read the data, the drive will compare it to the checksum that was written previously. If that fails, the drive will reset and the SW-RAID will drop the drive. This is how SATA drives work.

If it fails, AFAIK from past experience the drive doesn't reset by itself: the kernel driver in charge of the device will receive an IO error and will retry the IO several times. One of those later attempts might succeed (errors aren't always repeatable), and eventually, after a timeout, the kernel will try to reset the interface with the drive and the drive itself (it doesn't know where the problem is, only that it didn't get the result it was expecting).
While this happens, I believe the filesystem/md/lvm/... stack can receive an IO error (the timeout at their level might not be the same as the timeout at the device level). So some errors can be masked from md and some can percolate through. In the latter case, yes, the md array will drop the device.
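
The timeout the kernel waits for before escalating to resets is per device and visible in sysfs. A small sketch to inspect (and, with care, change) it, assuming a SATA/SCSI disk exposed as sda; raising it only changes when the kernel gives up, it doesn't make a failing drive healthy:

#!/usr/bin/env python3
# Sketch: read and optionally change the SCSI command timeout for a disk.
# Assumes a SATA/SCSI device exposed as sda; the value is in seconds and
# changing it needs root.

from pathlib import Path
import sys

def timeout_path(dev):
    return Path("/sys/block/%s/device/timeout" % dev)

def get_timeout(dev="sda"):
    return int(timeout_path(dev).read_text())

def set_timeout(dev, seconds):
    timeout_path(dev).write_text("%d\n" % seconds)

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "sda"
    print("%s: current command timeout = %ds" % (dev, get_timeout(dev)))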

 


> My new question regarding Ceph is whether it isolates these bad sectors where it found bad data when scrubbing, or will there always be a replica of something over a known bad block?
Ceph OSDs don't know about bad sectors; they delegate IO to the filesystems beneath them. Some filesystems (ZFS or BTRFS when using redundancy at their level) can recover from data corrupted on one drive, and those same filesystems will refuse to give the Ceph OSD data when they detect corruption on non-redundant filesystems. Ceph detects this (usually during scrubs), and a manual Ceph repair will then rewrite data over the corrupted data (at that point, if the underlying drive detected a bad sector, it will not reuse it).
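
When a scrub does flag an inconsistency, the usual workflow is to find the affected placement groups and ask Ceph to repair them. A hedged sketch driving the standard ceph CLI (ceph health detail and ceph pg repair); it assumes an admin-capable client on the path and that the health output format matches recent releases:

#!/usr/bin/env python3
# Sketch: list inconsistent PGs and issue a repair for each one.
# Assumes the ceph CLI is installed and has admin access; "health detail"
# and "pg repair" are the standard subcommands, but check your release.

import subprocess

def ceph(*args):
    return subprocess.run(["ceph"] + list(args),
                          capture_output=True, text=True, check=True).stdout

def inconsistent_pgs():
    # "ceph health detail" prints lines such as "pg 2.1f is ... inconsistent ..."
    pgs = []
    for line in ceph("health", "detail").splitlines():
        words = line.split()
        if "inconsistent" in line and len(words) >= 2 and words[0] == "pg":
            pgs.append(words[1])
    return pgs

if __name__ == "__main__":
    for pgid in inconsistent_pgs():
        print("repairing %s" % pgid)
        ceph("pg", "repair", pgid)   # rewrites the bad copy from a good one

Note that, at least on the Ceph releases of that era, repair rewrites from the primary OSD, so it is worth making sure the primary actually holds the good copy before issuing a blind repair.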

Just forget about the hardware bad block remap list. It got filled as soon as we started to use the drive... :)

Then you can move this drive to the trash pile/ask for a replacement. It is basically unusable.

 

> > I also saw that Ceph uses some metrics when capturing data from disks. When the disk is resetting or has problems, its metrics are going to be bad and the cluster will rank this OSD badly. But I didn't see any way of sending alerts or anything like that. SW-RAID has its mdadm monitor that alerts when things go bad. Do I have to be looking at the Ceph logs all the time to see when things go bad?
I'm not aware of any OSD "ranking".

Lionel
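
On the alerting side of that question: rather than watching the Ceph logs, one can simply poll the cluster health and alert on anything other than HEALTH_OK. A minimal sketch, assuming a configured ceph CLI on the monitoring host and a local SMTP server on port 25; the addresses are placeholders:

#!/usr/bin/env python3
# Sketch: poll "ceph health" and send a mail when the cluster is not HEALTH_OK.
# Assumes a configured ceph CLI on this host and an SMTP server on localhost;
# the addresses below are placeholders.

import smtplib
import subprocess
import time
from email.message import EmailMessage

POLL_SECONDS = 60
ALERT_TO = "ops@example.com"            # placeholder
ALERT_FROM = "ceph-monitor@example.com" # placeholder

def cluster_health():
    return subprocess.run(["ceph", "health"],
                          capture_output=True, text=True, check=True).stdout.strip()

def send_alert(status):
    msg = EmailMessage()
    msg["Subject"] = "Ceph cluster health: %s" % status.split()[0]
    msg["From"], msg["To"] = ALERT_FROM, ALERT_TO
    msg.set_content(status)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    while True:
        status = cluster_health()
        if not status.startswith("HEALTH_OK"):
            send_alert(status)
        time.sleep(POLL_SECONDS)

In practice most deployments hook "ceph health" into whatever monitoring system is already in place (Nagios, Zabbix, ...) rather than a hand-rolled loop like this.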


Does "weight" means the same?

There are two weights I'm aware of: the CRUSH weight of an OSD and the temporary OSD weight. The first is the basic weight used by CRUSH to choose how to split your data (an OSD with a weight of 2 is expected to get roughly twice the amount of data of an OSD with a weight of 1 on a normal Ceph cluster); the second is used for temporary adjustments when an OSD gets temporarily overused (typically during cluster-wide rebalancing) and is reset when the OSD rejoins the cluster (is marked in).

Neither of these weights has anything to do with the health of the OSD's underlying device ("being bad").

Lionel
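
For reference, those two weights are inspected and adjusted with different commands. A short sketch of the distinction, using osd.3 and placeholder values as assumptions (ceph osd crush reweight, ceph osd reweight and ceph osd tree are the standard subcommands, but check your release):

#!/usr/bin/env python3
# Sketch: illustrate the two different weights on one OSD (osd.3 as an example).
# "ceph osd crush reweight" changes the permanent CRUSH weight (data share),
# "ceph osd reweight" changes the temporary 0..1 override weight.
# Assumes an admin-capable ceph CLI; osd.3 and the values are placeholders.

import subprocess

def ceph(*args):
    return subprocess.run(["ceph"] + list(args),
                          capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    # Permanent CRUSH weight, usually sized to the disk capacity.
    print(ceph("osd", "crush", "reweight", "osd.3", "1.0"))

    # Temporary override weight between 0.0 and 1.0, reset when the OSD
    # is marked out and then rejoins the cluster (is marked in again).
    print(ceph("osd", "reweight", "3", "0.8"))

    # "ceph osd tree" shows both, in the WEIGHT and REWEIGHT columns.
    print(ceph("osd", "tree"))
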
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
