Thanks. I'll have to get more
creative. :-)
On 6/14/13 18:19, Gregory Farnum wrote:
Yeah. You've picked up on some warty bits of Ceph's
error handling here for sure, but it's exacerbated by the fact
that you're not simulating what you think. In a real disk error
situation the filesystem would be returning EIO or something, but
here it's returning ENOENT. Since the OSD is authoritative for
that key space and the filesystem says there is no such object,
presto! It doesn't exist.
If you restart the OSD it scans the PGs it has on disk against
what it should have, so it can pick up on the data not being
there and recover. But "correctly" handling data that has been
(from the local FS' perspective) properly deleted under a
running process would require huge and expensive contortions on
the part of the daemon (in any distributed system that I can
think of).
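For example (just a sketch; the exact restart command depends on how the
cluster was deployed, this assumes the sysvinit/service wrapper), bouncing
the OSD and watching the cluster log should let you see it notice the
missing data and pull it back from the other replica:
# restart osd.0 so it rescans its PGs on startup (sysvinit-style deployment assumed)
sudo service ceph restart osd.0
# stream cluster status/log events while it detects the loss and recovers
ceph -w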
-Greg
On Friday, June 14, 2013, Craig Lewis wrote:
So I'm trying to break
my test cluster, and figure out how to put it back together
again. I'm able to fix this, but the behavior seems strange
to me, so I wanted to run it past more experienced people.
I'm doing these tests using RadosGW. I currently have 2
nodes, with replication=2. (I haven't gotten to the cluster
expansion testing yet).
I'm going to upload a file, then simulate a disk failure by
deleting some PGs on one of the OSDs. I have seen this
mentioned as the way to fix OSDs that filled up during
recovery/backfill. I expected the cluster to detect the
error, change the cluster health to HEALTH_WARN, and then return
the data from another copy. Instead, I got a 404 error.
me@client ~ $ s3cmd ls
2013-06-12 00:02  s3://bucket1
me@client ~ $ s3cmd ls s3://bucket1
2013-06-12 00:02        13  8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt
me@client ~ $ s3cmd put Object1 s3://bucket1
Object1 -> s3://bucket1/Object1  [1 of 1]
 400000000 of 400000000   100% in   62s     6.13 MB/s  done
me@client ~ $ s3cmd ls s3://bucket1
2013-06-13 01:10      381M  15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
2013-06-12 00:02        13  8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt
So at this point, the cluster is healthy, and we can
download objects from RGW.
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4055: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 17B/s rd, 0op/s
   mdsmap e1: 0/0/1 up
me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download1
s3://bucket1/Object1 -> ./Object.Download1  [1 of 1]
 400000000 of 400000000   100% in   13s    27.63 MB/s  done
Time to simulate a failure. Let's delete all the PGs
used by .rgw.buckets on OSD.0.
me@dev-ceph0:~$ ceph osd tree
# id    weight   type name           up/down  reweight
-1      0.09998  root default
-2      0.04999      host dev-ceph0
0       0.04999          osd.0       up       1
-3      0.04999      host dev-ceph1
1       0.04999          osd.1       up       1
me@dev-ceph0:~$ ceph osd dump | grep .rgw.buckets
pool 9 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 21 owner 18446744073709551615
me@dev-ceph0:~$ cd /var/lib/ceph/osd/ceph-0/current
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ du -sh 9.*
321M    9.0_head
289M    9.1_head
425M    9.2_head
357M    9.3_head
358M    9.4_head
309M    9.5_head
401M    9.6_head
397M    9.7_head
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ sudo rm -rf 9.*
The cluster is still healthy
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up
It probably hasn't noticed the damage yet; there's no
I/O on this test cluster unless I generate it. Let's
retrieve some data, that'll make the cluster notice.
me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
s3://bucket1/Object1 -> ./Object.Download2  [1 of 1]
ERROR: S3 error: 404 (Not Found):
me@client ~ $ s3cmd ls s3://bucket1
ERROR: S3 error: 404 (NoSuchKey):
I wasn't expecting that. I expected my object to still
be accessible. Worst case, it should be accessible 50% of
the time. Instead, it's 0% accessible. And the cluster
thinks it's still healthy:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up
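(As an aside, I think ceph osd map would show the acting set of OSDs for a
given object in .rgw.buckets, which is how I'd sanity-check that 50%
expectation. The object name below is just a placeholder; radosgw stripes a
large upload across several RADOS objects, so the real names would come from
rados -p .rgw.buckets ls.)
# placeholder object name; list the real ones with: rados -p .rgw.buckets ls
ceph osd map .rgw.buckets some_rgw_object_name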
Scrubbing the PGs makes the cluster's status reflect the
damage, but it still doesn't let me download:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ for i in `seq 0 7`
> do
> ceph pg scrub 9.$i
> done
instructing pg 9.0 on osd.0 to scrub
instructing pg 9.1 on osd.0 to scrub
instructing pg 9.2 on osd.1 to scrub
instructing pg 9.3 on osd.0 to scrub
instructing pg 9.4 on osd.0 to scrub
instructing pg 9.5 on osd.1 to scrub
instructing pg 9.6 on osd.1 to scrub
instructing pg 9.7 on osd.0 to scrub
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
   pgmap v4105: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 5088 MB used, 97258 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up
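(Side note: if I'm reading the docs right, ceph health detail spells out
exactly which PGs are behind those scrub errors, rather than just counting
them.)
ceph health detail    # lists the individual inconsistent PGs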
And I still can't download my data
me@client ~ $ s3cmd ls s3://bucket1
ERROR: S3 error: 404 (NoSuchKey):
To fix this, I have to scrub the OSD
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph osd scrub 0
osd.0 instructed to scrub
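(ceph -w in a second terminal is a convenient way to follow the scrub, and
the recovery it kicks off, as it happens.)
ceph -w    # stream cluster status and log events while the scrub runs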
This runs for a while, until it reaches the affected PGs.
Then the PGs start recovering:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 2 pgs recovering; 6 pgs recovery_wait; 8 pgs stuck unclean; recovery 988/1534 degraded (64.407%); recovering 2 o/s, 10647KB/s; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4151: 248 pgs: 240 active+clean, 4 active+recovery_wait, 1 active+recovering+inconsistent, 2 active+recovery_wait+inconsistent, 1 active+recovering; 2852 MB data, 5125 MB used
   7KB/s
   mdsmap e1: 0/0/1 up
As soon as the cluster starts recovering, I can access my
object again:
me@client ~ $ s3cmd ls s3://bucket1
2013-06-13 01:10      381M  15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
2013-06-12 00:02        13  8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt
me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
s3://bucket1/Object1 -> ./Object.Download2  [1 of 1]
 400000000 of 400000000   100% in   92s     4.13 MB/s  done
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 5 pgs recovering; 5 pgs stuck unclean; recovery 228/1534 degraded (14.863%); recovering 2 o/s, 11025KB/s; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4259: 248 pgs: 241 active+clean, 1 active+recovering+inconsistent, 2 active+clean+inconsistent, 4 active+recovering; 2852 MB data, 7428 MB used, 94919 MB / 102347 MB avail; 22
   mdsmap e1: 0/0/1 up
Everything continues to work, but the cluster doesn't
completely heal:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4280: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up
At this point, I have to scrub the inconsistent PGs
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph pg dump | grep inconsistent | cut -f1 | while read pg
> do
> ceph pg scrub $pg
> done
instructing pg 9.5 on osd.1 to scrub
instructing pg 9.2 on osd.1 to scrub
instructing pg 9.6 on osd.1 to scrub
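(For what it's worth, I gather ceph pg repair is the more direct tool for
inconsistent PGs; it scrubs the PG and then tries to repair the bad copies.
I haven't tried it in this scenario, so this is just a sketch of what I'd
expect to run:)
ceph pg dump | grep inconsistent | cut -f1 | while read pg
do
  ceph pg repair $pg    # scrub, then attempt to repair, each inconsistent PG
done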
Everything continues to work, and the cluster has now fully
recovered:
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e47: 2 osds: 2 up, 2 in
   pgmap v4283: 248 pgs: 248 active+clean; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
   mdsmap e1: 0/0/1 up
So I'm a bit confused.
Why was the data not accessible between the data loss and
the manual OSD scrub?
What's the effective difference between the PG scrub and the
OSD scrub?
Thanks for the help.
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx