Re: Simulating Disk Failure

Gregory Farnum <greg@xxxxxxxxxxx> · Fri, 14 Jun 2013 18:19:27 -0700

Yeah. You've picked up on some warty bits of Ceph's error handling here for sure, but it's exacerbated by the fact that you're not simulating what you think. In a real disk error situation the filesystem would be returning EIO or something, but here it's returning ENOENT. Since the OSD is authoritative for that key space and the filesystem says there is no such object, presto! It doesn't exist.
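
To make the ENOENT/EIO distinction concrete, here is a minimal Python sketch. It is not Ceph code: `classify_error` and `classify_read` are made-up names, and the temporary file only stands in for an object file under an OSD's data directory. It shows that deleting a file under a running process yields ENOENT (an authoritative "no such file"), which looks nothing like the EIO a failing disk would typically produce.

```python
import errno
import os
import tempfile


def classify_error(exc):
    """Map an OSError to 'missing' (ENOENT) or 'io_error' (anything else, e.g. EIO)."""
    if exc.errno == errno.ENOENT:
        # The filesystem is asserting that the file does not exist.
        return "missing"
    # EIO and similar: the file should exist but could not be read.
    return "io_error"


def classify_read(path):
    """Try to read one byte from path and report 'ok', 'missing', or 'io_error'."""
    try:
        with open(path, "rb") as f:
            f.read(1)
        return "ok"
    except OSError as exc:
        return classify_error(exc)


with tempfile.TemporaryDirectory() as d:
    obj = os.path.join(d, "object")  # hypothetical stand-in for an object file
    with open(obj, "wb") as f:
        f.write(b"hello")
    before = classify_read(obj)
    os.remove(obj)  # analogous to deleting data out from under the daemon
    after = classify_read(obj)

print(before, after)
```

A reader that only sees "missing" has no local reason to suspect damage, which is roughly why the request ends in a 404 rather than a failover to the other replica.
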
If you restart the OSD it does a scan of the PGs on-disk as well as what it should have, and can pick up on the data not being there and recover. But "correctly" handling data that has been (from the local FS' perspective) properly deleted under a running process would require huge and expensive contortions on the part of the daemon (in any distributed system that I can think of).
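
As a rough illustration of that startup comparison (a toy sketch only, not how the OSD actually implements it), the idea is a set difference between the PGs the daemon should hold and the PG directories actually present. The function name `missing_pgs` and the `expected` list are assumptions for the example; the `<pgid>_head` directory naming is taken from the `du` output in the quoted message.

```python
import os
import tempfile


def missing_pgs(current_dir, expected_pgs):
    """Return the expected PG ids that have no <pgid>_head directory on disk."""
    on_disk = {
        name[: -len("_head")]
        for name in os.listdir(current_dir)
        if name.endswith("_head")
    }
    return sorted(set(expected_pgs) - on_disk)


with tempfile.TemporaryDirectory() as current:
    # Hypothetical expectation: this OSD should hold one PG of pool 0
    # and all eight PGs of pool 9.
    expected = ["0.1"] + ["9.%d" % i for i in range(8)]
    # Only the pool 0 directory exists; the pool 9 directories are gone,
    # as they would be after `rm -rf 9.*`.
    os.mkdir(os.path.join(current, "0.1_head"))
    lost = missing_pgs(current, expected)

print(lost)
```

Doing this once at startup is cheap; doing the equivalent check on every read of a running daemon is the expensive part.
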
-Greg

On Friday, June 14, 2013, Craig Lewis wrote:

    So I'm trying to break my test cluster, and figure out how to put it
    back together again.  I'm able to fix this, but the behavior seems
    strange to me, so I wanted to run it past more experienced people.

    I'm doing these tests using RadosGW.  I currently have 2 nodes, with
    replication=2.  (I haven't gotten to the cluster expansion testing
    yet).

    I'm going to upload a file, then simulate a disk failure by deleting
    some PGs on one of the OSDs.  I have seen this mentioned as the way
    to fix OSDs that filled up during recovery/backfill.  I expected the
    cluster to detect the error, change the cluster health to warn, then
    return the data from another copy.  Instead, I got a 404 error.

    me@client ~ $ s3cmd ls
    2013-06-12 00:02  s3://bucket1

    me@client ~ $ s3cmd ls s3://bucket1
    2013-06-12 00:02        13   8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt

    me@client ~ $ s3cmd put Object1 s3://bucket1
    Object1 -> s3://bucket1/Object1  [1 of 1]
     400000000 of 400000000   100% in   62s     6.13 MB/s  done

    me@client ~ $ s3cmd ls s3://bucket1
    2013-06-13 01:10       381M  15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
    2013-06-12 00:02        13   8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt

    So at this point, the cluster is healthy, and we can download
    objects from RGW.

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
       health HEALTH_OK
       monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
       osdmap e44: 2 osds: 2 up, 2 in
        pgmap v4055: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 17B/s rd, 0op/s
       mdsmap e1: 0/0/1 up

    me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download1
    s3://bucket1/Object1 -> ./Object.Download1  [1 of 1]
     400000000 of 400000000   100% in   13s    27.63 MB/s  done

    Time to simulate a failure.  Let's delete all the PGs used by
    .rgw.buckets on OSD.0.

    me@dev-ceph0:~$ ceph osd tree
    # id    weight     type name              up/down    reweight
    -1      0.09998    root default
    -2      0.04999        host dev-ceph0
    0       0.04999            osd.0          up         1
    -3      0.04999        host dev-ceph1
    1       0.04999            osd.1          up         1

    me@dev-ceph0:~$ ceph osd dump | grep .rgw.buckets
    pool 9 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 21 owner 18446744073709551615

    me@dev-ceph0:~$ cd /var/lib/ceph/osd/ceph-0/current
    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ du -sh 9.*
    321M    9.0_head
    289M    9.1_head
    425M    9.2_head
    357M    9.3_head
    358M    9.4_head
    309M    9.5_head
    401M    9.6_head
    397M    9.7_head

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ sudo rm -rf 9.*

    The cluster is still healthy

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
       health HEALTH_OK
       monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
       osdmap e44: 2 osds: 2 up, 2 in
        pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
       mdsmap e1: 0/0/1 up

    It probably hasn't noticed the damage yet, since there's no I/O on
    this test cluster unless I generate it.  Let's retrieve some data;
    that'll make the cluster notice.

    me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
    s3://bucket1/Object1 -> ./Object.Download2  [1 of 1]
    ERROR: S3 error: 404 (Not Found):

    me@client ~ $ s3cmd ls s3://bucket1
    ERROR: S3 error: 404 (NoSuchKey):

    I wasn't expecting that.  I expected my object to still be
    accessible.  Worst case, it should be accessible 50% of the time. 
    Instead, it's 0% accessible.  And the cluster thinks it's still
    healthy:

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
       health HEALTH_OK
       monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
       osdmap e44: 2 osds: 2 up, 2 in
        pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
       mdsmap e1: 0/0/1 up

    Scrubbing the PGs corrects the cluster's status, but still doesn't
    let me download

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ for i in `seq 0 7`
    >  do
    >   ceph pg scrub 9.$i
    > done
    instructing pg 9.0 on osd.0 to scrub
    instructing pg 9.1 on osd.0 to scrub
    instructing pg 9.2 on osd.1 to scrub
    instructing pg 9.3 on osd.0 to scrub
    instructing pg 9.4 on osd.0 to scrub
    instructing pg 9.5 on osd.1 to scrub
    instructing pg 9.6 on osd.1 to scrub
    instructing pg 9.7 on osd.0 to scrub

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
       health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
       monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
       osdmap e44: 2 osds: 2 up, 2 in
        pgmap v4105: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 5088 MB used, 97258 MB / 102347 MB avail
       mdsmap e1: 0/0/1 up

    And I still can't download my data

    me@client ~ $ s3cmd ls s3://bucket1
    ERROR: S3 error: 404 (NoSuchKey):

    To fix this, I have to scrub the OSD

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph osd scrub 0
    osd.0 instructed to scrub

    This runs for a while, until it reaches the affected PGs.  Then the
    PGs are recovering:

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
       health HEALTH_ERR 3 pgs inconsistent; 2 pgs recovering; 6 pgs recovery_wait; 8 pgs stuck unclean; recovery 988/1534 degraded (64.407%);  recovering 2 o/s, 10647KB/s; 284 scrub errors
       monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
       osdmap e47: 2 osds: 2 up, 2 in
        pgmap v4151: 248 pgs: 240 active+clean, 4 active+recovery_wait, 1 active+recovering+inconsistent, 2 active+recovery_wait+inconsistent, 1 active+recovering; 2852 MB data, 5125 MB used
    7KB/s
       mdsmap e1: 0/0/1 up

    As soon as the cluster starts recovering, I can access my object
    again:

    me@client ~ $ s3cmd ls s3://bucket1
    2013-06-13 01:10       381M  15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
    2013-06-12 00:02        13   8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt

    me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
    s3://bucket1/Object1 -> ./Object.Download2  [1 of 1]
     400000000 of 400000000   100% in   92s     4.13 MB/s  done

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
       health HEALTH_ERR 3 pgs inconsistent; 5 pgs recovering; 5 pgs stuck unclean; recovery 228/1534 degraded (14.863%);  recovering 2 o/s, 11025KB/s; 284 scrub errors
       monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
       osdmap e47: 2 osds: 2 up, 2 in
        pgmap v4259: 248 pgs: 241 active+clean, 1 active+recovering+inconsistent, 2 active+clean+inconsistent, 4 active+recovering; 2852 MB data, 7428 MB used, 94919 MB / 102347 MB avail; 22
       mdsmap e1: 0/0/1 up

    Everything continues to work, but the cluster doesn't completely
    heal:

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
       health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
       monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
       osdmap e47: 2 osds: 2 up, 2 in
        pgmap v4280: 248 pgs: 245 active+clean, 3 active+clean+inconsistent; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
       mdsmap e1: 0/0/1 up

    At this point, I have to scrub the inconsistent PGs

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph pg dump | grep inconsistent | cut -f1 | while read pg
    >  do
    >   ceph pg scrub $pg
    > done
    instructing pg 9.5 on osd.1 to scrub
    instructing pg 9.2 on osd.1 to scrub
    instructing pg 9.6 on osd.1 to scrub

    Everything continues to work until the cluster has fully recovered.

    me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
       health HEALTH_OK
       monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
       osdmap e47: 2 osds: 2 up, 2 in
        pgmap v4283: 248 pgs: 248 active+clean; 2852 MB data, 7934 MB used, 94413 MB / 102347 MB avail
       mdsmap e1: 0/0/1 up

    So I'm a bit confused.  

    Why was the data not accessible between the data loss and the manual
    OSD scrub?  

    What is the effective difference between the PG scrub and the OSD
    scrub?

    Thanks for the help.

    -- 
    Craig Lewis
    Senior Systems Engineer
    Office +1.714.602.1309
    Email clewis@xxxxxxxxxxxxxxxxxx

    Central Desktop.
    Work together in ways you never thought possible.


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com