Re: Accidentally Remove OSDs

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Thu, 23 Apr 2015 20:08:24 -0600

Which hosts were those OSDs on? I'm concerned that the two OSDs for some of the PGs were adjacent; if that means they were on the same host, it would be contrary to your rules and something deeper is wrong.
Did you format the disks that were taken out of the cluster? Can you mount the partitions and see the files and directories? If so, you can probably recover the data using the recovery/dev tools.
You may be able to force-create the missing PGs using 'ceph pg force_create_pg <pg.id>'. This may or may not work; I don't remember.
If you just don't care about losing data, you can delete the pool and create a new one. This should work for sure, but it loses any data that you might still have had. If this pool was full of RBD images, then there is a high probability that all of them had chunks in the missing PGs. If you choose not to try to restore the PGs using the tools, I'd be inclined to delete the pool and restore from backup so as not to be surprised by data corruption in the images. Neither option is ideal or quick.
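
For the force-create route, here is a minimal dry-run sketch in Python. It only builds and prints the command lines, so nothing touches the cluster. It assumes the Giant-era syntax 'ceph pg force_create_pg <pgid>' (later releases changed this command), the helper name is made up for illustration, and the PG IDs are a few taken from the stale list quoted below.

```python
import shlex


def force_create_commands(pgids):
    """Build one 'ceph pg force_create_pg' argv list per PG ID (dry run only)."""
    commands = []
    for pgid in pgids:
        pool, _, seed = pgid.partition(".")
        # Sanity check: a PG ID looks like <pool number>.<hex seed>, e.g. 17.a2
        if not pool.isdigit() or not seed:
            raise ValueError("not a PG ID: %r" % pgid)
        int(seed, 16)  # raises ValueError if the seed is not hexadecimal
        commands.append(["ceph", "pg", "force_create_pg", pgid])
    return commands


# A few PG IDs from the stale list in this thread (illustrative subset).
stale_pgids = ["17.a2", "17.c6", "17.de"]
commands = force_create_commands(stale_pgids)
for argv in commands:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Review the printed commands before running any of them by hand; a force-created PG does not bring back whatever data the old one held.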
Robert LeBlanc
Sent from a mobile device please excuse any typos.
On Apr 23, 2015 6:42 PM, "FaHui Lin" <fahui.lin@xxxxxxxxxx> wrote:

    Hi, thank you for your response.

    Well, I've not only taken out but also completely removed both OSDs
    of that pg (with "ceph osd rm" and by deleting everything in
    /var/lib/ceph/osd/<related OSDs>), and the same goes for all the
    other stale pgs.

    The main problem is that those stale pgs (which have lost all the
    OSDs I removed) not only cause the ceph health warning, but also
    prevent other machines from mounting the ceph rbd.

    Here's the full crush map.  The OSDs I removed were osd.5~19.

        # begin crush map
        tunable choose_local_tries 0
        tunable choose_local_fallback_tries 0
        tunable choose_total_tries 500

        # devices
        device 0 osd.0
        device 1 device1
        device 2 osd.2
        device 3 osd.3
        device 4 osd.4
        device 5 device5
        device 6 device6
        device 7 device7
        device 8 device8
        device 9 device9
        device 10 device10
        device 11 device11
        device 12 device12
        device 13 device13
        device 14 device14
        device 15 device15
        device 16 device16
        device 17 device17
        device 18 device18
        device 19 device19
        device 20 osd.20
        device 21 osd.21
        device 22 osd.22
        device 23 osd.23
        device 24 osd.24
        device 25 osd.25
        device 26 osd.26
        device 27 osd.27

        # types
        type 0 osd
        type 1 host
        type 2 rack
        type 3 row
        type 4 room
        type 5 datacenter
        type 6 root

        # buckets
        host XX-ceph01 {
                id -2           # do not change unnecessarily
                # weight 160.040
                alg straw
                hash 0  # rjenkins1
                item osd.0 weight 40.010
                item osd.2 weight 40.010
                item osd.3 weight 40.010
                item osd.4 weight 40.010
        }
        host XX-ceph02 {
                id -3           # do not change unnecessarily
                # weight 320.160
                alg straw
                hash 0  # rjenkins1
                item osd.20 weight 40.020
                item osd.21 weight 40.020
                item osd.22 weight 40.020
                item osd.23 weight 40.020
                item osd.24 weight 40.020
                item osd.25 weight 40.020
                item osd.26 weight 40.020
                item osd.27 weight 40.020
        }
        root default {
                id -1           # do not change unnecessarily
                # weight 480.200
                alg straw
                hash 0  # rjenkins1
                item XX-ceph01 weight 160.040
                item XX-ceph02 weight 320.160
        }

        # rules
        rule data {
                ruleset 0
                type replicated
                min_size 1
                max_size 10
                step take default
                step chooseleaf firstn 0 type host
                step emit
        }
        rule metadata {
                ruleset 1
                type replicated
                min_size 1
                max_size 10
                step take default
                step chooseleaf firstn 0 type host
                step emit
        }
        rule rbd {
                ruleset 2
                type replicated
                min_size 1
                max_size 10
                step take default
                step chooseleaf firstn 0 type host
                step emit
        }

        # end crush map

    List of some stale pgs:

        pg_stat  objects  mip  degr  misp  unf  bytes  log  disklog  state  state_stamp  v  reported  up  up_primary  acting  acting_primary  last_scrub  scrub_stamp  last_deep_scrub  deep_scrub_stamp
        17.c6  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:09.358613  0'0  2706:216  [19,13]  19  [19,13]  19  0'0  2015-04-16 02:29:34.882038  0'0  2015-04-16 02:29:34.882038
        17.c7  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:28.304621  0'0  2718:262  [15,18]  15  [15,18]  15  0'0  2015-04-20 09:15:39.363310  0'0  2015-04-20 09:15:39.363310
        17.c1  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:01.073681  0'0  2706:199  [19,16]  19  [19,16]  19  0'0  2015-04-15 12:37:11.741251  0'0  2015-04-15 12:37:11.741251
        17.de  0  0  0  0  0  0  0  0  stale+active+undersized+degraded  2015-04-20 23:41:29.436796  0'0  2718:267  [15]  15  [15]  15  0'0  2015-04-13 07:56:01.760824  0'0  2015-04-13 07:56:01.760824
        17.da  0  0  0  0  0  0  0  0  stale+active+undersized+degraded  2015-04-20 23:41:50.001087  0'0  2718:232  [14]  14  [14]  14  0'0  2015-04-19 15:45:53.304596  0'0  2015-04-19 15:45:53.304596
        17.d9  0  0  0  0  0  0  0  0  stale+active+undersized+degraded  2015-04-20 23:41:29.472983  0'0  2718:270  [14]  14  [14]  14  0'0  2015-04-16 01:55:44.183550  0'0  2015-04-16 01:55:44.183550
        17.d7  0  0  0  0  0  0  0  0  stale+active+undersized+degraded  2015-04-20 23:41:53.839134  0'0  2718:68  [17]  17  [17]  17  0'0  2015-04-16 00:06:27.998210  0'0  2015-04-16 00:06:27.998210
        17.d5  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:28.311352  0'0  2718:226  [18,17]  18  [18,17]  18  0'0  2015-04-15 20:52:33.372369  0'0  2015-04-15 20:52:33.372369
        17.d0  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:24.850188  0'0  2718:213  [15,12]  15  [15,12]  15  0'0  2015-04-19 15:40:32.215234  0'0  2015-04-19 15:40:32.215234
        17.d1  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:24.849996  0'0  2718:227  [15,12]  15  [15,12]  15  0'0  2015-04-15 19:03:38.137147  0'0  2015-04-15 19:03:38.137147
        17.ae  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:28.310506  0'0  2718:231  [18,12]  18  [18,12]  18  0'0  2015-04-16 02:23:35.031329  0'0  2015-04-16 02:23:35.031329
        17.ac  0  0  0  0  0  0  0  0  stale+active+undersized+degraded  2015-04-20 23:41:50.002406  0'0  2718:66  [12]  12  [12]  12  0'0  2015-04-16 02:23:33.023476  0'0  2015-04-16 02:23:33.023476
        17.aa  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:25.983034  0'0  2718:213  [15,14]  15  [15,14]  15  0'0  2015-04-19 15:32:38.896039  0'0  2015-04-19 15:32:38.896039
        17.ab  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:24.836133  0'0  2718:260  [12,17]  12  [12,17]  12  0'0  2015-04-19 15:32:44.905707  0'0  2015-04-19 15:32:44.905707
        17.a8  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:09.361319  0'0  2706:212  [19,13]  19  [19,13]  19  0'0  2015-04-16 02:23:32.026015  0'0  2015-04-16 02:23:32.026015
        17.a6  0  0  0  0  0  0  0  0  stale+active+undersized+degraded  2015-04-20 23:41:50.002804  0'0  2718:96  [18]  18  [18]  18  0'0  2015-04-20 14:02:29.334181  0'0  2015-04-20 14:02:29.334181
        17.a4  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:28.310707  0'0  2718:232  [18,17]  18  [18,17]  18  0'0  2015-04-16 02:22:12.018136  0'0  2015-04-16 02:22:12.018136
        17.a2  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:11.624952  0'0  2718:200  [15,17]  15  [15,17]  15  0'0  2015-04-15 10:42:37.880699  0'0  2015-04-15 10:42:37.880699
        17.a0  0  0  0  0  0  0  0  0  stale+active+undersized+degraded  2015-04-20 23:41:29.469600  0'0  2718:66  [18]  18  [18]  18  0'0  2015-04-16 02:22:08.992748  0'0  2015-04-16 02:22:08.992748

    OSDs of those pgs (either primary or secondary) are totally gone,
    and I cannot find a way to repair them.
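
To make "totally gone" concrete, the following is a small Python sketch that checks, for a few rows copied from the listing above, whether any OSD in the last reported acting set survived the removal of osd.5 through osd.19. It assumes the last reported acting set is where the only copies lived; the helper names are made up for illustration.

```python
import re

REMOVED_OSDS = set(range(5, 20))  # osd.5 through osd.19, as stated in this message

# (pgid, state, acting set) copied from three rows of the stale pg listing
ROWS = [
    ("17.c6", "stale+active+clean", "[19,13]"),
    ("17.de", "stale+active+undersized+degraded", "[15]"),
    ("17.a2", "stale+active+clean", "[15,17]"),
]


def parse_osd_set(text):
    """Turn a string such as '[19,13]' into [19, 13]."""
    return [int(n) for n in re.findall(r"\d+", text)]


def surviving_replicas(acting):
    """Return the OSD IDs of the acting set that were not removed."""
    return [osd for osd in acting if osd not in REMOVED_OSDS]


# A pg is considered lost here when none of its acting OSDs survived.
lost = [pgid for pgid, _state, acting in ROWS
        if not surviving_replicas(parse_osd_set(acting))]
print(lost)
```

Every row checked this way has no surviving replica, which is consistent with the pgs staying stale no matter what the remaining OSDs do.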

    I have another machine with new drive partitions, and I tried to
    re-create the OSDs I had removed on it, but they would become osd.28,
    29, etc. That's why I wondered how to change the ID number of an OSD.

    Regardless of the data loss (which I think has already happened),
    I'd like to get the ceph service back to normal asap.

    Is there any way to deal with those stale pgs (such as recreating
    the OSDs they need, injecting existing OSDs into those pgs, or
    even killing those pgs)?

    And since I'm not experienced, I may need more concrete advice
    (i.e. an approach with ceph commands). Many thanks for your help.

    Best Regards,

    FaHui

    Robert LeBlanc wrote on 2015/4/23 at 10:53 PM:

      A full CRUSH dump would be helpful, as well as
        knowing which OSDs you took out. If you didn't take 17 out as
        well as 15, then you might be OK. If the OSDs still show up in
        your CRUSH map, then try removing them from it with
        'ceph osd crush rm osd.15'.

        If you took out both OSDs, you will need to use some of the
          recovery tools. I believe the procedure is roughly, mount the
          drive in another box, extract the PGs needed, then shut down
          the primary OSD for that PG, inject the PG into the OSD, then
          start it up and it should replicate. I haven't done it myself
          (probably something I should do in case I ever run into the
          problem).
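
The procedure above can be sketched as a dry-run plan in Python that only prints the commands. It is a rough sketch under several assumptions: that the object store tool is installed as ceph_objectstore_tool (the name differs between releases), that it accepts --data-path, --journal-path, --op export/import, --pgid and --file, and that the old disk still holds the PG. All paths, the file name, and the helper name are hypothetical.

```python
def pg_export_import_plan(pgid, old_osd_path, target_osd_path, dump_file,
                          tool="ceph_objectstore_tool"):
    """Return (description, argv) steps for moving one PG between OSD stores."""
    export_cmd = [tool,
                  "--data-path", old_osd_path,
                  "--journal-path", old_osd_path + "/journal",
                  "--op", "export", "--pgid", pgid, "--file", dump_file]
    import_cmd = [tool,
                  "--data-path", target_osd_path,
                  "--journal-path", target_osd_path + "/journal",
                  "--op", "import", "--file", dump_file]
    return [
        ("export the PG from the old OSD disk mounted on another box", export_cmd),
        ("with the target OSD stopped, import the PG, then start the OSD", import_cmd),
    ]


# Hypothetical paths: the old osd.15 disk mounted at /mnt/old-osd15, and
# osd.27 as the target because 'ceph pg map 17.a2' now lists it first.
plan = pg_export_import_plan("17.a2", "/mnt/old-osd15",
                             "/var/lib/ceph/osd/ceph-27", "/tmp/17.a2.export")
for description, argv in plan:
    print("# " + description)
    print(" ".join(argv))
```

Check the tool's own help output for the installed version before running anything like this against a real OSD.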

        On Thu, Apr 23, 2015 at 2:00 AM, FaHui
          Lin <fahui.lin@xxxxxxxxxx>
          wrote:

             Dear Ceph experts,

              I'm a very new Ceph user. I made the blunder of removing
              some OSDs (and all files in the related directories)
              before Ceph finished rebalancing data and migrating pgs.

              Data loss aside, I have run into the following problems:

              1) There are always stale pgs showing in ceph status (with
              a health warning). Take one of the stale pgs, 17.a2:

                  # ceph -v
                  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)

                  # ceph -s
                      cluster 3f81b47e-fb15-4fbb-9fee-0b1986dfd7ea
                       health HEALTH_WARN 203 pgs degraded; 366 pgs stale; 203 pgs stuck degraded; 366 pgs stuck stale; 203 pgs stuck unclean; 203 pgs stuck undersized; 203 pgs undersized; 154 requests are blocked > 32 sec; recovery 153738/18991802 objects degraded (0.809%)
                       monmap e1: 1 mons at {...=...:6789/0}, election epoch 1, quorum 0 tw-ceph01
                       osdmap e3697: 12 osds: 12 up, 12 in
                        pgmap v21296531: 1156 pgs, 18 pools, 36929 GB data, 9273 kobjects
                              72068 GB used, 409 TB / 480 TB avail
                              153738/18991802 objects degraded (0.809%)
                                   163 stale+active+clean
                                   786 active+clean
                                   203 stale+active+undersized+degraded
                                     4 active+clean+scrubbing+deep

                  # ceph pg dump_stuck stale | grep 17.a2
                  17.a2  0  0  0  0  0  0  0  0  stale+active+clean  2015-04-20 09:16:11.624952  0'0  2718:200  [15,17]  15  [15,17]  15  0'0  2015-04-15 10:42:37.880699  0'0  2015-04-15 10:42:37.880699

                  # ceph pg repair 17.a2
                  Error EAGAIN: pg 17.a2 primary osd.15 not up

                  # ceph pg scrub 17.a2
                  Error EAGAIN: pg 17.a2 primary osd.15 not up

                  # ceph pg map 17.a2
                  osdmap e3695 pg 17.a2 (17.a2) -> up [27,3] acting [27,3]

              where osd.15 had already been removed. The pg now seems to map to
              the existing OSDs ([27, 3]).

              Can this pg eventually recover by moving to the
              existing OSDs? If not, what can I do about this kind of
              stale pg?
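
As a quick sanity check on the new mapping, here is a small Python sketch that tests whether the up set [27,3] places its two replicas on different hosts, which is what the 'step chooseleaf firstn 0 type host' rule asks for. The host membership is copied from the CRUSH map posted in the follow-up message; the helper names are made up for illustration. Note that this only shows where the pg is supposed to live, not whether any copy of its data still exists.

```python
# Host membership copied from the CRUSH map posted in the follow-up message
HOSTS = {
    "XX-ceph01": [0, 2, 3, 4],
    "XX-ceph02": [20, 21, 22, 23, 24, 25, 26, 27],
}


def host_of(osd_id):
    """Return the host bucket holding this OSD, or None if it is in no bucket."""
    for host, osds in HOSTS.items():
        if osd_id in osds:
            return host
    return None  # for example, an OSD that was removed


def spans_distinct_hosts(osd_ids):
    """True when every OSD is in a host bucket and no two share a host."""
    hosts = [host_of(osd) for osd in osd_ids]
    return None not in hosts and len(set(hosts)) == len(hosts)


new_up_set = [27, 3]      # from 'ceph pg map 17.a2'
old_acting_set = [15, 17]  # from 'ceph pg dump_stuck stale'
print(spans_distinct_hosts(new_up_set))
print(spans_distinct_hosts(old_acting_set))
```

The first check passes because osd.27 and osd.3 sit on different hosts; the second fails because osd.15 and osd.17 no longer appear in any host bucket.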

              2) I tried to solve the problem above by re-creating the
              OSDs, but failed, because I cannot create an OSD
              with the same ID as one I removed, say osd.15 (or change
              the ID of an existing OSD).

              Is there any way to change the ID of an OSD? (By the way,
              I'm surprised that this issue can hardly be found on the
              internet.)

              3) I tried another thing: dumping the crushmap and removing
              everything (in both the devices and buckets sections)
              related to the OSDs I removed. However, after I set the
              crushmap and dumped it out again, I found that those OSDs'
              lines still appear in the devices section (though not in the
              buckets section), such as:

              # devices
                  device 0 osd.0
                  device 2 osd.2
                  device 3 osd.3
                  device 4 osd.4
                  device 5 device5
                  ...
                  device 14 device14
                  device 15 device15

              Is there any way to remove them? Does it matter when I
              want to add new OSDs?
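
One way to read those leftover lines is sketched below in Python: it splits the devices section into IDs that carry a real osd.N name and IDs that only carry a 'deviceN' name. The interpretation that 'deviceN' is a placeholder the decompiler prints for an ID with no OSD behind it is my assumption, not something confirmed in this thread, and the helper name is made up. Only the lines actually shown above are used; the elided ones are left out.

```python
import re

# The device lines shown in the message above (elided lines are not guessed)
DEVICES_SECTION = """\
device 0 osd.0
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 device5
device 14 device14
device 15 device15
"""


def split_devices(text):
    """Return (named_ids, placeholder_ids) from 'device <id> <name>' lines."""
    named, placeholders = [], []
    for line in text.splitlines():
        match = re.match(r"\s*device\s+(\d+)\s+(\S+)\s*$", line)
        if not match:
            continue  # skip comments and anything that is not a device line
        dev_id, name = int(match.group(1)), match.group(2)
        if name == "device%d" % dev_id:
            placeholders.append(dev_id)
        else:
            named.append(dev_id)
    return named, placeholders


named_ids, placeholder_ids = split_devices(DEVICES_SECTION)
print("named:", named_ids)
print("placeholders:", placeholder_ids)
```

If that reading is right, the placeholder IDs line up with the removed OSDs and simply mark gaps in the ID range.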

              Please inform me if you have any comments. Thank you.

              Best Regards,

              FaHui

            _______________________________________________

            ceph-users mailing list

            ceph-users@xxxxxxxxxxxxxx

            http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
