Accidentally Remove OSDs

Dear Ceph experts,

I'm a very new Ceph user. I made a blunder: I removed some OSDs (and deleted all files in the related directories) before Ceph had finished rebalancing data and migrating PGs.

Setting aside the data loss, I am now facing the following problems:

1) There are always stale PGs showing in ceph status (with a health warning). Take one of the stale PGs, 17.a2, as an example:
# ceph -v
ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)

# ceph -s
    cluster 3f81b47e-fb15-4fbb-9fee-0b1986dfd7ea
     health HEALTH_WARN 203 pgs degraded; 366 pgs stale; 203 pgs stuck degraded; 366 pgs stuck stale; 203 pgs stuck unclean; 203 pgs stuck undersized; 203 pgs undersized; 154 requests are blocked > 32 sec; recovery 153738/18991802 objects degraded (0.809%)
     monmap e1: 1 mons at {...=...:6789/0}, election epoch 1, quorum 0 tw-ceph01
     osdmap e3697: 12 osds: 12 up, 12 in
      pgmap v21296531: 1156 pgs, 18 pools, 36929 GB data, 9273 kobjects
            72068 GB used, 409 TB / 480 TB avail
            153738/18991802 objects degraded (0.809%)
                 163 stale+active+clean
                 786 active+clean
                 203 stale+active+undersized+degraded
                   4 active+clean+scrubbing+deep


# ceph pg dump_stuck stale | grep 17.a2
17.a2   0       0       0       0       0       0       0       0       stale+active+clean      2015-04-20 09:16:11.624952     0'0     2718:200        [15,17] 15      [15,17] 15      0'0     2015-04-15 10:42:37.880699    0'0      2015-04-15 10:42:37.880699

# ceph pg repair 17.a2
Error EAGAIN: pg 17.a2 primary osd.15 not up

# ceph pg scrub 17.a2
Error EAGAIN: pg 17.a2 primary osd.15 not up

# ceph pg map 17.a2
osdmap e3695 pg 17.a2 (17.a2) -> up [27,3] acting [27,3]

where osd.15 had already been removed. The PG seems to map to existing OSDs ([27,3]).
Can this PG eventually be recovered by remapping to the existing OSDs? If not, what can I do about this kind of stale PG?
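From what I have read in the troubleshooting docs, one possible way forward (please correct me if this is unsafe) is to tell the cluster that the removed OSD is gone for good and then force the PG to be recreated, empty, on the OSDs it currently maps to. A rough sketch of what I have in mind, with osd.15 standing in for each removed OSD, and assuming osd.15 still exists in the osdmap (I am not sure it does after my removal):

# ceph osd lost 15 --yes-i-really-mean-it
# ceph pg force_create_pg 17.a2

I understand force_create_pg would discard whatever data was in the PG, which I have already accepted as lost.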

2) I tried to solve the problem above by re-creating the OSDs, but failed: I cannot create an OSD with the same ID as the one I removed, say osd.15 (nor change the ID of an existing OSD).
Is there any way to change the ID of an OSD? (By the way, I'm surprised how little can be found about this issue on the internet.)
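In case it matters, my understanding is that ceph osd create always hands out the lowest free ID, so an ID like 15 should only become reusable once every trace of the old OSD is gone (the CRUSH entry, the auth key, and the osdmap entry). The removal sequence I believe is needed before the ID frees up (please correct me if I missed a step):

# ceph osd crush remove osd.15
# ceph auth del osd.15
# ceph osd rm 15
# ceph osd create          (should now return 15 if it is the lowest free ID)

Is this the right way to get an OSD back with its old ID, or is there a more direct way to choose the ID?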

3) I tried another thing: dumping the crushmap and removing everything (including the devices and buckets sections) related to the OSDs I removed. However, after I set the crushmap and dumped it out again, I found that the removed OSDs' lines still appear in the devices section (though not in the buckets section), such as:
# devices
device 0 osd.0
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 device5
...
device 14 device14
device 15 device15


Is there any way to remove them? Does it matter when I want to add new OSDs?
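For completeness, the edit/re-inject cycle I used was roughly the following (so you can tell me if I went wrong somewhere); my guess is that the leftover "device N deviceN" lines are just placeholders that crushtool emits for unused IDs, but I would like confirmation:

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt
(edit crushmap.txt: remove the device and bucket entries for the deleted OSDs)
# crushtool -c crushmap.txt -o crushmap.new
# ceph osd setcrushmap -i crushmap.new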

Please let me know if you have any comments. Thank you.

Best Regards,
FaHui

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
