Hello,

we have had some trouble with OSDs running full, even after rebalancing. With the disks at 100% usage and the ceph-osd daemons no longer starting, we decided to delete some PG directories, after which rebalancing finished. However, since then one PG is not becoming clean anymore.

So far we have tried:

a) stop, then stop+out osd.7 -> after rebalancing the pg is still stuck

b) mark objects lost:

   root@wein:~# ceph pg 3.14 mark_unfound_lost revert
   pg has no unfound objects

c) stop osd.7, rsync the directory 3.14_head over from osd.2, start osd.7

d) ceph pg scrub 3.14

So far the status is still that this PG is down. I have included some of the relevant lines / logs below. I would be grateful for any hints on how to repair this situation.

Cheers,

Nico

p.s.: Using ceph 0.80.7.

Action causing the problem:

root@wein:/var/lib/ceph/osd/ceph-7/current# ls
0.12_head  0.a_head   2.1c_head  3.2a_head  3.4c_head  3.6b_TEMP  3.8b_head  3.97_TEMP  3.c7_TEMP
0.14_head  1.10_head  2.26_head  3.32_head  3.4c_TEMP  3.6c_head  3.8d_head  3.9b_head  3.c_head
0.21_head  1.1a_head  2.2a_head  3.32_TEMP  3.56_head  3.6c_TEMP  3.8d_TEMP  3.9b_TEMP  3.d_head
0.23_head  1.21_head  2.2e_head  3.37_head  3.56_TEMP  3.6_head   3.8e_head  3.a9_head  3.d_TEMP
0.2b_head  1.2b_head  2.2f_head  3.37_TEMP  3.5b_head  3.7b_head  3.8_head   3.a9_TEMP  3.f_head
0.2d_head  1.2c_head  2.33_head  3.47_head  3.5b_TEMP  3.7b_TEMP  3.91_head  3.ab_TEMP  3.f_TEMP
0.2e_head  1.32_head  2.3f_head  3.47_TEMP  3.60_head  3.80_head  3.91_TEMP  3.b2_TEMP  commit_op_seq
0.2_head   1.37_head  2.b_head   3.49_head  3.61_head  3.81_head  3.93_head  3.b7_TEMP  meta
0.38_head  1.3c_head  3.0_head   3.49_TEMP  3.61_TEMP  3.82_head  3.93_TEMP  3.bf_head  nosnap
0.3b_head  1.e_head   3.12_head  3.4a_head  3.67_head  3.82_TEMP  3.94_head  3.bf_TEMP  omap
0.3e_head  2.10_head  3.14_head  3.4a_TEMP  3.67_TEMP  3.89_head  3.94_TEMP  3.b_head
0.7_head   2.15_head  3.14_TEMP  3.4b_head  3.6b_head  3.89_TEMP  3.97_head  3.b_TEMP

root@wein:/var/lib/ceph/osd/ceph-7/current# du -sh 3.14_*
3.9G    3.14_head
4.0K    3.14_TEMP

The current status:

root@kaffee:~# ceph -s
    cluster e0611730-09ff-4f3c-bfdb-2dd415274a36
     health HEALTH_WARN 1 pgs down; 1 pgs peering; 1 pgs stuck inactive; 1 pgs stuck unclean; 5 requests are blocked > 32 sec
     monmap e3: 3 mons at {kaffee=192.168.40.1:6789/0,tee=192.168.40.2:6789/0,wein=192.168.40.3:6789/0}, election epoch 3652, quorum 0,1,2 kaffee,tee,wein
     osdmap e1129: 8 osds: 7 up, 7 in
      pgmap v435448: 448 pgs, 4 pools, 976 GB data, 248 kobjects
            1938 GB used, 9913 GB / 11852 GB avail
                 447 active+clean
                   1 down+peering

root@wein:/var/lib/ceph/osd/ceph-7/current# ceph health detail
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 5 requests are blocked > 32 sec; 1 osds have slow requests
pg 3.14 is stuck inactive for 135697.438689, current state incomplete, last acting [2,7]
pg 3.14 is stuck unclean for 135697.438702, current state incomplete, last acting [2,7]
pg 3.14 is incomplete, acting [2,7]
5 ops are blocked > 8388.61 sec
5 ops are blocked > 8388.61 sec on osd.2
1 osds have slow requests

root@wein:~# ceph pg dump_stuck stale
ok

root@wein:~# ceph pg dump_stuck unclean
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.14 1006 0 0 0 4135415824 3001 3001 incomplete 2014-12-19 14:40:00.272775 589'27399 1150:66317 [2,7] 2 [2,7] 2 503'24268 2014-12-13 19:17:39.272720 503'24268 2014-12-13 19:17:38.672258

root@wein:~# ceph pg dump_stuck inactive
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.14 1006 0 0 0 4135415824 3001 3001 incomplete 2014-12-19 14:40:00.272775 589'27399 1150:66317 [2,7] 2 [2,7] 2 503'24268 2014-12-13 19:17:39.272720 503'24268 2014-12-13 19:17:38.672258

root@wein:~#
root@wein:~# ceph osd tree
# id    weight  type name      up/down reweight
-1      2.3     root default
-2      0.2999          host wein
0       0.04999                 osd.0   up      1
3       0.04999                 osd.3   up      1
4       0.04999                 osd.4   up      1
5       0.04999                 osd.5   up      1
6       0.04999                 osd.6   up      1
7       0.04999                 osd.7   up      1
-3      1               host tee
1       5.5                     osd.1   up      1
-4      1               host kaffee
2       5.5                     osd.2   up      1
root@wein:~#

Fixes we tried:

root@wein:~# ceph pg 3.14 mark_unfound_lost revert
pg has no unfound objects

root@kaffee:~# rsync -av /var/lib/ceph/osd/ceph-2/current/3.14_head/ root@xxxxxxxxxxxxxxxx:/var/lib/ceph/osd/ceph-7/current/3.14_head/
+ stop & restart osd.7 around it

root@wein:~# ceph pg deep-scrub 3.14
instructing pg 3.14 on osd.2 to deep-scrub
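For completeness, the "stop & restart osd.7 around it" was roughly the following sequence (shown with the sysvinit-style service command; the exact invocation depends on the init system in use, and the target hostname is elided as above):

root@wein:~# service ceph stop osd.7     # stop osd.7 before touching its data dir
root@kaffee:~# rsync -av /var/lib/ceph/osd/ceph-2/current/3.14_head/ root@xxxxxxxxxxxxxxxx:/var/lib/ceph/osd/ceph-7/current/3.14_head/
root@wein:~# service ceph start osd.7    # start it again so it can peer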
--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com