Hi Cephers,

I have hit an issue with Ceph 0.72.2: some OSDs are down & out and cannot rejoin the cluster by restarting the osd daemon. Below is what I captured from the log of one of the down & out OSDs; all of them show the same error:

2014-10-22 16:49:18.834850 7fc2d3fd9700 0 filestore(/var/lib/ceph/osd/ceph-28) error (39) Directory not empty not handled on operation 21 (72744853.0.2, or op 2, counting from 0)
2014-10-22 16:49:18.834897 7fc2d3fd9700 0 filestore(/var/lib/ceph/osd/ceph-28) ENOTEMPTY suggests garbage data in osd data dir
2014-10-22 16:49:18.834899 7fc2d3fd9700 0 filestore(/var/lib/ceph/osd/ceph-28) transaction dump:
{ "ops": [
      { "op_num": 0,
        "op_name": "remove",
        "collection": "meta",
        "oid": "bfa3b7d0\/pglog_3.194a\/0\/\/-1"},
      { "op_num": 1,
        "op_name": "omap_rmkeys",
        "collection": "meta",
        "oid": "16ef7597\/infos\/head\/\/-1"},
      { "op_num": 2,
        "op_name": "rmcoll",
        "collection": "3.194a_head"}]}
2014-10-22 16:49:18.951947 7fc2d3fd9700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7fc2d3fd9700 time 2014-10-22 16:49:18.859039
os/FileStore.cc: 2448: FAILED assert(0 == "unexpected error")

Two days before that, I had added some new OSDs to the cluster and increased the PG number. I manually re-added the lost OSDs one by one, following the add-new-OSD process, and recovery took a long time. Now all OSDs are up and in, and 'ceph health detail' shows the following:

HEALTH_WARN 14 pgs backfilling; 4 pgs degraded; 2 pgs incomplete; 2 pgs stuck inactive; 16 pgs stuck unclean; 200 requests are blocked > 32 sec; 2 osds have slow requests; recovery 13269/10624283 objects degraded (0.125%)
pg 3.1019 is stuck inactive since forever, current state incomplete, last acting [44,39,35]
pg 3.13c4 is stuck inactive since forever, current state incomplete, last acting [49,66,16]
pg 4.df8 is stuck unclean for 719511.538414, current state active+degraded+remapped+backfilling, last acting [62,37]
pg 3.1f3e is stuck unclean for 722343.421598, current state active+remapped+backfilling, last acting [62,33,35,27]
pg 3.1019 is stuck unclean since forever, current state incomplete, last acting [44,39,35]
pg 3.1852 is stuck unclean for 725266.838670, current state active+remapped+backfilling, last acting [5,59,37,35]
pg 3.13c4 is stuck unclean since forever, current state incomplete, last acting [49,66,16]
pg 4.1848 is stuck unclean for 828659.231379, current state active+degraded+remapped+backfilling, last acting [12,37,62]
pg 3.160b is stuck unclean for 726568.972467, current state active+remapped+backfilling, last acting [58,33,5,57]
pg 4.252 is stuck unclean for 757159.325059, current state active+degraded+remapped+backfilling, last acting [30,17,60]
pg 3.9e3 is stuck unclean for 722359.550585, current state active+degraded+remapped+backfilling, last acting [0,17,30]
pg 4.175d is stuck unclean for 784690.950635, current state active+remapped+backfilling, last acting [34,57,16,60]
pg 3.16e is stuck unclean for 319022.053579, current state active+remapped+backfilling, last acting [7,11,58,63]
pg 3.1446 is stuck unclean for 723439.399481, current state active+remapped+backfilling, last acting [7,11,59,27]
pg 4.1130 is stuck unclean for 829487.905981, current state active+remapped+backfilling, last acting [24,30,32,59]
pg 4.1aff is stuck unclean for 329617.786292, current state active+remapped+backfilling, last acting [7,17,58,16]
pg 3.1404 is stuck unclean for 726186.450522, current state active+remapped+backfilling, last acting [32,33,60,12]
pg 4.1403 is stuck unclean for 780839.164866, current state active+remapped+backfilling, last acting [32,33,60,12]
pg 3.1f3e is active+remapped+backfilling, acting [62,33,35,27]
pg 4.1aff is active+remapped+backfilling, acting [7,17,58,16]
pg 3.1852 is active+remapped+backfilling, acting [5,59,37,35]
pg 4.1848 is active+degraded+remapped+backfilling, acting [12,37,62]
pg 4.175d is active+remapped+backfilling, acting [34,57,16,60]
pg 3.160b is active+remapped+backfilling, acting [58,33,5,57]
pg 3.1446 is active+remapped+backfilling, acting [7,11,59,27]
pg 4.1403 is active+remapped+backfilling, acting [32,33,60,12]
pg 3.1404 is active+remapped+backfilling, acting [32,33,60,12]
pg 3.13c4 is incomplete, acting [49,66,16]
pg 4.1130 is active+remapped+backfilling, acting [24,30,32,59]
pg 3.1019 is incomplete, acting [44,39,35]
pg 4.df8 is active+degraded+remapped+backfilling, acting [62,37]
pg 3.9e3 is active+degraded+remapped+backfilling, acting [0,17,30]
pg 4.252 is active+degraded+remapped+backfilling, acting [30,17,60]
pg 3.16e is active+remapped+backfilling, acting [7,11,58,63]
200 ops are blocked > 134218 sec
100 ops are blocked > 134218 sec on osd.44
100 ops are blocked > 134218 sec on osd.49
2 osds have slow requests
recovery 13269/10624283 objects degraded (0.125%)

There are 2 PGs in the incomplete state, and 'ceph pg # query' gets no response for them. I have checked them on each related OSD and found that the 3.13c4_head and 3.1019_head directories are empty, so I think that data is lost.
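For reference, this is roughly how I checked (the paths assume the default FileStore data dir /var/lib/ceph/osd/ceph-<id>, the same layout shown in the log above):

# query the two incomplete PGs -- these just hang with no response:
ceph pg 3.13c4 query
ceph pg 3.1019 query

# look at the PG directories on the acting OSDs, e.g. osd.49 for 3.13c4
# and osd.44 for 3.1019 -- both _head directories are empty:
ls -la /var/lib/ceph/osd/ceph-49/current/3.13c4_head/
ls -la /var/lib/ceph/osd/ceph-44/current/3.1019_head/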
I have tried to mark osd.44 & osd.49 as lost, but it had no effect. BTW, some PGs were in the stale state; I resolved those by manually creating the pg_head directories on the related OSDs.
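For completeness, what I ran was along these lines (the mkdir path again assumes the default FileStore layout; <id> and <pgid> stand for whichever OSD and stale PG were affected):

# mark the two OSDs as lost (this did not seem to change anything):
ceph osd lost 44 --yes-i-really-mean-it
ceph osd lost 49 --yes-i-really-mean-it

# for a stale PG, recreate the missing PG directory by hand on the related OSD:
mkdir /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head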
There were also some unfound objects that I could not resolve with 'ceph pg # mark_unfound_lost revert|delete'. Fortunately, those objects were in a pool I am not using any more, so I have already removed that pool. So how should I deal with the blocked ops in '100 ops are blocked > 134218 sec on osd.44; 100 ops are blocked > 134218 sec on osd.49'?
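I assume I can at least inspect those blocked requests through the OSD admin socket (default socket path assumed), something like:

# list the in-flight (blocked) ops on osd.44 and osd.49:
ceph --admin-daemon /var/run/ceph/ceph-osd.44.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.49.asok dump_ops_in_flight

but I am not sure what to do with them from there.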
My OpenStack VMs are using Ceph block storage, and they are hung now. I want to save as many live VMs as possible.

Best Regards!
Meng Chen