Hi Cephers,

I have hit an issue with Ceph 0.72.2: some OSDs are down & out and cannot rejoin the cluster by restarting the osd daemon. Below is what I captured from the log of one of the down & out OSDs; all of them show the same error:

2014-10-22 16:49:18.834850 7fc2d3fd9700 0 filestore(/var/lib/ceph/osd/ceph-28) error (39) Directory not empty not handled on operation 21 (72744853.0.2, or op 2, counting from 0)
2014-10-22 16:49:18.834897 7fc2d3fd9700 0 filestore(/var/lib/ceph/osd/ceph-28) ENOTEMPTY suggests garbage data in osd data dir
2014-10-22 16:49:18.834899 7fc2d3fd9700 0 filestore(/var/lib/ceph/osd/ceph-28) transaction dump:
{ "ops": [
      { "op_num": 0,
        "op_name": "remove",
        "collection": "meta",
        "oid": "bfa3b7d0\/pglog_3.194a\/0\/\/-1"},
      { "op_num": 1,
        "op_name": "omap_rmkeys",
        "collection": "meta",
        "oid": "16ef7597\/infos\/head\/\/-1"},
      { "op_num": 2,
        "op_name": "rmcoll",
        "collection": "3.194a_head"}]}
2014-10-22 16:49:18.951947 7fc2d3fd9700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7fc2d3fd9700 time 2014-10-22 16:49:18.859039
os/FileStore.cc: 2448: FAILED assert(0 == "unexpected error")

Two days before that, I had added some new OSDs to the cluster and increased the PG number. I manually re-added the lost OSDs one by one, following the add-new-OSD process, and recovery took a long time. Now all OSDs are up and in, and 'ceph health detail' shows the following:

HEALTH_WARN 14 pgs backfilling; 4 pgs degraded; 2 pgs incomplete; 2 pgs stuck inactive; 16 pgs stuck unclean; 200 requests are blocked > 32 sec; 2 osds have slow requests; recovery 13269/10624283 objects degraded (0.125%)
pg 3.1019 is stuck inactive since forever, current state incomplete, last acting [44,39,35]
pg 3.13c4 is stuck inactive since forever, current state incomplete, last acting [49,66,16]
pg 4.df8 is stuck unclean for 719511.538414, current state active+degraded+remapped+backfilling, last acting [62,37]
pg 3.1f3e is stuck unclean for 722343.421598, current state active+remapped+backfilling, last acting [62,33,35,27]
pg 3.1019 is stuck unclean since forever, current state incomplete, last acting [44,39,35]
pg 3.1852 is stuck unclean for 725266.838670, current state active+remapped+backfilling, last acting [5,59,37,35]
pg 3.13c4 is stuck unclean since forever, current state incomplete, last acting [49,66,16]
pg 4.1848 is stuck unclean for 828659.231379, current state active+degraded+remapped+backfilling, last acting [12,37,62]
pg 3.160b is stuck unclean for 726568.972467, current state active+remapped+backfilling, last acting [58,33,5,57]
pg 4.252 is stuck unclean for 757159.325059, current state active+degraded+remapped+backfilling, last acting [30,17,60]
pg 3.9e3 is stuck unclean for 722359.550585, current state active+degraded+remapped+backfilling, last acting [0,17,30]
pg 4.175d is stuck unclean for 784690.950635, current state active+remapped+backfilling, last acting [34,57,16,60]
pg 3.16e is stuck unclean for 319022.053579, current state active+remapped+backfilling, last acting [7,11,58,63]
pg 3.1446 is stuck unclean for 723439.399481, current state active+remapped+backfilling, last acting [7,11,59,27]
pg 4.1130 is stuck unclean for 829487.905981, current state active+remapped+backfilling, last acting [24,30,32,59]
pg 4.1aff is stuck unclean for 329617.786292, current state active+remapped+backfilling, last acting [7,17,58,16]
pg 3.1404 is stuck unclean for 726186.450522, current state active+remapped+backfilling, last acting [32,33,60,12]
pg 4.1403 is stuck unclean for 780839.164866, current state active+remapped+backfilling, last acting [32,33,60,12]
pg 3.1f3e is active+remapped+backfilling, acting [62,33,35,27]
pg 4.1aff is active+remapped+backfilling, acting [7,17,58,16]
pg 3.1852 is active+remapped+backfilling, acting [5,59,37,35]
pg 4.1848 is active+degraded+remapped+backfilling, acting [12,37,62]
pg 4.175d is active+remapped+backfilling, acting [34,57,16,60]
pg 3.160b is active+remapped+backfilling, acting [58,33,5,57]
pg 3.1446 is active+remapped+backfilling, acting [7,11,59,27]
pg 4.1403 is active+remapped+backfilling, acting [32,33,60,12]
pg 3.1404 is active+remapped+backfilling, acting [32,33,60,12]
pg 3.13c4 is incomplete, acting [49,66,16]
pg 4.1130 is active+remapped+backfilling, acting [24,30,32,59]
pg 3.1019 is incomplete, acting [44,39,35]
pg 4.df8 is active+degraded+remapped+backfilling, acting [62,37]
pg 3.9e3 is active+degraded+remapped+backfilling, acting [0,17,30]
pg 4.252 is active+degraded+remapped+backfilling, acting [30,17,60]
pg 3.16e is active+remapped+backfilling, acting [7,11,58,63]
200 ops are blocked > 134218 sec
100 ops are blocked > 134218 sec on osd.44
100 ops are blocked > 134218 sec on osd.49
2 osds have slow requests
recovery 13269/10624283 objects degraded (0.125%)

There are 2 PGs in the incomplete state, and 'ceph pg # query' gets no response for them. I have checked them on each related OSD and found that the 3.13c4_head and 3.1019_head directories are empty, so I think that data is lost.
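For reference, this is roughly how I checked (the paths assume the default FileStore data dir /var/lib/ceph/osd/ceph-<id>, the same layout shown in the log above):

# query the two incomplete PGs -- these just hang with no response:
ceph pg 3.13c4 query
ceph pg 3.1019 query

# look at the PG directories on the acting OSDs, e.g. osd.49 for 3.13c4
# and osd.44 for 3.1019 -- both _head directories are empty:
ls -la /var/lib/ceph/osd/ceph-49/current/3.13c4_head/
ls -la /var/lib/ceph/osd/ceph-44/current/3.1019_head/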
I have tried to mark osd.44 & osd.49 as lost, but it had no effect. BTW, some PGs were in the stale state; I resolved those by manually creating the pg_head directories on the related OSDs.
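For completeness, what I ran was along these lines (the mkdir path again assumes the default FileStore layout; <id> and <pgid> stand for whichever OSD and stale PG were affected):

# mark the two OSDs as lost (this did not seem to change anything):
ceph osd lost 44 --yes-i-really-mean-it
ceph osd lost 49 --yes-i-really-mean-it

# for a stale PG, recreate the missing PG directory by hand on the related OSD:
mkdir /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head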
There were also some unfound objects that I could not resolve with 'ceph pg # mark_unfound_lost revert|delete'. Fortunately, those objects were in a pool I am not using any more, so I have already removed that pool. So how should I deal with the blocked ops in '100 ops are blocked > 134218 sec on osd.44; 100 ops are blocked > 134218 sec on osd.49'?
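I assume I can at least inspect those blocked requests through the OSD admin socket (default socket path assumed), something like:

# list the in-flight (blocked) ops on osd.44 and osd.49:
ceph --admin-daemon /var/run/ceph/ceph-osd.44.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.49.asok dump_ops_in_flight

but I am not sure what to do with them from there.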
My OpenStack VMs are using Ceph block storage, and they are hung now. I want to save as many live VMs as possible.

Best Regards!
Meng Chen