[ceph 0.72.2] PGs are in incomplete status after some OSDs are out of the cluster

Hi Cephers,

 

I am hitting an issue with ceph 0.72.2 where some OSDs are down & out and cannot rejoin the cluster by restarting the osd daemon.

Below is what I captured from the log of the down & out OSDs; all of them show the same error.

 

2014-10-22 16:49:18.834850 7fc2d3fd9700  0 filestore(/var/lib/ceph/osd/ceph-28)  error (39) Directory not empty not handled on operation 21 (72744853.0.2, or op 2, counting from 0)
2014-10-22 16:49:18.834897 7fc2d3fd9700  0 filestore(/var/lib/ceph/osd/ceph-28) ENOTEMPTY suggests garbage data in osd data dir
2014-10-22 16:49:18.834899 7fc2d3fd9700  0 filestore(/var/lib/ceph/osd/ceph-28)  transaction dump:
{ "ops": [
        { "op_num": 0,
          "op_name": "remove",
          "collection": "meta",
          "oid": "bfa3b7d0\/pglog_3.194a\/0\/\/-1"},
        { "op_num": 1,
          "op_name": "omap_rmkeys",
          "collection": "meta",
          "oid": "16ef7597\/infos\/head\/\/-1"},
        { "op_num": 2,
          "op_name": "rmcoll",
          "collection": "3.194a_head"}]}
2014-10-22 16:49:18.951947 7fc2d3fd9700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7fc2d3fd9700 time 2014-10-22 16:49:18.859039
os/FileStore.cc: 2448: FAILED assert(0 == "unexpected error")
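
From the transaction dump, it looks like the rmcoll of 3.194a_head fails because that directory still contains leftover files. Would it be safe to stop the OSD, move the leftover collection out of the data dir, and restart? Something like the following sketch (osd.28 and pg 3.194a are taken from the log above, and I am assuming a sysvinit-style init script; please correct me if this is a bad idea):

# stop the failing OSD first
service ceph stop osd.28
# move the leftover PG collection aside instead of deleting it outright
mv /var/lib/ceph/osd/ceph-28/current/3.194a_head /root/backup_3.194a_head
# try to start the OSD again and watch the log
service ceph start osd.28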

 

 

Two days before that, I had added some new OSDs to the cluster and increased the number of PGs.

I manually re-added the lost OSDs one by one, following the procedure for adding a new OSD; it took a long time to recover.

Now all OSDs are up and in, and ceph health detail shows the following:

 

HEALTH_WARN 14 pgs backfilling; 4 pgs degraded; 2 pgs incomplete; 2 pgs stuck inactive; 16 pgs stuck unclean; 200 requests are blocked > 32 sec; 2 osds have slow requests; recovery 13269/10624283 objects degraded (0.125%)

pg 3.1019 is stuck inactive since forever, current state incomplete, last acting [44,39,35]

pg 3.13c4 is stuck inactive since forever, current state incomplete, last acting [49,66,16]

pg 4.df8 is stuck unclean for 719511.538414, current state active+degraded+remapped+backfilling, last acting [62,37]

pg 3.1f3e is stuck unclean for 722343.421598, current state active+remapped+backfilling, last acting [62,33,35,27]

pg 3.1019 is stuck unclean since forever, current state incomplete, last acting [44,39,35]

pg 3.1852 is stuck unclean for 725266.838670, current state active+remapped+backfilling, last acting [5,59,37,35]

pg 3.13c4 is stuck unclean since forever, current state incomplete, last acting [49,66,16]

pg 4.1848 is stuck unclean for 828659.231379, current state active+degraded+remapped+backfilling, last acting [12,37,62]

pg 3.160b is stuck unclean for 726568.972467, current state active+remapped+backfilling, last acting [58,33,5,57]

pg 4.252 is stuck unclean for 757159.325059, current state active+degraded+remapped+backfilling, last acting [30,17,60]

pg 3.9e3 is stuck unclean for 722359.550585, current state active+degraded+remapped+backfilling, last acting [0,17,30]

pg 4.175d is stuck unclean for 784690.950635, current state active+remapped+backfilling, last acting [34,57,16,60]

pg 3.16e is stuck unclean for 319022.053579, current state active+remapped+backfilling, last acting [7,11,58,63]

pg 3.1446 is stuck unclean for 723439.399481, current state active+remapped+backfilling, last acting [7,11,59,27]

pg 4.1130 is stuck unclean for 829487.905981, current state active+remapped+backfilling, last acting [24,30,32,59]

pg 4.1aff is stuck unclean for 329617.786292, current state active+remapped+backfilling, last acting [7,17,58,16]

pg 3.1404 is stuck unclean for 726186.450522, current state active+remapped+backfilling, last acting [32,33,60,12]

pg 4.1403 is stuck unclean for 780839.164866, current state active+remapped+backfilling, last acting [32,33,60,12]

pg 3.1f3e is active+remapped+backfilling, acting [62,33,35,27]

pg 4.1aff is active+remapped+backfilling, acting [7,17,58,16]

pg 3.1852 is active+remapped+backfilling, acting [5,59,37,35]

pg 4.1848 is active+degraded+remapped+backfilling, acting [12,37,62]

pg 4.175d is active+remapped+backfilling, acting [34,57,16,60]

pg 3.160b is active+remapped+backfilling, acting [58,33,5,57]

pg 3.1446 is active+remapped+backfilling, acting [7,11,59,27]

pg 4.1403 is active+remapped+backfilling, acting [32,33,60,12]

pg 3.1404 is active+remapped+backfilling, acting [32,33,60,12]

pg 3.13c4 is incomplete, acting [49,66,16]

pg 4.1130 is active+remapped+backfilling, acting [24,30,32,59]

pg 3.1019 is incomplete, acting [44,39,35]

pg 4.df8 is active+degraded+remapped+backfilling, acting [62,37]

pg 3.9e3 is active+degraded+remapped+backfilling, acting [0,17,30]

pg 4.252 is active+degraded+remapped+backfilling, acting [30,17,60]

pg 3.16e is active+remapped+backfilling, acting [7,11,58,63]

200 ops are blocked > 134218 sec

100 ops are blocked > 134218 sec on osd.44

100 ops are blocked > 134218 sec on osd.49

2 osds have slow requests

recovery 13269/10624283 objects degraded (0.125%)

 

There are 2 PGs in incomplete status, and 'ceph pg <pgid> query' gets no response for them. I have checked them on each related OSD and found that the 3.13c4_head & 3.1019_head directories are empty, so I think that data is lost.
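
For reference, this is roughly what I checked (using pg 3.1019 and osd.44 as the example; the same applies to 3.13c4 and osd.49):

# the query never returns for the two incomplete PGs
ceph pg 3.1019 query
# the PG directory exists on the acting OSDs but is empty
ls -la /var/lib/ceph/osd/ceph-44/current/3.1019_head/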

I have tried marking osd.44 & osd.49 as lost, but it had no effect.
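
What I ran was roughly the following (please tell me if the syntax or the ordering is wrong):

ceph osd lost 44 --yes-i-really-mean-it
ceph osd lost 49 --yes-i-really-mean-it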

 

BTW, some PGs were in stale status; I resolved them by manually creating the pg_head directories on the related OSDs.
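
What I did for those was roughly this, on the OSDs that were supposed to hold the stale PG (the osd id and pg id here are placeholders, not the PGs listed above):

# recreate the missing PG directory by hand (this is what seemed to clear the stale state for me)
mkdir /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head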

There were also some unfound objects that I could not resolve with 'ceph pg <pgid> mark_unfound_lost revert|delete'.

Fortunately, those objects were in a pool I am no longer using, so I have already removed that pool.

 

So how should I deal with the blocked ops ('100 ops are blocked > 134218 sec on osd.44' and '100 ops are blocked > 134218 sec on osd.49')?

My OpenStack VMs are using ceph block storage and they are hung now. I want to save as many live VMs as possible.
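
Is it worth looking at what those blocked requests actually are on osd.44 and osd.49 through the admin socket, e.g. something like:

ceph --admin-daemon /var/run/ceph/ceph-osd.44.asok dump_ops_in_flight

(the socket path is the default one, adjust if yours differs), or is the only real way forward to get pg 3.1019 and pg 3.13c4 out of the incomplete state first?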

 

 

 

Best Regards!

Meng Chen

 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
