Re: Destroyed Ceph Cluster

Hello List,

The troubles with fixing this cluster continue... I now get output like this:

# ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; mds cluster is degraded; mds vvx-ceph-m-03 is laggy


When I check for ceph-mds processes, there are now none left, no matter which server I check. And they won't start up again!?
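For reference, the MDS state can be checked and a single MDS started in the foreground with more verbose logging roughly like this (the daemon ID is just a placeholder for one of the MDS hosts, and the debug level is only an example):

# ceph mds stat
# ceph mds dump
# ceph-mds -i vvx-ceph-m-03 -f --debug_mds 20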

The log starts up with:
2013-08-19 11:23:30.503214 7f7e9dfbd780 0 ceph version 0.67 (e3b7bc5bce8ab330ec1661381072368af3c218a0), process ceph-mds, pid 27636
2013-08-19 11:23:30.523314 7f7e9904b700  1 mds.-1.0 handle_mds_map standby
2013-08-19 11:23:30.529418 7f7e9904b700  1 mds.0.26 handle_mds_map i am now mds.0.26
2013-08-19 11:23:30.529423 7f7e9904b700  1 mds.0.26 handle_mds_map state change up:standby --> up:replay
2013-08-19 11:23:30.529426 7f7e9904b700  1 mds.0.26 replay_start
2013-08-19 11:23:30.529434 7f7e9904b700  1 mds.0.26  recovery set is
2013-08-19 11:23:30.529436 7f7e9904b700  1 mds.0.26 need osdmap epoch 277, have 276
2013-08-19 11:23:30.529438 7f7e9904b700  1 mds.0.26 waiting for osdmap 277 (which blacklists prior instance)
2013-08-19 11:23:30.534090 7f7e9904b700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
2013-08-19 11:23:30.535483 7f7e9904b700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f7e9904b700 time 2013-08-19 11:23:30.534107
mds/SessionMap.cc: 83: FAILED assert(0 == "failed to load sessionmap")
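The assert suggests the MDS could not read its sessionmap object from the metadata pool. Whether that object still exists can be checked with rados (this assumes the default pool name "metadata" and mds rank 0, i.e. an object name like mds0_sessionmap):

# rados -p metadata ls | grep -i sessionmap
# rados -p metadata stat mds0_sessionmap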


Does anyone have an idea how to get the cluster running again?





Georg




On 16.08.2013 16:23, Mark Nelson wrote:
Hi Georg,

I'm not an expert on the monitors, but that's probably where I would
start.  Take a look at your monitor logs and see if you can get a sense
for why one of your monitors is down.  Some of the other devs will
probably be around later who might know if there are any known issues
with recreating the OSDs and missing PGs.

Mark

On 08/16/2013 08:21 AM, Georg Höllrigl wrote:
Hello,

I'm still evaluating Ceph - now with a test cluster running the 0.67 Dumpling release.
I created the setup with ceph-deploy from Git.
I recreated a bunch of OSDs to give them a different journal.
There was already some test data on these OSDs.
I've already recreated the missing PGs with "ceph pg force_create_pg".
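"ceph pg force_create_pg" takes a single PG ID, so recreating the whole set of stuck PGs looks roughly like this (collecting the IDs via dump_stuck is just one way to do it):

# ceph pg dump_stuck inactive | awk '/^[0-9]+\./ {print $1}' > stuck_pgs.txt
# while read pg; do ceph pg force_create_pg "$pg"; done < stuck_pgs.txt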


HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 5 requests
are blocked > 32 sec; mds cluster is degraded; 1 mons down, quorum
0,1,2 vvx-ceph-m-01,vvx-ceph-m-02,vvx-ceph-m-03

Any idea how to fix the cluster, besides completely rebuilding the
cluster from scratch? What if such a thing happens in a production
environment...

The PGs from "ceph pg dump" have all been showing "creating" for some time now:

2.3d    0       0       0       0       0       0       0       creating        2013-08-16 13:43:08.186537      0'0     0:0     []      []      0'0     0.000000        0'0     0.000000

Is there a way to just dump the data, that was on the discarded OSDs?
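If any objects are still readable from the surviving OSDs, they could at least be copied out one by one with the rados tool (pool name and target directory here are placeholders):

# rados -p <pool> ls > objects.txt
# mkdir -p dump
# while read obj; do rados -p <pool> get "$obj" "dump/$obj"; done < objects.txt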




Kind Regards,
Georg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




