Re: Destroyed Ceph Cluster

Hello List,

The troubles with fixing this cluster continue... I now get output like this:

# ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; mds cluster is degraded; mds vvx-ceph-m-03 is laggy


When I check for ceph-mds processes, there are now none left, no matter which server I check. And they won't start up again!?
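For reference, the MDS state can be checked and a single MDS started in the foreground with more verbose logging roughly like this (the daemon ID is just a placeholder for one of the MDS hosts, and the debug level is only an example):

# ceph mds stat
# ceph mds dump
# ceph-mds -i vvx-ceph-m-03 -f --debug_mds 20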

The log starts up with:
2013-08-19 11:23:30.503214 7f7e9dfbd780 0 ceph version 0.67 (e3b7bc5bce8ab330ec1661381072368af3c218a0), process ceph-mds, pid 27636
2013-08-19 11:23:30.523314 7f7e9904b700  1 mds.-1.0 handle_mds_map standby
2013-08-19 11:23:30.529418 7f7e9904b700  1 mds.0.26 handle_mds_map i am now mds.0.26
2013-08-19 11:23:30.529423 7f7e9904b700  1 mds.0.26 handle_mds_map state change up:standby --> up:replay
2013-08-19 11:23:30.529426 7f7e9904b700  1 mds.0.26 replay_start
2013-08-19 11:23:30.529434 7f7e9904b700  1 mds.0.26  recovery set is
2013-08-19 11:23:30.529436 7f7e9904b700  1 mds.0.26 need osdmap epoch 277, have 276
2013-08-19 11:23:30.529438 7f7e9904b700  1 mds.0.26 waiting for osdmap 277 (which blacklists prior instance)
2013-08-19 11:23:30.534090 7f7e9904b700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
2013-08-19 11:23:30.535483 7f7e9904b700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f7e9904b700 time 2013-08-19 11:23:30.534107
mds/SessionMap.cc: 83: FAILED assert(0 == "failed to load sessionmap")
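The assert suggests the MDS could not read its sessionmap object from the metadata pool. Whether that object still exists can be checked with rados (this assumes the default pool name "metadata" and mds rank 0, i.e. an object name like mds0_sessionmap):

# rados -p metadata ls | grep -i sessionmap
# rados -p metadata stat mds0_sessionmap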


Does anyone have an idea how to get the cluster running again?





Georg




On 16.08.2013 16:23, Mark Nelson wrote:
Hi Georg,

I'm not an expert on the monitors, but that's probably where I would
start.  Take a look at your monitor logs and see if you can get a sense
for why one of your monitors is down.  Some of the other devs will
probably be around later who might know if there are any known issues
with recreating the OSDs and missing PGs.

Mark

On 08/16/2013 08:21 AM, Georg Höllrigl wrote:
Hello,

I'm still evaluating Ceph - now with a test cluster running the 0.67 Dumpling release.
I created the setup with ceph-deploy from Git.
I recreated a bunch of OSDs to give them a different journal.
There was already some test data on these OSDs.
I've already recreated the missing PGs with "ceph pg force_create_pg".
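"ceph pg force_create_pg" takes a single PG ID, so recreating the whole set of stuck PGs looks roughly like this (collecting the IDs via dump_stuck is just one way to do it):

# ceph pg dump_stuck inactive | awk '/^[0-9]+\./ {print $1}' > stuck_pgs.txt
# while read pg; do ceph pg force_create_pg "$pg"; done < stuck_pgs.txt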


HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 5 requests
are blocked > 32 sec; mds cluster is degraded; 1 mons down, quorum
0,1,2 vvx-ceph-m-01,vvx-ceph-m-02,vvx-ceph-m-03

Any idea how to fix the cluster, besides completely rebuilding the
cluster from scratch? What if such a thing happens in a production
environment...

The PGs from "ceph pg dump" have all been showing "creating" for some time now:

2.3d    0       0       0       0       0       0       0       creating        2013-08-16 13:43:08.186537      0'0     0:0     []      []      0'0     0.000000        0'0     0.000000

Is there a way to just dump the data, that was on the discarded OSDs?
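If any objects are still readable from the surviving OSDs, they could at least be copied out one by one with the rados tool (pool name and target directory here are placeholders):

# rados -p <pool> ls > objects.txt
# mkdir -p dump
# while read obj; do rados -p <pool> get "$obj" "dump/$obj"; done < objects.txt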




Kind Regards,
Georg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




