Hi,
I'm not sure what happened, but on a Ceph cluster I noticed that the
monitors (running 0.61) started filling up the disks, so they were
restarted with:
mon compact on start = true
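For reference, that option goes in the [mon] section of ceph.conf, roughly like this (a minimal sketch; the section placement is the conventional one):

```
[mon]
    # Compact the monitor's leveldb store on startup to reclaim disk space
    mon compact on start = true
```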
After a restart the osdmap was empty, it showed:
osdmap e2: 0 osds: 0 up, 0 in
pgmap v624077: 15296 pgs: 15296 stale+active+clean; 78104 MB data,
243 GB used, 66789 GB / 67032 GB avail
mdsmap e1: 0/0/1 up
This cluster has 36 OSDs over 9 hosts, but suddenly that was all gone.
I also checked the crushmap, all 36 OSDs were removed, no trace of them.
"ceph auth list" still showed their keys though.
Restarting the OSDs didn't help: create-or-move complained that the
OSDs didn't exist and did nothing. I ran "ceph osd create" to get the
36 OSD ids allocated again, but after booting the OSDs never become
active.
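To illustrate, the recreation attempt was roughly the following (a sketch, not a verified recipe; the host name and weight are placeholders, and the exact CRUSH syntax may differ between versions):

```shell
# Allocate a fresh OSD id; ids are handed out sequentially,
# so this is repeated once per missing OSD
ceph osd create

# Re-insert the OSD into the CRUSH map under its host
# (weight 1.0 and host=node01 are placeholder values)
ceph osd crush create-or-move 0 1.0 root=default host=node01

# Restart the OSD daemon on its node
service ceph start osd.0
```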
The only thing they log is:
2013-06-26 01:00:08.852410 7f17f3f16700 0 -- 0.0.0.0:6801/4767 >>
10.23.24.53:6801/1758 pipe(0x1025fc80 sd=116 :40516 s=1 pgs=0 cs=0
l=0).fault with nothing to send, going to standby
I'm currently behind a 3G connection, so I can't skim through the logs
with debugging at very high levels, but I'm wondering what this could
be.
The monitors filling up most likely triggered the problem, but right
now I'm looking for a way to get the OSDs back up again.
In the meantime I upgraded all the nodes to 0.61.4, but that didn't
change anything.
Any ideas on what this might be and how to resolve it?
--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com