On August 4, 2011 20:14:54 Yehuda Sadeh Weinraub wrote:
> 2011/8/3 Székelyi Szabolcs <szekelyi@xxxxxxx>:
> > I'm running ceph 0.32, and for a while now it looks like if a monitor
> > fails, the cluster doesn't fail over to a new one.
> >
> > I have three nodes, two with cmds+cosd+cmon, and one with cmds+cmon,
> > which is also running the client. If I stop one of the cmds+cosd+cmon
> > nodes, ceph -w running on the cmds+cmon node reports nothing but
> >
> > 2011-08-03 11:10:47.291875 7f4f043d5700 -- <client_ip>:0/14633 >>
> > <killed_node_ip>:6789/0 pipe(0x1a7f9c0 sd=4 pgs=0 cs=0 l=0).fault first
> > fault
> >
> > indefinitely, and the filesystem stops working (processes using files
> > on it block forever). It looks like the client keeps trying to connect
> > to the killed monitor instead of failing over to a working one.
> >
> > The first message after killing the node was:
> >
> > 2011-08-03 10:57:40.687871 7f4f01563700 monclient: hunting for new mon
> >
> > Do you have any idea what I'm doing wrong?
>
> Do you have the mon logs, or any core files?

No, and I have nuked my ceph cluster since then, because I realized that
the monmap had somehow gotten corrupted. When I ran the client with the -m
option, it reported errors saying it was unable to connect to some strange
hostnames containing binary characters. I guess the monmap got broken
during the torture tests I put the cluster through.

I'll report back if I see anything like this with the fresh cluster.

Thanks,
--
cc
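
PS: In case it helps anyone hitting something similar, this is roughly how
I plan to sanity-check the monmap next time, before nuking anything. The
monitor address and output path below are just placeholder examples, not
from my actual setup:

    # Point the client at a monitor that is known to be up and see if it answers
    ceph -m 192.168.0.1:6789 -s

    # Pull the current monmap from the quorum and inspect its entries;
    # garbage hostnames/addresses here would explain the endless mon hunting
    ceph -m 192.168.0.1:6789 mon getmap -o /tmp/monmap
    monmaptool --print /tmp/monmap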