Re: About single monitor recovery

Sage Weil <sage@xxxxxxxxxxx> · Mon, 5 Aug 2013 09:28:32 -0700 (PDT)

On Mon, 5 Aug 2013, Yu Changyuan wrote:
> The good news is that with the new patch, ceph starts OK, cephfs mounts OK,
> and a kvm virtual machine using rbd boots OK (and seems to run fine). I
> checked the timestamp of the last file written to cephfs, and it is fairly
> close to the time of the reboot (which caused ceph to stop working). Since I
> don't have any other way to check the integrity of the files stored in
> cephfs, I just randomly picked some video files and played them; all seem OK.
> 
> So, thank you very much.
> 
> However, I am not using the latest version of the files in
> /var/lib/ceph/mon/ceph-a. With those files, ceph-mon starts up OK and
> ceph -s returns, but the osds still think the monitor is the wrong node and
> refuse to work.
> So I tried the files from 2 days ago (Aug 1st) to see what would happen,
> and something did happen: ceph-osd started to work.
> So I am a bit curious why the patched version works with the ceph-mon data
> from 2 days ago but the original version does not.
> More importantly, do I need any extra steps to make the currently running
> ceph cluster work with a normal (unpatched) version of ceph,
> and is there any chance that the current cluster will run into problems in
> the future (if I keep the current state and take no extra steps)?

I think you will be fine with the current state and switching back to 
normal release code.

I'm confused why ceph-osds wouldn't start with the latest mon data, but 
can't speculate too much without spending time analyzing your logs from 
the failed startup. 

Glad to hear you're back online!
sage
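
For anyone following this thread: the quoted messages below rely on a dated copy of the mon data directory, and on sending a tarball of it for analysis. A minimal sketch of taking such a tarball follows. It assumes ceph-mon is stopped first so the store is not changing during the copy. The function name backup_mon_dir and the destination directory are hypothetical and not part of ceph.

```shell
#!/bin/sh
# Hypothetical helper: archive a monitor data directory into a dated tarball.
# Assumption: ceph-mon is stopped before running this.
backup_mon_dir() {
    src="$1"    # mon data directory, e.g. /var/lib/ceph/mon/ceph-a
    dest="$2"   # directory that will hold the tarball (hypothetical)
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    out="$dest/$(basename "$src")-$(date +%Y-%m-%d).tar.gz"
    # -C keeps the paths inside the archive relative to the parent directory
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Print the path of the tarball that was written
    echo "$out"
}
```

It could be called as `backup_mon_dir /var/lib/ceph/mon/ceph-a /root/mon-backups`, where the second path is only an example. As noted in the quoted discussion, restoring an older copy is not generally expected to work well, so a tarball like this is mainly useful for analysis or as a last resort.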

> 
> 
> 
> On Mon, Aug 5, 2013 at 12:39 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>       On Sun, 4 Aug 2013, Yu Changyuan wrote:
> > And here is the log of ceph-mon, with debug_mon set to 10. I ran the
> > "ceph -s" command (which blocked) on 192.168.1.2 while recording this log.
> >
> > https://gist.github.com/yuchangyuan/ba3e72452215221d1e82
> 
> I pushed one more patch to that branch that should get you up.  This one
> should go to master as well.
> 
> sage
> 
> >
> >
> > On Sun, Aug 4, 2013 at 3:25 PM, Yu Changyuan <reivzy@xxxxxxxxx> wrote:
> >       I just tried the branch, and the mon starts OK. Here is the log:
> >       https://gist.github.com/yuchangyuan/3138952ac60508d18aed
> >       But ceph -s or ceph -w just blocks without returning any message (I
> >       only started the monitor, no mds or osd).
> >
> >
> >
> > On Sun, Aug 4, 2013 at 12:23 PM, Yu Changyuan <reivzy@xxxxxxxxx>
> > wrote:
> >
> >       On Sun, Aug 4, 2013 at 12:16 PM, Sage Weil
> >       <sage@xxxxxxxxxxx> wrote:
> >             It looks like the auth state wasn't trimmed properly.  It also
> >             sort of looks like you aren't using authentication on this
> >             cluster... is that true?  (The keyring file was empty.)
> >
> > Yes, you're right, I disabled auth. It's just a personal
> > cluster, so the simpler the better.
> >
> >       This looks like a trim issue, but I don't remember what all we fixed
> >       since .1.. that was a while ago!  We certainly haven't seen anything
> >       like this recently.
> >
> >       I pushed a branch wip-mon-skip-auth-cuttlefish that skips the missing
> >       incrementals and will get your mon up, but you may lose some auth keys.
> >       If auth is on, you'll need to add them back again.  If not, it may just
> >       work with this.
> >
> >       You can grab the packages from
> >
> >       http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-mon-skip-auth-cuttlefish
> >
> >       or whatever the right dir is for your distro when they appear in
> >       about 15 minutes.  Let me know if that resolves it.
> >
> >  
> > Thank you for your work, I will try as soon as possible.
> > PS: My distro is Gentoo, so maybe I should build from source
> > directly.
> >  
> >
> >       sage
> >
> >
> >       On Sun, 4 Aug 2013, Yu Changyuan wrote:
> >
> >       >
> >       >
> >       >
> >       > On Sun, Aug 4, 2013 at 12:13 AM, Sage Weil
> >       <sage@xxxxxxxxxxx> wrote:
> >       >       On Sat, 3 Aug 2013, Yu Changyuan wrote:
> >       >       > I run a tiny ceph cluster with only one monitor. After a
> >       >       > reboot of the system, the monitor refuses to start.
> >       >       > I tried to start ceph-mon manually with the command
> >       >       > 'ceph -f -i a'; below are the first few lines of the output:
> >       >       >
> >       >       > starting mon.a rank 0 at 192.168.1.10:6789/0 mon_data
> >       >       > /var/lib/ceph/mon/ceph-a fsid 554bee60-9602-4017-a6e1-ceb6907a218c
> >       >       > mon/AuthMonitor.cc: In function 'virtual void
> >       >       > AuthMonitor::update_from_paxos()' thread 7f9e3b0db780 time
> >       >       > 2013-08-03 20:27:29.208156
> >       >       > mon/AuthMonitor.cc: 147: FAILED assert(ret == 0)
> >       >       >
> >       >       > The full log is at:
> >       >       > https://gist.github.com/yuchangyuan/0a0a56a14fa4649ec2c8
> >       >
> >       > This is 0.61.1.  Can you try again with 0.61.7 to rule out anything
> >       > there?
> >       >
> >       >  
> >       > I just tried 0.61.7, still out of luck. Here is the log:
> >       > https://gist.github.com/yuchangyuan/34743c0abf1bfd8ef243
> >       >
> >       >
> >       >       > So, is there any way to make the monitor work again?
> >       >       >
> >       >       > I have a backup of /var/lib/ceph/mon/ceph-a from 2013-08-01,
> >       >       > and successfully started the monitor with these files,
> >       >       > but rados and other commands do not work because the osds
> >       >       > keep saying the monitor is the wrong node (that's right,
> >       >       > it's actually the node from 2 days ago).
> >       >
> >       > In general that is not going to work well as the cluster does not
> >       > like to warp back in time.  If it does not start with .7 (I suspect
> >       > it won't), can you send us a tarball of the mon data directory so we
> >       > can see what is awry?
> >       >
> >       >  
> >       > OK, I will send the tarball of /var/lib/ceph/mon/ceph-a to you directly.
> >       >  
> >       >
> >       >       sage
> >       >
> >       >
> >       >
> >       >
> >       > --
> >       > Best regards,
> >       > Changyuan
> >       >
> >       >