On Wed, 11 Aug 2010, Christian Brunner wrote:
> 2010/8/10 Sage Weil <sage@xxxxxxxxxxxx>:
> > On Tue, 10 Aug 2010, Christian Brunner wrote:
> >
> >> After a bit more debugging I've found out that there seems to be a file
> >> missing from the filestore:
> >>
> >> 10.08.10_18:14:07.862190 7f568d3e5710 filestore(/ceph/osd/osd02)
> >> getattr /ceph/osd/osd02/current/3.f2_head/rb.0.1d6.00000000000e_head
> >> '_' = -2
> >
> > There was a bug last week in the kernel client rbd branch that was
> > improperly encoding osd write operation payloads. Can you check that
> > your rbd client is running 79c49720, which fixes it?
>
> I'm not using the kernel client. The problem started after a crash
> of the cosd. If you want me to, I can try to analyse the coredump.

Just getting a backtrace from the core would be helpful.  And in the
do_osd_op frame, 'p osd_op.data._len' and 'p bp.off' (a rough sketch of
such a session is at the end of this mail).

> > That error above was probably just because that object hadn't been
> > written yet, and isn't a fatal error.  There is a 'scrub' function that
> > verifies that most of the osd metadata is in order and that replication
> > is accurate: 'ceph osd scrub <osdnum>', then watch 'ceph -w' to see the
> > success or error messages go by for each pg (or tail $mon_data/log on
> > any monitor).
>
> I was not able to start the osd, so the scrub didn't run either. Every
> time I tried to start it, it died after 3 seconds with the message
> "terminate called after throwing an instance of
> 'ceph::buffer::end_of_buffer*'".
>
> The only thing that worked was to set up a whole new filesystem.

From the log it looks like that was from some client resending the
(malformed?) request.  I think killing the clients should have worked
as well in this case.

Thanks!
sage
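
For reference, a rough gdb session along the lines described above could
look like the following. The binary and core file paths are placeholders;
use wherever your cosd binary and the core dump actually ended up, and pick
the do_osd_op frame number from the backtrace output:

    $ gdb /usr/bin/cosd core.12345        # example paths only
    (gdb) thread apply all bt             # backtraces for every thread
    (gdb) frame <n>                       # select the do_osd_op frame from the bt
    (gdb) p osd_op.data._len              # length of the op's data payload
    (gdb) p bp.off                        # offset of the buffer iterator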