On Wed, 11 Aug 2010, Christian Brunner wrote:
> 2010/8/10 Sage Weil <sage@xxxxxxxxxxxx>:
> > On Tue, 10 Aug 2010, Christian Brunner wrote:
> >
> >> After a bit more debugging I've found out that there seems to be a file
> >> missing from the filestore:
> >>
> >> 10.08.10_18:14:07.862190 7f568d3e5710 filestore(/ceph/osd/osd02)
> >> getattr /ceph/osd/osd02/current/3.f2_head/rb.0.1d6.00000000000e_head
> >> '_' = -2
> >
> > There was a bug last week in the kernel client rbd branch that was
> > improperly encoding osd write operation payloads. Can you check that
> > your rbd client is running 79c49720, which fixes it?
>
> I'm not using the kernel client. The problem started after a crash
> of the cosd. If you want me to, I can try to analyse the coredump.

Just getting a backtrace from the core would be helpful.  And in the
do_osd_op frame, 'p osd_op.data._len' and 'p bp.off' (a rough sketch of
such a session is at the end of this mail).

> > That error above was probably just because that object hadn't been
> > written yet, and isn't a fatal error.  There is a 'scrub' function that
> > verifies that most of the osd metadata is in order and that replication
> > is accurate: 'ceph osd scrub <osdnum>', then watch 'ceph -w' to see the
> > success or error messages go by for each pg (or tail $mon_data/log on
> > any monitor).
>
> I was not able to start the osd, so the scrub didn't run either. Every
> time I tried to start it, it died after 3 seconds with the message
> "terminate called after throwing an instance of
> 'ceph::buffer::end_of_buffer*'".
>
> The only thing that worked was to set up a whole new filesystem.

From the log it looks like that was from some client resending the
(malformed?) request.  I think killing the clients should have worked
as well in this case.

Thanks!
sage
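
For reference, a rough gdb session along the lines described above could
look like the following. The binary and core file paths are placeholders;
use wherever your cosd binary and the core dump actually ended up, and pick
the do_osd_op frame number from the backtrace output:

    $ gdb /usr/bin/cosd core.12345        # example paths only
    (gdb) thread apply all bt             # backtraces for every thread
    (gdb) frame <n>                       # select the do_osd_op frame from the bt
    (gdb) p osd_op.data._len              # length of the op's data payload
    (gdb) p bp.off                        # offset of the buffer iterator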