Re: Bug #1047 reproduced

Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx> · Thu, 22 Dec 2011 16:27:14 -0800



On Wed, Dec 21, 2011 at 8:36 AM, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
> On Wednesday 21 December 2011 wrote Gregory Farnum:
>> On Wed, Dec 21, 2011 at 4:37 AM, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
>> > On Friday 02 December 2011 wrote Sage Weil:
>> >> On Fri, 2 Dec 2011, Amon Ott wrote:
>> >> > On Thursday 01 December 2011 you wrote:
>> >> > > On all four nodes of my test cluster, MDS crashes with a trace like
>> >> > > that in bug #1047. Example and ceph.conf attached. Ceph server side
>> >> > > is from git master, last commit
>> >> > > ce6572273943ffdca4b7dc5344152d6c35106a2d.
>> >> > >
>> >> > > MDS does not start on any node here, it reliably crashes with that
>> >> > > assert.
>> >> >
>> >> > Does it make sense for you to keep the cluster in that broken state,
>> >> > so that we can reproduce that bug or test a potential fix? Otherwise,
>> >> > I would recreate the Ceph filesystem to make more tests. I also have a
>> >> > full log of one mds from start to crash here.
>> >>
>> >> Can you attach the log to #1047 for posterity?  I'll take a quick look
>> >> and see if there is any further info to gain from the log.  I'm guessing
>> >> the actual bug occurred before the crash, when the anchor table wasn't
>> >> updated properly, but there may be clues here.
>> >
>> > Did you find some time to look into this? The bug makes Ceph unusable for
>> > us even with moderate load. All mds instances die with the same assert,
>> > the only way to recover in that state is to recreate the complete ceph fs
>> > and restore backups.
>>
>> Sage is gone on vacation right now (unless he decides not to be for a
>> while), but we've been focusing our efforts on the OSDs lately so I
>> don't think he's looked at it. I'll see if I can carve out some time
>> tomorrow or Friday, but I can't promise anything.
>>
>> Alexandre, can you check this bug and make sure it looks like the same
>> one you reported as #1850?
>
> Thank you for looking into it. The behaviour in #1850 looks quite similar to
> our bug, apart from the hardlinks. We copy many files here in our tests, too.
> Last time I hit the bug I had really restarted the master mds.

Unfortunately there's not enough info in this log either. If you can
reproduce it with "mds debug = 20" and put that log somewhere, it
ought to be enough to work out what's going on, though. Sorry. :(
-Greg
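
A minimal sketch of the logging change requested above, assuming the
option is written "debug mds" in ceph.conf (the usual spelling) and
that it goes in the [mds] section. Nothing else in this fragment comes
from the thread.

```ini
[mds]
    ; verbose MDS logging while reproducing the assert (assumed option spelling)
    debug mds = 20
```

With this in place on the MDS nodes, the daemons would need a restart
before the crash is reproduced, and the resulting MDS log is the one to
publish.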