On Wed, Oct 21, 2015 at 5:33 PM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Wed, Oct 21, 2015 at 10:33 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>>> John, I know you've got
>>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>>> supposed to be for this, but I'm not sure if you spotted any issues
>>> with it or if we need to do some more diagnosing?
>>
>> That test path is just verifying that we do handle dirs without dying
>> in at least one case -- it passes with the existing ceph code, so it's
>> not reproducing this issue.
>
> Clicked send too soon, I was about to add...
>
> Milosz mentioned that they don't have the data from the system in the
> broken state, so I don't have any bright ideas about learning more
> about what went wrong here, unfortunately.

Sorry about that; I wasn't thinking at the time and just wanted to get
this up and going as quickly as possible :( If this happens again I'll
be more careful to preserve more evidence.

I think multi-fs support in the same RADOS namespace would actually have
helped here, since it would make it easier to create a newfs and leave
the other one around for investigation.

But it makes me wonder whether the broken dir scenario could be
replicated by hand using rados calls (a rough sketch follows below the
message). There's a pretty generic ticket for "don't die on dir errors",
but I imagine the code can be audited and steps to cause a synthetic
error can be worked out.

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
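
A rough sketch of what "replicating by hand using rados calls" might look
like, using the librados Python bindings: the idea is to hand-corrupt (or
delete) a CephFS dirfrag object in the metadata pool, so the MDS hits a
synthetic error the next time it loads that directory. This is only a
sketch under assumptions, not a tested reproducer: the pool name
('metadata'), the dirfrag object name ('10000000000.00000000'), and the
dentry omap key ('somefile_head') are placeholders to be replaced with
values from the cluster under test.

    # Hypothetical sketch: corrupt one dentry in a CephFS dirfrag object
    # so the MDS sees a synthetic "bad dir" on the next directory fetch.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # CephFS metadata pool -- 'metadata' is an assumed name.
        ioctx = cluster.open_ioctx('metadata')
        try:
            # Each dirfrag is a single RADOS object whose dentries live in
            # its omap. Overwrite one dentry's value with garbage so the
            # MDS fails to decode it when loading the directory.
            with rados.WriteOpCtx() as op:
                ioctx.set_omap(op, ('somefile_head',), (b'\xde\xad\xbe\xef',))
                ioctx.operate_write_op(op, '10000000000.00000000')

            # Alternatively, delete the dirfrag object outright:
            # ioctx.remove_object('10000000000.00000000')
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

Either variant should give a directory that the MDS cannot load cleanly,
which is the kind of on-disk state the "don't die on dir errors" handling
needs to survive.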