Re: which mds server is damaged?

Sage Weil <sweil@xxxxxxxxxx> · Thu, 14 Sep 2017 21:37:37 +0000 (UTC)

On Thu, 14 Sep 2017, Two Spirit wrote:
> >The thing that's damaged is a logical mds rank (0), not a physical MDS
> >daemon.  What this is telling you is that there is some serious
> >corruption in your metadata pool that prevents that particular rank
> >from starting.
> 
> Zhang helped identify the bug and put it in the tracker. I didn't fully
> understand the problem, but it is related to mds replay not happening and
> the write_pos being off. He fixed it; after a full scrub was done, the
> degraded file system came back online. After a couple more hours of
> stress testing, the file system went back to degraded (earlier today).

Can you tell us more than "it went degraded"?  It's hard to know what 
you're seeing.

More generally, can you share what you did with the system that originally 
triggered the unfound object?  It ordinarily requires a sequence of 
multiple not-quite-concurrent failures to induce that state, and we don't 
see it much.  I'm surprised you're hitting it right off the bat.

Thanks!
sage