On Fri, May 22, 2015 at 11:47 AM, John Spray <john.spray@xxxxxxxxxx> wrote:
>
> On 22/05/2015 15:33, Adam Tygart wrote:
>>
>> Hello all,
>>
>> The ceph-mds servers in our cluster are performing a constant
>> boot->replay->crash in our systems.
>>
>> I have enabled debug logging for the mds for a restart cycle on one of
>> the nodes[1].
>
> You found a bug, or more correctly you probably found multiple bugs...
>
> It looks like your journal contains an EOpen event that lists 5307092 open
> files. Because the MDS only drops its lock between events, not during
> processing a single one, this is causing the heartbeat map to think the
> MDS has locked up, so it's getting killed.
>
> So firstly we have to fix this to have appropriate calls into
> MDS::heartbeat_reset while iterating over lists of unbounded length in
> EMetaBlob::replay. That would fix the false death of the MDS resulting
> from the heartbeat expiry.
>
> Secondly, this EOpen was a 2.6GB log event. Something has almost certainly
> gone wrong when we see that data structure grow so large, so we should
> really be imposing some artificial cap there and catching the situation
> earlier, rather than journalling this monster event and only hitting
> issues during replay.
>
> Thirdly, something is apparently leading the MDS to think that 5 million
> files were open in this particular log segment. It seems like an
> improbable situation given that I can only see a single client in action
> here. More investigation is needed to see how this happened. Can you
> describe the client workload that was going on in the run-up to the
> system breaking?

We are in the process of rsyncing 60+ million files from our old
fileservers into the new CephFS stores. We are down to the last user, and
it seems he has at least 25% of the total files. We were copying files at
a rate of about 2 million a day before this crash.

>
> Anyway, actions:
>
> 1. I'm assuming your metadata is not sensitive, as you have shared this
> debug log. Please could you use "cephfs-journal-tool journal export
> ~/journal.bin" to grab an offline copy of the raw journal, in case we
> need to look at it later (this might take a while since your journal
> seems so large, but the resulting file should compress reasonably well
> with "tar cSzf").

Problematic: cephfs-journal-tool is giving me a stack trace:
http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt

> 2. Optimistically, you may be able to get out of this situation by
> modifying the mds_beacon_grace config option on the MDS (set it to
> something high). This will cause the MDS to continue sending beacons to
> the mons, even when a thread is failing to yield promptly (as in this
> case), thereby preventing the mons from regarding the MDS as failed. This
> will hopefully buy the MDS enough time to complete replay and come up,
> assuming it doesn't run out of memory in the process of dealing with
> whatever strangeness is in the journal.

I can hold off on this for a bit if you have something else you'd like me
to save while it is in its broken state.

Thanks for the detailed writeup.
--
Adam

> 3. If your MDS eventually makes it through recovery, unmount your client
> and use "ceph daemon mds.<id> flush journal" to flush and trim the
> journal: this should result in a situation where the next time the MDS
> starts, the oversized journal entries are no longer present and startup
> should go smoothly.
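
One question on the heartbeat_reset fix you describe, just so I follow
what is happening to us: am I right that the change amounts to resetting
the heartbeat every so many entries while replaying a single huge event?
Something along the lines of the standalone sketch below -- placeholder
names and a made-up interval on my part, not actual Ceph code:

// Standalone illustration only -- placeholder names, not actual Ceph code.
// The idea: periodically reset the heartbeat while replaying one huge
// journal event, so the heartbeat map doesn't decide the daemon is hung.
#include <cstddef>
#include <cstdio>
#include <vector>

struct OpenFileEntry { long ino; };  // stand-in for one replayed open file

static void heartbeat_reset() {
    // In the real MDS this would touch the heartbeat map; here it just logs.
    std::printf("heartbeat reset\n");
}

static void replay_open_files(const std::vector<OpenFileEntry>& entries) {
    const std::size_t kResetEvery = 100000;  // arbitrary interval, my guess
    std::size_t processed = 0;
    for (const OpenFileEntry& entry : entries) {
        (void)entry;  // ... the actual per-entry replay work would go here ...
        if (++processed % kResetEvery == 0) {
            // Without something like this, a single multi-million-entry
            // event replays with no heartbeat at all and the MDS gets killed.
            heartbeat_reset();
        }
    }
}

int main() {
    // Sized to match the EOpen event from our journal.
    std::vector<OpenFileEntry> entries(5307092);
    replay_open_files(entries);
    return 0;
}

If that is roughly the shape of it, the 5307092-entry EOpen in our journal
would explain why the daemon never gets a chance to check in during replay.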
>
> Cheers,
> John

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com