On 22/05/2015 15:33, Adam Tygart wrote:
> Hello all,
> The ceph-mds servers in our cluster are performing a constant
> boot->replay->crash in our systems.
> I have enabled debug logging for the mds for a restart cycle on one of
> the nodes[1].
You found a bug, or more correctly you probably found multiple bugs...
It looks like your journal contains an EOpen event that lists 5307092
open files. Because the MDS only drops its lock between events, not
while processing a single one, this is causing the heartbeat map to
think the MDS has locked up, so it's getting killed.
So firstly, we need to fix the MDS to make appropriate calls to
MDS::heartbeat_reset while iterating over lists of unbounded length in
EMetaBlob::replay. That would prevent the false death of the MDS
resulting from heartbeat expiry.
Secondly, this EOpen was a 2.6GB log event. Something has almost
certainly gone wrong for that data structure to grow so large, so we
should really be imposing some artificial cap there and catching the
situation earlier, rather than journalling this monster event and only
hitting issues during replay.
Thirdly, something is apparently leading the MDS to think that 5 million
files were open in this particular log segment. That seems improbable,
given that I can only see a single client in action here, so more
investigation is needed to work out how it happened. Can you describe
the client workload that was going on in the run-up to the system
breaking?
Anyway, actions:
1. I'm assuming your metadata is not sensitive, as you have shared this
debug log. Please could you use "cephfs-journal-tool journal export
~/journal.bin" to grab an offline copy of the raw journal, in case we
need to look at it later (this might take a while since your journal
seems so large, but the resulting file should compress reasonably well
with "tar cSzf").
2. Optimistically, you may be able to get out of this situation by
modifying the mds_beacon_grace config option on the MDS (set it to
something high). This will cause the MDS to continue sending beacons to
the mons, even when a thread is failing to yield promptly (as in this
case), thereby preventing the mons from regarding the MDS as failed.
This hopefully will buy the MDS enough time to complete replay and come
up, assuming it doesn't run out of memory in the process of dealing with
whatever strangeness is in the journal.
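One way to do that is in ceph.conf on the MDS host (600 is just an
arbitrarily high value; the default is 15 seconds):

    [mds]
        mds beacon grace = 600

The MDS will pick the new value up on its next start.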
3. If your MDS eventually makes it through recovery, unmount your client
and use "ceph daemon mds.<id> flush journal" to flush and trim the
journal: that way, the next time the MDS starts, the oversized journal
entries will no longer be present and startup should go smoothly.
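For example (the mount point is just a placeholder):

    umount /mnt/cephfs                    # on the client
    ceph daemon mds.<id> flush journal    # on the MDS host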
Cheers,
John