On 22/05/2015 15:33, Adam Tygart wrote:
> Hello all,
> The ceph-mds servers in our cluster are performing a constant
> boot->replay->crash in our systems.
> I have enabled debug logging for the mds for a restart cycle on one of
> the nodes[1].
You found a bug, or more correctly you probably found multiple bugs...
It looks like your journal contains an EOpen event that lists 5307092
open files. Because the MDS only drops its lock between events, not
while processing a single one, this is causing the heartbeat map to
think the MDS has locked up, so it's getting killed.
So firstly, we need to fix the MDS to make appropriate calls to
MDS::heartbeat_reset while iterating over lists of unbounded length in
EMetaBlob::replay. That would prevent the false death of the MDS
resulting from heartbeat expiry.
Secondly, this EOpen was a 2.6GB log event. Something has almost
certainly gone wrong for that data structure to grow so large, so we
should really be imposing some artificial cap there and catching the
situation earlier, rather than journalling this monster event and only
hitting issues during replay.
Thirdly, something is apparently leading the MDS to think that 5 million
files were open in this particular log segment. That seems improbable,
given that I can only see a single client in action here, so more
investigation is needed to work out how it happened. Can you describe
the client workload that was going on in the run-up to the system
breaking?
Anyway, actions:
1. I'm assuming your metadata is not sensitive, as you have shared this
debug log. Please could you use "cephfs-journal-tool journal export
~/journal.bin" to grab an offline copy of the raw journal, in case we
need to look at it later (this might take a while since your journal
seems so large, but the resulting file should compress reasonably well
with "tar cSzf").
2. Optimistically, you may be able to get out of this situation by
modifying the mds_beacon_grace config option on the MDS (set it to
something high). This will cause the MDS to continue sending beacons to
the mons, even when a thread is failing to yield promptly (as in this
case), thereby preventing the mons from regarding the MDS as failed.
This hopefully will buy the MDS enough time to complete replay and come
up, assuming it doesn't run out of memory in the process of dealing with
whatever strangeness is in the journal.
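One way to do that is in ceph.conf on the MDS host (600 is just an
arbitrarily high value; the default is 15 seconds):

    [mds]
        mds beacon grace = 600

The MDS will pick the new value up on its next start.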
3. If your MDS eventually makes it through recovery, unmount your client
and use "ceph daemon mds.<id> flush journal" to flush and trim the
journal: that way, the next time the MDS starts, the oversized journal
entries will no longer be present and startup should go smoothly.
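For example (the mount point is just a placeholder):

    umount /mnt/cephfs                    # on the client
    ceph daemon mds.<id> flush journal    # on the MDS host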
Cheers,
John