Fair enough. Anyway, is it safe to now increase the 'mds beacon grace'
to try and get the mds server functional again?

I realize there is nothing simple about the things that are being
accomplished here, and thank everyone for their hard work on making
this stuff work as well as it does.
--
Adam

On Fri, May 22, 2015 at 2:39 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Fri, May 22, 2015 at 12:34 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>> I believe I grabbed all of these files:
>>
>> for x in $(rados -p metadata ls | grep -E '^200\.'); do rados -p metadata get ${x} /tmp/metadata/${x}; done
>> tar czSf journal.tar.gz /tmp/metadata
>>
>> https://drive.google.com/file/d/0B4XF1RWjuGh5MVFqVFZfNmpfQWc/view?usp=sharing
>>
>> When this crash occurred, the rsync process should have been going
>> through all of this sequentially, one file at a time. There weren't
>> simultaneous rsyncs, and hadn't been for at least a couple of days. I
>> would have hoped that if the files hadn't been touched in days,
>> *something* would have forcibly closed them and flushed them out of
>> the caches. I also would have thought that a file close/flush in rsync
>> (which I am sure it does after finishing writing a file) would have
>> let them close in the cephfs session.
>
> This is trickier than it sounds with the kernel dcache, unfortunately.
> We improved it a bit just last week, but we'll have to try to diagnose
> what happened in this case further before we can say whether it was
> that issue or something else.
> -Greg
>
>>
>> --
>> Adam
>>
>> On Fri, May 22, 2015 at 2:06 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>> On Fri, May 22, 2015 at 11:34 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>>> On Fri, May 22, 2015 at 11:47 AM, John Spray <john.spray@xxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> On 22/05/2015 15:33, Adam Tygart wrote:
>>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> The ceph-mds servers in our cluster are stuck in a constant
>>>>>> boot->replay->crash cycle.
>>>>>>
>>>>>> I have enabled debug logging for the mds for a restart cycle on
>>>>>> one of the nodes[1].
>>>>>
>>>>>
>>>>> You found a bug, or more correctly you probably found multiple bugs...
>>>>>
>>>>> It looks like your journal contains an EOpen event that lists 5307092
>>>>> open files. Because the MDS only drops its lock between events, not
>>>>> while processing a single one, this is causing the heartbeat map to
>>>>> think the MDS has locked up, so it's getting killed.
>>>>>
>>>>> So firstly, we have to fix this by adding appropriate calls to
>>>>> MDS::heartbeat_reset while iterating over lists of unbounded length in
>>>>> EMetaBlob::replay. That would fix the false death of the MDS resulting
>>>>> from the heartbeat expiry.
>>>>>
>>>>> Secondly, this EOpen was a 2.6GB log event. Something has almost
>>>>> certainly gone wrong when we see that data structure grow so large, so
>>>>> we should really be imposing some artificial cap there and catching the
>>>>> situation earlier, rather than journaling this monster event and only
>>>>> hitting issues during replay.
>>>>>
>>>>> Thirdly, something is apparently leading the MDS to think that 5 million
>>>>> files were open in this particular log segment. That seems improbable,
>>>>> given that I can only see a single client in action here. More
>>>>> investigation is needed to see how this happened. Can you describe the
>>>>> client workload that was going on in the run-up to the system breaking?
>>>>
>>>> We are in the process of rsyncing 60+ million files from our old
>>>> fileservers into the new cephfs stores. We are down to the last user,
>>>> and it seems he has at least 25% of the total files. We were copying
>>>> files at a rate of about 2 million a day before this crash.
>>>>
>>>>>
>>>>> Anyway, actions:
>>>>>
>>>>> 1. I'm assuming your metadata is not sensitive, as you have shared this
>>>>> debug log. Could you please use "cephfs-journal-tool journal export
>>>>> ~/journal.bin" to grab an offline copy of the raw journal, in case we
>>>>> need to look at it later (this might take a while since your journal
>>>>> seems so large, but the resulting file should compress reasonably well
>>>>> with "tar cSzf").
>>>>
>>>> Problematic. cephfs-journal-tool is giving me a stacktrace:
>>>> http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt
>>>>
>>>>> 2. Optimistically, you may be able to get out of this situation by
>>>>> modifying the mds_beacon_grace config option on the MDS (set it to
>>>>> something high). This will cause the MDS to continue sending beacons to
>>>>> the mons, even when a thread is failing to yield promptly (as in this
>>>>> case), thereby preventing the mons from regarding the MDS as failed.
>>>>> This will hopefully buy the MDS enough time to complete replay and come
>>>>> up, assuming it doesn't run out of memory in the process of dealing with
>>>>> whatever strangeness is in the journal.
>>>>
>>>> I can hold off on this for a bit if you have something else you'd like
>>>> me to save while it is in its broken state.
>>>
>>>
>>> Ugh. We appear to be trying to allocate too much memory for this event
>>> in the journal dump; we'll need to fix this. :(
>>> In the meantime, it's a harder workflow, but if you can grab all the
>>> objects in the metadata pool whose names start with "200.", that's the
>>> raw data the journal dumper is trying to grab, and we can assemble it
>>> ourselves for checking later on.
>>>
>>> That said, you might actually have that many files still "open" in
>>> some fashion from your clients. If you can dump caches on them
>>> periodically, and/or limit the number of files that rsync is working
>>> on at once, that should help keep this from happening again.
>>> -Greg
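
As a footnote to the thread: a minimal sketch of how one might raise
mds_beacon_grace along the lines John suggests, so the MDS can finish
replaying the oversized journal without being marked failed. The MDS id
"mds01", the mon id "a", the 600-second value, and touching the mons at
all are illustrative assumptions, not something stated in the thread;
adjust for your own deployment and Ceph version.

# 1) Persist a higher grace in the [mds] section of ceph.conf on the
#    MDS host. 600s is a guess; replaying a 2.6GB EOpen event may need
#    more.
[mds]
    mds beacon grace = 600

# 2) Or set it on the running daemon through its admin socket (run on
#    the MDS host; the socket usually still responds while the MDS is
#    in replay). "mds01" is a placeholder for your MDS id:
ceph daemon mds.mds01 config set mds_beacon_grace 600

# 3) The mons also consult this grace when deciding whether an MDS is
#    laggy, so applying it to each mon ("a" is a placeholder id) is a
#    reasonable precaution:
ceph tell mon.a injectargs '--mds_beacon_grace 600'

# 4) Restart the MDS and watch replay progress:
ceph -w

Once the MDS finishes replay and becomes active, the grace can be
dropped back to its default (15 seconds) so genuine failures are
detected promptly again.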