Fair enough. Anyway, is it safe to now increase the 'mds beacon grace'
to try and get the mds server functional again?

I realize there is nothing simple about the things that are being
accomplished here, and thank everyone for their hard work on making
this stuff work as well as it does.
--
Adam

On Fri, May 22, 2015 at 2:39 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Fri, May 22, 2015 at 12:34 PM, Adam Tygart <mozes@xxxxxxx> wrote:
>> I believe I grabbed all of these files:
>>
>> for x in $(rados -p metadata ls | grep -E '^200\.'); do rados -p metadata get ${x} /tmp/metadata/${x}; done
>> tar czSf journal.tar.gz /tmp/metadata
>>
>> https://drive.google.com/file/d/0B4XF1RWjuGh5MVFqVFZfNmpfQWc/view?usp=sharing
>>
>> When this crash occurred, the rsync process should have been going
>> through all of this sequentially, one file at a time. There weren't
>> simultaneous rsyncs, and hadn't been for at least a couple of days. I
>> would have hoped that if the files hadn't been touched in days,
>> *something* would have forcibly closed them and flushed them out of
>> the caches. I also would have thought that a file close/flush in rsync
>> (which I am sure it does after finishing writing a file) would have
>> let them close in the cephfs session.
>
> This is trickier than it sounds with the kernel dcache, unfortunately.
> We improved it a bit just last week, but we'll have to try to diagnose
> what happened in this case further before we can say whether it was
> that issue or something else.
> -Greg
>
>>
>> --
>> Adam
>>
>> On Fri, May 22, 2015 at 2:06 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>> On Fri, May 22, 2015 at 11:34 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>>> On Fri, May 22, 2015 at 11:47 AM, John Spray <john.spray@xxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> On 22/05/2015 15:33, Adam Tygart wrote:
>>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> The ceph-mds servers in our cluster are stuck in a constant
>>>>>> boot->replay->crash cycle.
>>>>>>
>>>>>> I have enabled debug logging for the mds for a restart cycle on
>>>>>> one of the nodes[1].
>>>>>
>>>>>
>>>>> You found a bug, or more correctly you probably found multiple bugs...
>>>>>
>>>>> It looks like your journal contains an EOpen event that lists 5307092
>>>>> open files. Because the MDS only drops its lock between events, not
>>>>> while processing a single one, this is causing the heartbeat map to
>>>>> think the MDS has locked up, so it's getting killed.
>>>>>
>>>>> So firstly, we have to fix this by adding appropriate calls to
>>>>> MDS::heartbeat_reset while iterating over lists of unbounded length in
>>>>> EMetaBlob::replay. That would fix the false death of the MDS resulting
>>>>> from the heartbeat expiry.
>>>>>
>>>>> Secondly, this EOpen was a 2.6GB log event. Something has almost
>>>>> certainly gone wrong when we see that data structure grow so large, so
>>>>> we should really be imposing some artificial cap there and catching the
>>>>> situation earlier, rather than journaling this monster event and only
>>>>> hitting issues during replay.
>>>>>
>>>>> Thirdly, something is apparently leading the MDS to think that 5 million
>>>>> files were open in this particular log segment. That seems improbable,
>>>>> given that I can only see a single client in action here. More
>>>>> investigation is needed to see how this happened. Can you describe the
>>>>> client workload that was going on in the run-up to the system breaking?
>>>>
>>>> We are in the process of rsyncing 60+ million files from our old
>>>> fileservers into the new cephfs stores. We are down to the last user,
>>>> and it seems he has at least 25% of the total files. We were copying
>>>> files at a rate of about 2 million a day before this crash.
>>>>
>>>>>
>>>>> Anyway, actions:
>>>>>
>>>>> 1. I'm assuming your metadata is not sensitive, as you have shared this
>>>>> debug log. Could you please use "cephfs-journal-tool journal export
>>>>> ~/journal.bin" to grab an offline copy of the raw journal, in case we
>>>>> need to look at it later (this might take a while since your journal
>>>>> seems so large, but the resulting file should compress reasonably well
>>>>> with "tar cSzf").
>>>>
>>>> Problematic. cephfs-journal-tool is giving me a stacktrace:
>>>> http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt
>>>>
>>>>> 2. Optimistically, you may be able to get out of this situation by
>>>>> modifying the mds_beacon_grace config option on the MDS (set it to
>>>>> something high). This will cause the MDS to continue sending beacons to
>>>>> the mons, even when a thread is failing to yield promptly (as in this
>>>>> case), thereby preventing the mons from regarding the MDS as failed.
>>>>> This will hopefully buy the MDS enough time to complete replay and come
>>>>> up, assuming it doesn't run out of memory in the process of dealing with
>>>>> whatever strangeness is in the journal.
>>>>
>>>> I can hold off on this for a bit if you have something else you'd like
>>>> me to save while it is in its broken state.
>>>
>>>
>>> Ugh. We appear to be trying to allocate too much memory for this event
>>> in the journal dump; we'll need to fix this. :(
>>> In the meantime, it's a harder workflow, but if you can grab all the
>>> objects in the metadata pool whose names start with "200.", that's the
>>> raw data the journal dumper is trying to grab, and we can assemble it
>>> ourselves for checking later on.
>>>
>>> That said, you might actually have that many files still "open" in
>>> some fashion from your clients. If you can dump caches on them
>>> periodically, and/or limit the number of files that rsync is working
>>> on at once, that should help keep this from happening again.
>>> -Greg
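
As a footnote to the thread: a minimal sketch of how one might raise
mds_beacon_grace along the lines John suggests, so the MDS can finish
replaying the oversized journal without being marked failed. The MDS id
"mds01", the mon id "a", the 600-second value, and touching the mons at
all are illustrative assumptions, not something stated in the thread;
adjust for your own deployment and Ceph version.

# 1) Persist a higher grace in the [mds] section of ceph.conf on the
#    MDS host. 600s is a guess; replaying a 2.6GB EOpen event may need
#    more.
[mds]
    mds beacon grace = 600

# 2) Or set it on the running daemon through its admin socket (run on
#    the MDS host; the socket usually still responds while the MDS is
#    in replay). "mds01" is a placeholder for your MDS id:
ceph daemon mds.mds01 config set mds_beacon_grace 600

# 3) The mons also consult this grace when deciding whether an MDS is
#    laggy, so applying it to each mon ("a" is a placeholder id) is a
#    reasonable precaution:
ceph tell mon.a injectargs '--mds_beacon_grace 600'

# 4) Restart the MDS and watch replay progress:
ceph -w

Once the MDS finishes replay and becomes active, the grace can be
dropped back to its default (15 seconds) so genuine failures are
detected promptly again.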