On Fri, May 22, 2015 at 11:47 AM, John Spray <john.spray@xxxxxxxxxx> wrote:
>
> On 22/05/2015 15:33, Adam Tygart wrote:
>>
>> Hello all,
>>
>> The ceph-mds servers in our cluster are performing a constant
>> boot->replay->crash in our systems.
>>
>> I have enabled debug logging for the mds for a restart cycle on one of
>> the nodes[1].
>
> You found a bug, or more correctly you probably found multiple bugs...
>
> It looks like your journal contains an EOpen event that lists 5307092 open
> files. Because the MDS only drops its lock between events, not during
> processing a single one, this is causing the heartbeat map to think the
> MDS has locked up, so it's getting killed.
>
> So firstly we have to fix this to have appropriate calls into
> MDS::heartbeat_reset while iterating over lists of unbounded length in
> EMetaBlob::replay. That would fix the false death of the MDS resulting
> from the heartbeat expiry.
>
> Secondly, this EOpen was a 2.6GB log event. Something has almost certainly
> gone wrong when we see that data structure grow so large, so we should
> really be imposing some artificial cap there and catching the situation
> earlier, rather than journalling this monster event and only hitting
> issues during replay.
>
> Thirdly, something is apparently leading the MDS to think that 5 million
> files were open in this particular log segment. It seems like an
> improbable situation given that I can only see a single client in action
> here. More investigation is needed to see how this happened. Can you
> describe the client workload that was going on in the run-up to the
> system breaking?

We are in the process of rsyncing 60+ million files from our old
fileservers into the new CephFS stores. We are down to the last user, and
it seems he has at least 25% of the total files. We were copying files at
a rate of about 2 million a day before this crash.

>
> Anyway, actions:
>
> 1. I'm assuming your metadata is not sensitive, as you have shared this
> debug log. Please could you use "cephfs-journal-tool journal export
> ~/journal.bin" to grab an offline copy of the raw journal, in case we
> need to look at it later (this might take a while since your journal
> seems so large, but the resulting file should compress reasonably well
> with "tar cSzf").

Problematic: cephfs-journal-tool is giving me a stack trace:
http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt

> 2. Optimistically, you may be able to get out of this situation by
> modifying the mds_beacon_grace config option on the MDS (set it to
> something high). This will cause the MDS to continue sending beacons to
> the mons, even when a thread is failing to yield promptly (as in this
> case), thereby preventing the mons from regarding the MDS as failed. This
> will hopefully buy the MDS enough time to complete replay and come up,
> assuming it doesn't run out of memory in the process of dealing with
> whatever strangeness is in the journal.

I can hold off on this for a bit if you have something else you'd like me
to save while it is in its broken state.

Thanks for the detailed writeup.
--
Adam

> 3. If your MDS eventually makes it through recovery, unmount your client
> and use "ceph daemon mds.<id> flush journal" to flush and trim the
> journal: this should result in a situation where the next time the MDS
> starts, the oversized journal entries are no longer present and startup
> should go smoothly.
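
One question on the heartbeat_reset fix you describe, just so I follow
what is happening to us: am I right that the change amounts to resetting
the heartbeat every so many entries while replaying a single huge event?
Something along the lines of the standalone sketch below -- placeholder
names and a made-up interval on my part, not actual Ceph code:

// Standalone illustration only -- placeholder names, not actual Ceph code.
// The idea: periodically reset the heartbeat while replaying one huge
// journal event, so the heartbeat map doesn't decide the daemon is hung.
#include <cstddef>
#include <cstdio>
#include <vector>

struct OpenFileEntry { long ino; };  // stand-in for one replayed open file

static void heartbeat_reset() {
    // In the real MDS this would touch the heartbeat map; here it just logs.
    std::printf("heartbeat reset\n");
}

static void replay_open_files(const std::vector<OpenFileEntry>& entries) {
    const std::size_t kResetEvery = 100000;  // arbitrary interval, my guess
    std::size_t processed = 0;
    for (const OpenFileEntry& entry : entries) {
        (void)entry;  // ... the actual per-entry replay work would go here ...
        if (++processed % kResetEvery == 0) {
            // Without something like this, a single multi-million-entry
            // event replays with no heartbeat at all and the MDS gets killed.
            heartbeat_reset();
        }
    }
}

int main() {
    // Sized to match the EOpen event from our journal.
    std::vector<OpenFileEntry> entries(5307092);
    replay_open_files(entries);
    return 0;
}

If that is roughly the shape of it, the 5307092-entry EOpen in our journal
would explain why the daemon never gets a chance to check in during replay.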
>
> Cheers,
> John

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com