On Fri, May 22, 2015 at 11:34 AM, Adam Tygart <mozes@xxxxxxx> wrote: > On Fri, May 22, 2015 at 11:47 AM, John Spray <john.spray@xxxxxxxxxx> wrote: >> >> >> On 22/05/2015 15:33, Adam Tygart wrote: >>> >>> Hello all, >>> >>> The ceph-mds servers in our cluster are performing a constant >>> boot->replay->crash in our systems. >>> >>> I have enable debug logging for the mds for a restart cycle on one of >>> the nodes[1]. >> >> >> You found a bug, or more correctly you probably found multiple bugs... >> >> It looks like your journal contains an EOpen event that lists 5307092 open >> files. Because the MDS only drops its lock between events, not during >> processing a single one, this is causing the heartbeat map to think the MDS >> has locked up, so it's getting killed. >> >> So firstly we have to fix this to have appropriate calls into >> MDS::heartbeat_reset while iterating over lists of unbounded length in >> EMetablob::replay. That would fix the false death of the MDS resulting from >> the heartbeat expiry. >> >> Secondly, this EOpen was a 2.6GB log event. Something has almost certainly >> gone wrong when we see that data structure grow so large, so we should >> really be imposing some artificial cap there and catching the situation >> earlier, rather than journal this monster event and only hitting issues >> during replay. >> >> Thirdly, something is apparently leading the MDS to think that 5 million >> files were open in this particular log segment. It seems like an improbable >> situation given that I can only see a single client in action here. More >> investigation needed to see how this happened. Can you describe the client >> workload that was going on in the run up to the system breaking? > > We are in the process of rsyncing 60+ million files from our old > fileservers into the new cephfs stores. We are down to the last user, > and it seems he has at least 25% of tthe total files. we were copying > files at a rate of about 2 million a day before this crash. > >> >> Anyway, actions: >> >> 1. I'm assuming your metadata is not sensitive, as you have shared this >> debug log. Please could you use "cephfs-journal-tool journal export >> ~/journal.bin" to grab an offline copy of the raw journal, in case we need >> to look at it later (this might take a while since your journal seems so >> large, but the resulting file should compress reasonably well with "tar >> cSzf"). > > Problematic. cephfs-journal-tool is giving me a stacktrace: > http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt > >> 2. optimistically, you may be able to get out of this situation by modifying >> the mds_beacon_grace config option on the MDS (set it to something high). >> This will cause the MDS to continue sending beacons to the mons, even when a >> thread is failing to yield promptly (as in this case), thereby preventing >> the mons from regarding the MDS as failed. This hopefully will buy the MDS >> enough time to complete replay and come up, assuming it doesn't run out of >> memory in the process of dealing with whatever strangeness is in the >> journal. > > I can hold off on this for a bit if you have something else you'd like > me to save while it is in its broken state. Ugh. We appear to be trying to allocate too much memory for this event in the journal dump; we'll need to fix this. :( In the meantime, it's a harder workflow but if you can grab all the objects in the metadata pool whose name starts with "200." that's the raw data that the journal dumper is trying to grab and we can assemble it ourselves for checking later on. That said, you might actually have that many files still "open" in some fashion from your clients. If you can dump caches on them periodically, and/or limit the number of files that rsync is working on at once, that should help keep this from happening again. -Greg _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com