On Wed, Dec 21, 2011 at 4:37 AM, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
> On Friday 02 December 2011 wrote Sage Weil:
>> On Fri, 2 Dec 2011, Amon Ott wrote:
>> > On Thursday 01 December 2011 you wrote:
>> > > On all four nodes of my test cluster, MDS crashes with a trace like
>> > > that in bug #1047. Example and ceph.conf attached. Ceph server side is
>> > > from git master, last commit ce6572273943ffdca4b7dc5344152d6c35106a2d.
>> > >
>> > > MDS does not start on any node here, it reliably crashes with that
>> > > assert.
>> >
>> > Does it make sense for you to keep the cluster in that broken state, so
>> > that we can reproduce that bug or test a potential fix? Otherwise, I
>> > would recreate the Ceph filesystem to make more tests. I also have a full
>> > log of one mds from start to crash here.
>>
>> Can you attach the log to #1047 for posterity? I'll take a quick look and
>> see if there is any further info to gain from the log. I'm guessing the
>> actual bug occurred before the crash, when the anchor table wasn't updated
>> properly, but there may be clues here.
>
> Did you find some time to look into this? The bug makes Ceph unusable for us
> even with moderate load. All mds instances die with the same assert, the only
> way to recover in that state is to recreate the complete ceph fs and restore
> backups.

Sage is gone on vacation right now (unless he decides not to be for a
while), but we've been focusing our efforts on the OSDs lately so I
don't think he's looked at it. I'll see if I can carve out some time
tomorrow or Friday, but I can't promise anything.

Alexandre, can you check this bug and make sure it looks like the same
one you reported as #1850?
-Greg