On Fri, 27 Jan 2012, Amon Ott wrote:
> On Thursday 29 December 2011 Amon Ott wrote:
> > I finally got the test cluster freed up for more Ceph testing.
> >
> > On Friday 23 December 2011 Gregory Farnum wrote:
> > > Unfortunately there's not enough info in this log either. If you can
> > > reproduce it with "mds debug = 20" and put that log somewhere, it
> > > ought to be enough to work out what's going on, though. Sorry. :(
> > > -Greg
> >
> > Here is what the MDS logs with debug 20. No idea if it really helps. The
> > cluster is still in the broken state; should I try to reproduce with a
> > recreated Ceph FS and debug 20? That could be GBs of logs.
>
> Update: I recreated the Ceph FS with release 0.40. It broke only because of
> a btrfs bug that hit two of the four nodes (after about one day of heavy
> load), and it recovered without problems once the nodes were back. I then
> recreated it with ext4 as the OSD storage area and have not managed to
> break it within four days, two of them under heavy load.
>
> This means that this bug is probably fixed. It might be related to the
> automatic reconnect of the MDS, which avoids metadata inconsistencies. :)

Yeah, I suspect the problem is related to MDS journal replay and the
two-phase-commit machinery around the anchor table updates. I think we should
keep this open until we can do MDS restart thrashing against a heavy link
workload. Unless there was something you found/fixed before, Greg?

Thanks for keeping an eye on this, Amon!

sage
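
[Editorial note: for anyone wanting to reproduce this with the logging Greg
suggested, a minimal sketch of the relevant ceph.conf entry on the MDS host
might look like the following. The section placement and option spelling are
assumptions based on typical Ceph configuration of that era, not something
confirmed in this thread.]

    [mds]
        ; verbose MDS logging as suggested above; expect very large logs
        debug mds = 20

The setting takes effect once the ceph-mds daemon is restarted (or the option
is injected at runtime), and as Amon notes the resulting logs can easily run
to gigabytes, so it is usually only left enabled while reproducing the problem.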