On Fri, 27 Jan 2012, Amon Ott wrote:
> On Thursday 29 December 2011 Amon Ott wrote:
> > I finally got the test cluster freed up for more Ceph testing.
> >
> > On Friday 23 December 2011 Gregory Farnum wrote:
> > > Unfortunately there's not enough info in this log either. If you can
> > > reproduce it with "mds debug = 20" and put that log somewhere, it
> > > ought to be enough to work out what's going on, though. Sorry. :(
> > > -Greg
> >
> > Here is what the MDS logs with debug 20. No idea if it really helps. The
> > cluster is still in the broken state; should I try to reproduce with a
> > recreated Ceph FS and debug 20? That could be GBs of logs.
>
> Update: I recreated the Ceph FS with release 0.40. It broke only because of
> a btrfs bug that hit two of the four nodes (after about one day of heavy
> load), and it recovered without problems once the nodes were back. I then
> recreated it with ext4 as the OSD storage area and have not managed to
> break it within four days, two of them under heavy load.
>
> This means that this bug is probably fixed. It might be related to the
> automatic reconnect of the MDS, which avoids metadata inconsistencies. :)

Yeah, I suspect the problem is related to MDS journal replay and the
two-phase-commit machinery around the anchor table updates. I think we should
keep this open until we can do MDS restart thrashing against a heavy link
workload. Unless there was something you found/fixed before, Greg?

Thanks for keeping an eye on this, Amon!

sage
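
[Editorial note: for anyone wanting to reproduce this with the logging Greg
suggested, a minimal sketch of the relevant ceph.conf entry on the MDS host
might look like the following. The section placement and option spelling are
assumptions based on typical Ceph configuration of that era, not something
confirmed in this thread.]

    [mds]
        ; verbose MDS logging as suggested above; expect very large logs
        debug mds = 20

The setting takes effect once the ceph-mds daemon is restarted (or the option
is injected at runtime), and as Amon notes the resulting logs can easily run
to gigabytes, so it is usually only left enabled while reproducing the problem.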