On Thu, Jan 19, 2012 at 12:53 AM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
> 2012/1/19 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
>> On Wednesday, January 18, 2012, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
>>> 2012/1/19 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
>>>> On Wed, Jan 18, 2012 at 12:48 PM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
>>>>> But I still don't know what happens with ceph, so it can't
>>>>> respond and hangs. This is not good behavior, because
>>>>> such a situation leads to an unresponsive cluster in case of
>>>>> a temporary network failure.
>>>>
>>>> I'm a little concerned about this: I would expect to see hangs of up
>>>> to ~30 seconds (the timeout period), but for operations to then
>>>> continue. Are you putting the MDS down? If so, do you have any
>>>> standbys specified?
>>>
>>> Yes, the MDS goes down (I restart it at some point while changing
>>> something in the config).
>>> Yes, I have 2 standbys.
>>> Clients hang for more than 10 minutes.
>>
>> Okay, so it's probably an issue with the MDS not entering recovery when it
>> should. Are you also taking down one of the monitor nodes? There's a known
>> bug which can cause a standby MDS to wait up to 15 minutes if its monitor
>> goes down; it is fixed in latest master (and maybe .40; I'd have to
>> check).
>
> Yes. I have collocated mon, mds, and osd on some nodes,
> and I restart all daemons at once. I use 0.40 (built from my github fork).

Hrm. I checked, and the fix is in 0.40. Can you reproduce this with client
logging enabled (--debug_ms 1 --debug_client 10) and post the logs somewhere
for me to check out? That should be able to isolate the problem area at least.
-Greg
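
[For anyone following along: a sketch of how the debug flags mentioned above might be enabled, assuming the userspace (ceph-fuse) client; the mount point /mnt/ceph is hypothetical, and the exact subsystem names should be checked against your ceph version.]

```shell
# Pass the debug flags directly to the userspace client at mount time
# (/mnt/ceph is a hypothetical mount point):
ceph-fuse --debug_ms 1 --debug_client 10 /mnt/ceph

# Or set the equivalent options persistently in ceph.conf so every
# client picks them up:
#   [client]
#       debug ms = 1
#       debug client = 10
```

The kernel client logs through a different mechanism, so this applies only to the userspace client. The resulting log should show messenger traffic (debug_ms) and client-side operation state (debug_client) around the time of the hang.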