2012/1/23 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
> On Thu, Jan 19, 2012 at 12:36 PM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
>> 2012/1/19 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
>>> On Thu, Jan 19, 2012 at 12:53 AM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
>>>> 2012/1/19 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
>>>>> On Wednesday, January 18, 2012, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
>>>>>> 2012/1/19 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
>>>>>>> On Wed, Jan 18, 2012 at 12:48 PM, Andrey Stepachev <octo47@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>>> But I still don't know what happens with ceph, so it can't
>>>>>>>> respond and hangs. This is not good behavior, because
>>>>>>>> such a situation leads to an unresponsive cluster in case of
>>>>>>>> a temporary network failure.
>>>>>>>
>>>>>>> I'm a little concerned about this — I would expect to see hangs of up
>>>>>>> to ~30 seconds (the timeout period), but for operations to then
>>>>>>> continue. Are you putting the MDS down? If so, do you have any
>>>>>>> standbys specified?
>>>>>>
>>>>>> Yes, the MDS goes down (I restart it at some point while changing
>>>>>> something in the config).
>>>>>> Yes, I have 2 standbys.
>>>>>> Clients hang for more than 10 minutes.
>>>>>
>>>>> Okay, so it's probably an issue with the MDS not entering recovery when it
>>>>> should. Are you also taking down one of the monitor nodes? There's a known
>>>>> bug which can cause a standby MDS to wait up to 15 minutes if its monitor
>>>>> goes down, which is fixed in latest master (and maybe .40; I'd have to
>>>>> check).
>>>>
>>>> Yes. I have collocated mon, mds, and osd on some nodes,
>>>> and I restart all daemons at once. I use 0.40 (built from my github fork).
>>>
>>> Hrm. I checked and the fix is in 0.40. Can you reproduce this with
>>> client logging enabled (--debug_ms 1 --debug_client 10) and post the
>>> logs somewhere for me to check out? That should be able to isolate the
>>> problem area at least.
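[For readers following along: the flags Greg suggests can also be made persistent in ceph.conf, so the client keeps logging across remounts. A minimal sketch; the section name is standard, but the log path is an assumption, not taken from this thread:

    ; hypothetical ceph.conf fragment: verbose client-side logging
    [client]
        debug ms = 1       ; messenger-level traffic logging
        debug client = 10  ; detailed client state (caps, MDS sessions)
        log file = /var/log/ceph/client.$pid.log  ; assumed log location

Higher debug levels produce more detail; 1/10 is what Greg asked for here.]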
>>
>> The client writes "renew caps" and nothing more.
>> I'd try to reproduce the problem with more logging, but still no luck.
>> Maybe the debug output serializes a race somewhere and prevents
>> this bug from occurring.
>
> Any updates on this? "renew caps" being the last thing in the log
> doesn't actually mean much, unfortunately. We're going to need logs of
> some description in order to give you any more help.

I've been switched to another urgent task now, so in a week or two I'll
return to Ceph and try to reproduce these hangs to find out what is
going on.

> -Greg

--
Andrey.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
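[For a one-off reproduction with a FUSE client, the same flags can be passed directly on the ceph-fuse command line. A sketch only; the monitor address, mountpoint, and log path are assumptions for illustration:

    # mount with verbose client debugging enabled
    # (mon-host:6789, /mnt/ceph, and the log path are placeholders)
    ceph-fuse -m mon-host:6789 /mnt/ceph \
        --debug_ms 1 --debug_client 10 \
        --log-file /var/log/ceph/client.repro.log

    # ...trigger the MDS restart and reproduce the hang, then:
    fusermount -u /mnt/ceph

The resulting log should show whether the client's MDS session ever moves past cap renewal into reconnect, which is what Greg needs to see.]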