On Thu, Jan 19, 2012 at 12:53 AM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
> 2012/1/19 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
>> On Wednesday, January 18, 2012, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
>>> 2012/1/19 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
>>>> On Wed, Jan 18, 2012 at 12:48 PM, Andrey Stepachev <octo47@xxxxxxxxx> wrote:
>>>>> But I still don't know what happens with ceph, so it can't
>>>>> respond and hangs. This is not good behavior, because
>>>>> such a situation leads to an unresponsive cluster in case of
>>>>> a temporary network failure.
>>>>
>>>> I'm a little concerned about this: I would expect to see hangs of up
>>>> to ~30 seconds (the timeout period), but for operations to then
>>>> continue. Are you putting the MDS down? If so, do you have any
>>>> standbys specified?
>>>
>>> Yes, the MDS goes down (I restart it at some point while changing
>>> something in the config).
>>> Yes, I have 2 standbys.
>>> Clients hang for more than 10 minutes.
>>
>> Okay, so it's probably an issue with the MDS not entering recovery when it
>> should. Are you also taking down one of the monitor nodes? There's a known
>> bug which can cause a standby MDS to wait up to 15 minutes if its monitor
>> goes down; it is fixed in latest master (and maybe .40; I'd have to
>> check).
>
> Yes. I have collocated mon, mds, and osd on some nodes,
> and I restart all daemons at once. I use 0.40 (built from my github fork).

Hrm. I checked, and the fix is in 0.40. Can you reproduce this with client
logging enabled (--debug_ms 1 --debug_client 10) and post the logs somewhere
for me to check out? That should be able to isolate the problem area at least.
-Greg
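
[For anyone following along: a sketch of how the debug flags mentioned above might be enabled, assuming the userspace (ceph-fuse) client; the mount point /mnt/ceph is hypothetical, and the exact subsystem names should be checked against your ceph version.]

```shell
# Pass the debug flags directly to the userspace client at mount time
# (/mnt/ceph is a hypothetical mount point):
ceph-fuse --debug_ms 1 --debug_client 10 /mnt/ceph

# Or set the equivalent options persistently in ceph.conf so every
# client picks them up:
#   [client]
#       debug ms = 1
#       debug client = 10
```

The kernel client logs through a different mechanism, so this applies only to the userspace client. The resulting log should show messenger traffic (debug_ms) and client-side operation state (debug_client) around the time of the hang.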