Re: OSD not coming back up again

Willem Jan Withagen <wjw@xxxxxxxxxxx> · Thu, 11 Aug 2016 13:13:54 +0200

On 11-8-2016 13:02, Wido den Hollander wrote:
>> Right before setting the osd to down, the newest map is 173.
>> So some maps have been exchanged....
>>
>> How does the OSD decide that it is healthy?
>> If it gets peer (ping) messages that it is up?
>>
> 
> Not 100% sure, but waiting for healthy is something I haven't seen before.
> 
> Is it incrementing the newest_map when the cluster advances?
> 
>>> Maybe try debug_osd = 20
>>
>> But then still I need to know what to look for, since 20 generates
>> serious output.
>>
> True, it will. But I don't know exactly what to look for. debug_osd = 20 might reveal more information there.
> 
> It might be a very simple log line which tells you what is going on.

See my other mail: the log did not reveal much,
other than that it made me look at the sockets.

But looking at the socket-states, I think the sockets on the OSD that is
going down are not correctly closed. And so osd.0 thinks it is still
connected. And osd.1 and osd.2 are without connection to osd.0, so they
are correct in suggesting that it is dead.
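
A rough sketch of how such a socket-state check could be scripted, assuming
`netstat -an` style output captured on the host of the OSD in question. The
sample text, the port numbers and the helper names are made up for
illustration; they are not taken from the actual cluster or from Ceph itself.
The idea is only to group TCP states per remote endpoint, so that connections
one side still holds open (ESTABLISHED, CLOSE_WAIT) stand out against what
the peers report.

```python
from collections import Counter


def split_endpoint(endpoint):
    """Split 'addr:port' (Linux) or 'addr.port' (FreeBSD) into (addr, port)."""
    idx = max(endpoint.rfind(":"), endpoint.rfind("."))
    return endpoint[:idx], endpoint[idx + 1:]


def tcp_states_by_peer(netstat_text, local_ports):
    """Count TCP states per remote endpoint for sockets on the given local ports."""
    states = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        _, local_port = split_endpoint(local)
        if local_port not in local_ports:
            continue
        states.setdefault(foreign, Counter())[state] += 1
    return states


# Hypothetical capture; addresses and ports are placeholders.
SAMPLE = """\
tcp4       0      0 127.0.0.1.6800    127.0.0.1.51234   ESTABLISHED
tcp4       0      0 127.0.0.1.6800    127.0.0.1.51240   CLOSE_WAIT
tcp4       0      0 127.0.0.1.6804    127.0.0.1.51300   ESTABLISHED
"""

if __name__ == "__main__":
    for peer, counts in tcp_states_by_peer(SAMPLE, {"6800"}).items():
        print(peer, dict(counts))
```

Running the same check on each OSD host and comparing the results would show
whether one side still lists a connection that the other side no longer has.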

--WjW
