Re: OSD's hang after network blip


 



On Wednesday, January 15, 2020 14:37 GMT, "Nick Fisk" <nick@xxxxxxxxxx> wrote: 
 
> Hi All,
> 
> Running 14.2.5, and currently experiencing some network blips isolated to a single rack, which is under investigation. However, it appears that following a network blip, random OSDs in unaffected racks sometimes fail to recover from the incident and are left running in a zombie state. The OSDs appear to be running from a process perspective, but the cluster thinks they are down, and they will not rejoin the cluster until the OSD process is restarted, which incidentally takes a lot longer than usual (the systemctl command takes a couple of minutes to complete).
> 
> If the OSD is left in this state, the CPU and memory usage of the process appears to climb, but it never rejoins, at least not in the several hours I have left them. I'm not exactly sure what the OSD is trying to do during this period. There's nothing in the logs during this hung state to indicate that anything is happening, but I will try to inject more verbose logging the next time it occurs.
> 
> Not sure if anybody has come across this before or has any ideas? In the past, as long as OSDs have been running they have always rejoined following any network issues.
> 
> Nick
> 
> A sample from the OSD and cluster logs is below. The blip happened at 12:06, and I restarted the OSD at 12:26.
> 
> OSD logs from the OSD that hung (note: this OSD was not directly affected by the network outage):
> 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:1

<snip>
 
It's just happened again, and I managed to pull this out of the debug_osd 20 output:

2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.182 v2:[2a03:25e0:253:5::76]:6839/8394683 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.184 v2:[2a03:25e0:253:5::76]:6814/7394522 says i am down in 2343138
2020-01-15 16:29:01.464 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.190 v2:[2a03:25e0:253:5::76]:6860/5986687 says i am down in 2343138
2020-01-15 16:29:01.668 7ff1763df700 10 osd.87 2343121 handle_osd_ping osd.19 v2:[2a03:25e0:253:5::12]:6815/5153900 says i am down in 2343138
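
(For reference, injecting debug_osd 20 at runtime like this can be done roughly as follows, assuming you have access to the admin socket on the OSD host:

sudo ceph daemon osd.87 config set debug_osd 20      # via the local admin socket
ceph tell osd.87 injectargs '--debug_osd 20/20'      # or via the mons/messenger

Remember to set it back down afterwards, as level 20 is very chatty; the admin socket route is probably the safer bet here, given the daemon isn't talking to the cluster properly.)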

And this from the daemon status output:
sudo ceph daemon osd.87 status
{
    "cluster_fsid": "c1703b54-b4cd-41ab-a3ba-4fab241b62f3",
    "osd_fsid": "0cd8fe7d-17be-4982-b76f-ef1cbed0c19b",
    "whoami": 87,
    "state": "waiting_for_healthy",
    "oldest_map": 2342407,
    "newest_map": 2343121,
    "num_pgs": 218
}

So the OSD doesn't seem to be getting the latest map from the mons. Map 2343138 obviously has osd.87 marked down, hence the messages from the osd_pings. But I'm guessing the latest map the OSD has, 2343121, still has it marked up, so it never tries to "re-connect"?
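
A quick way to confirm the map lag (assuming mon access from an admin node) is to compare the cluster's current osdmap epoch with what the stuck daemon reports over its admin socket, something like:

ceph osd dump | head -n 1        # first line shows the cluster's current osdmap epoch
sudo ceph daemon osd.87 status   # "newest_map" is the latest epoch this OSD has seen

Here newest_map (2343121) is behind the epoch the other OSDs are pinging from (2343138), which fits with the OSD sitting in waiting_for_healthy and never catching up on maps.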

This seems similar to this post from a few years back, which didn't seem to end with any form of resolution:
https://www.spinics.net/lists/ceph-devel/msg31788.html

I also found this PR for Nautilus which suggested it might be a fix for the issue, although it should already be part of the release I'm running:
https://github.com/ceph/ceph/pull/23958
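
For completeness, it's probably worth confirming that the daemons really are all on 14.2.5 (so the change from that PR is actually in the running binaries), in case something older is still running somewhere, e.g.:

ceph versions               # summary of versions across all running daemons
ceph tell osd.87 version    # version reported by this specific OSD (if it responds)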

Nick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


