Hi Stefan,

We had been seeing OSDs OOMing on 14.2.13, but on a larger scale. In our case we hit some bugs with pg_log memory growth and buffer_anon memory growth. Can you check what's taking up the memory on the OSD with the following command?

ceph daemon osd.123 dump_mempools

(I've put a rough sketch of how I usually read that output at the very bottom of this mail, below the quoted thread.)

Cheers,
Kalle

----- Original Message ----- > From: "Stefan Wild" <swild@xxxxxxxxxxxxx> > To: "Igor Fedotov" <ifedotov@xxxxxxx>, "ceph-users" <ceph-users@xxxxxxx> > Sent: Sunday, 13 December, 2020 14:46:44 > Subject: Re: OSD reboot loop after running out of memory > Hi Igor, > > Full osd logs from startup to failed exit: > https://tiltworks.com/osd.1.log > > In other news, can I expect osd.10 to go down next? > > Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug > 2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000 > front 2020-12-13T12:39:43.311164+0000 (oldest deadline > 2020-12-13T12:40:06.810981+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug > 2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000 > front 2020-12-13T12:39:42.972702+0000 (oldest deadline > 2020-12-13T12:40:05.272435+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug > 2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000 > front 2020-12-13T12:39:42.181856+0000 (oldest deadline > 2020-12-13T12:40:06.281648+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:15.171+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.2 > is reporting failure:0 > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:15.171+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.2 > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513 : cluster > [DBG] osd.10 failure report canceled by osd.2 > Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug > 2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000 > front 2020-12-13T12:39:43.326666+0000 (oldest deadline > 2020-12-13T12:40:07.426786+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:15.423+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.6 > is reporting failure:0 > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:15.423+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.6 > Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug > 2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.770822+0000 > front 2020-12-13T12:39:39.770700+0000 (oldest deadline > 2020-12-13T12:40:05.070662+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug > 2020-12-13T12:40:15.687+0000 7fa8e1800700 -1 osd.4 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.977106+0000 > front 
2020-12-13T12:39:39.977176+0000 (oldest deadline > 2020-12-13T12:40:04.677320+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug > 2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000 > front 2020-12-13T12:39:43.311164+0000 (oldest deadline > 2020-12-13T12:40:06.810981+0000) > Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug > 2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000 > front 2020-12-13T12:39:42.972702+0000 (oldest deadline > 2020-12-13T12:40:05.272435+0000) > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.179+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.4 > is reporting failure:0 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.179+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.4 > Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug > 2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000 > front 2020-12-13T12:39:42.181856+0000 (oldest deadline > 2020-12-13T12:40:06.281648+0000) > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514 : cluster > [DBG] osd.10 failure report canceled by osd.6 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515 : cluster > [DBG] osd.10 failure report canceled by osd.4 > Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug > 2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000 > front 2020-12-13T12:39:43.326666+0000 (oldest deadline > 2020-12-13T12:40:07.426786+0000) > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.371+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.3 > is reporting failure:0 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.371+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.3 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.611+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.7 > is reporting failure:0 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.611+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.7 > Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug > 2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000 > front 2020-12-13T12:39:42.972702+0000 (oldest deadline > 2020-12-13T12:40:05.272435+0000) > Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug > 2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:43.326792+0000 > front 2020-12-13T12:39:43.326666+0000 (oldest deadline > 2020-12-13T12:40:07.426786+0000) > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516 : cluster > [DBG] osd.10 failure report canceled by osd.3 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517 : cluster > [DBG] osd.10 failure report canceled by osd.7 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:17.727+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.0 > is reporting failure:0 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:17.727+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.0 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:17.839+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.5 > is reporting failure:0 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:17.839+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.5 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518 : cluster > [DBG] osd.10 failure report canceled by osd.0 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519 : cluster > [DBG] osd.10 failure report canceled by osd.5 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:18.575+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.11 > is reporting failure:0 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:18.575+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.11 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:18.783+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.8 > is reporting failure:0 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:18.783+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.8 > Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520 : cluster > [DBG] osd.10 failure report canceled by osd.11 > Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521 : cluster > [DBG] osd.10 failure report canceled by osd.8 > > > Thanks, > Stefan > > >On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov@xxxxxxx> wrote: > > Hi Stefan, > > could you please share OSD startup log from /var/log/ceph? > > > Thanks, > > Igor > > On 12/13/2020 5:44 AM, Stefan Wild wrote: > > Just had another look at the logs and this is what I did notice after the > > affected OSD starts up. 
> > > > Loads of entries of this sort: > > > > Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug > > 2020-12-13T02:38:40.851+0000 7fafd32c7700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fafb721f700' had timed out after 15 > > > > Then a few pages of this: > > > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13024 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13026 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 1 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238> > > 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( > > v 3437'1759161 (3437'1759161,3437'175916 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237> > > 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( > > v 3437'1759161 (3437'1759161,3437'175916 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236> > > 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( > > v 3437'1759161 (3437'1759161,3437'175916 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235> > > 2020-12-13T02:35:44.022+0000 7fafb521b700 1 osd.1 pg_epoch: 13143 pg[19.3bs10( > > v 3437'1759161 (3437'1759161,3437'175916 > > > > And this is where it crashes: > > > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232> > > 2020-12-13T02:35:44.022+0000 7fafd02c1700 0 log_channel(cluster) log [DBG] : > > purged_snaps scrub starts > > Dec 12 21:38:57 ceph-tpa-server1 systemd[1]: > > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Main process exited, > > code=exited, status=1/FAILURE > > Dec 12 
21:38:59 ceph-tpa-server1 systemd[1]: > > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Failed with result > > 'exit-code'. > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: > > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Service hold-off time > > over, scheduling restart. > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: > > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Scheduled restart job, > > restart counter is at 1. > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph osd.1 for > > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142. > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph osd.1 for > > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142... > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph osd.1 for > > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142. > > > > Hope that helps… > > > > > > Thanks, > > Stefan > > > > > > From: Stefan Wild <swild@xxxxxxxxxxxxx> > > Date: Saturday, December 12, 2020 at 9:35 PM > > To: "ceph-users@xxxxxxx" <ceph-users@xxxxxxx> > > Subject: OSD reboot loop after running out of memory > > > > Hi, > > > > We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one > > of the servers ran out of memory for unknown reasons (normally the machine uses > > about 60 out of 128 GB). Since then, some OSDs on that machine get caught in an > > endless restart loop. Logs will just mention systemd seeing the daemon fail and > > then restarting it. Since the out of memory incident, we’ve had 3 OSDs fail > > this way at separate times. We resorted to wiping the affected OSD and > > re-adding it to the cluster, but it seems as soon as all PGs have moved to the > > OSD, the next one fails. > > > > This is also keeping us from re-deploying RGW, which was affected by the same > > out of memory incident, since cephadm runs a check and won’t deploy the service > > unless the cluster is in HEALTH_OK status. > > > > Any help would be greatly appreciated. > > > > Thanks, > > Stefan
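
P.S. Since you'll be looking at the dump_mempools JSON anyway, here is a rough sketch of how one could pick out the biggest consumers (e.g. osd_pglog vs buffer_anon). It assumes the output keeps the usual mempool -> by_pool layout with per-pool items/bytes counters, that jq is installed on the OSD host, and it uses osd.1 and a /tmp path purely as examples; adjust to whichever daemon is actually ballooning.

  # on the host running the OSD, via its admin socket
  ceph daemon osd.1 dump_mempools > /tmp/osd.1.mempools.json

  # largest pools first (assumes .mempool.by_pool.<name>.bytes in the JSON)
  jq -r '.mempool.by_pool | to_entries | sort_by(.value.bytes) | reverse | .[] | "\(.value.bytes)\t\(.key)"' /tmp/osd.1.mempools.json | head

  # rough total tracked by the mempools, to compare against the ceph-osd RSS
  jq '[.mempool.by_pool[].bytes] | add' /tmp/osd.1.mempools.json

If osd_pglog dominates, that points at the pg_log growth issue we hit; if buffer_anon does, it's the buffer_anon one.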