Hi Stefan,

We had been seeing OSDs OOMing on 14.2.13, but on a larger scale. In our case we hit some bugs with pg_log memory growth and buffer_anon memory growth. Can you check what's taking up the memory on the OSD with the following command?

ceph daemon osd.123 dump_mempools

(I've put a rough sketch of how I usually read that output at the very bottom of this mail, below the quoted thread.)

Cheers,
Kalle

----- Original Message ----- > From: "Stefan Wild" <swild@xxxxxxxxxxxxx> > To: "Igor Fedotov" <ifedotov@xxxxxxx>, "ceph-users" <ceph-users@xxxxxxx> > Sent: Sunday, 13 December, 2020 14:46:44 > Subject: Re: OSD reboot loop after running out of memory > Hi Igor, > > Full osd logs from startup to failed exit: > https://tiltworks.com/osd.1.log > > In other news, can I expect osd.10 to go down next? > > Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug > 2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000 > front 2020-12-13T12:39:43.311164+0000 (oldest deadline > 2020-12-13T12:40:06.810981+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug > 2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000 > front 2020-12-13T12:39:42.972702+0000 (oldest deadline > 2020-12-13T12:40:05.272435+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug > 2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000 > front 2020-12-13T12:39:42.181856+0000 (oldest deadline > 2020-12-13T12:40:06.281648+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:15.171+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.2 > is reporting failure:0 > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:15.171+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.2 > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513 : cluster > [DBG] osd.10 failure report canceled by osd.2 > Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug > 2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000 > front 2020-12-13T12:39:43.326666+0000 (oldest deadline > 2020-12-13T12:40:07.426786+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:15.423+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.6 > is reporting failure:0 > Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:15.423+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.6 > Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug > 2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.770822+0000 > front 2020-12-13T12:39:39.770700+0000 (oldest deadline > 2020-12-13T12:40:05.070662+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug > 2020-12-13T12:40:15.687+0000 7fa8e1800700 -1 osd.4 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.977106+0000 > front 
2020-12-13T12:39:39.977176+0000 (oldest deadline > 2020-12-13T12:40:04.677320+0000) > Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug > 2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000 > front 2020-12-13T12:39:43.311164+0000 (oldest deadline > 2020-12-13T12:40:06.810981+0000) > Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug > 2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000 > front 2020-12-13T12:39:42.972702+0000 (oldest deadline > 2020-12-13T12:40:05.272435+0000) > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.179+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.4 > is reporting failure:0 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.179+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.4 > Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug > 2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000 > front 2020-12-13T12:39:42.181856+0000 (oldest deadline > 2020-12-13T12:40:06.281648+0000) > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514 : cluster > [DBG] osd.10 failure report canceled by osd.6 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515 : cluster > [DBG] osd.10 failure report canceled by osd.4 > Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug > 2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000 > front 2020-12-13T12:39:43.326666+0000 (oldest deadline > 2020-12-13T12:40:07.426786+0000) > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.371+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.3 > is reporting failure:0 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.371+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.3 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.611+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.7 > is reporting failure:0 > Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:16.611+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.7 > Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug > 2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000 > front 2020-12-13T12:39:42.972702+0000 (oldest deadline > 2020-12-13T12:40:05.272435+0000) > Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug > 2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no > reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:43.326792+0000 > front 2020-12-13T12:39:43.326666+0000 (oldest deadline > 2020-12-13T12:40:07.426786+0000) > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516 : cluster > [DBG] osd.10 failure report canceled by osd.3 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517 : cluster > [DBG] osd.10 failure report canceled by osd.7 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:17.727+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.0 > is reporting failure:0 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:17.727+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.0 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:17.839+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.5 > is reporting failure:0 > Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:17.839+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.5 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518 : cluster > [DBG] osd.10 failure report canceled by osd.0 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519 : cluster > [DBG] osd.10 failure report canceled by osd.5 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:18.575+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.11 > is reporting failure:0 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:18.575+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.11 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:18.783+0000 7fe929be8700 1 mon.ceph-tpa-server1@0(leader).osd > e13375 prepare_failure osd.10 > [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.8 > is reporting failure:0 > Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug > 2020-12-13T12:40:18.783+0000 7fe929be8700 0 log_channel(cluster) log [DBG] : > osd.10 failure report canceled by osd.8 > Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520 : cluster > [DBG] osd.10 failure report canceled by osd.11 > Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster > 2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521 : cluster > [DBG] osd.10 failure report canceled by osd.8 > > > Thanks, > Stefan > > >On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov@xxxxxxx> wrote: > > Hi Stefan, > > could you please share OSD startup log from /var/log/ceph? > > > Thanks, > > Igor > > On 12/13/2020 5:44 AM, Stefan Wild wrote: > > Just had another look at the logs and this is what I did notice after the > > affected OSD starts up. 
> > > > Loads of entries of this sort: > > > > Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug > > 2020-12-13T02:38:40.851+0000 7fafd32c7700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fafb721f700' had timed out after 15 > > > > Then a few pages of this: > > > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 5 osd.1 pg_epoch: 13024 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13024 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245> > > 2020-12-13T02:35:44.018+0000 7fafb621d700 1 osd.1 pg_epoch: 13026 pg[28.11( > > empty local-lis/les=13015/13016 n=0 ec=1530 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 1 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239> > > 2020-12-13T02:35:44.022+0000 7fafb721f700 5 osd.1 pg_epoch: 13143 pg[19.69s2( > > v 3437'1753192 (3437'1753192,3437'1753192 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238> > > 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( > > v 3437'1759161 (3437'1759161,3437'175916 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237> > > 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( > > v 3437'1759161 (3437'1759161,3437'175916 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236> > > 2020-12-13T02:35:44.022+0000 7fafb521b700 5 osd.1 pg_epoch: 13143 pg[19.3bs10( > > v 3437'1759161 (3437'1759161,3437'175916 > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235> > > 2020-12-13T02:35:44.022+0000 7fafb521b700 1 osd.1 pg_epoch: 13143 pg[19.3bs10( > > v 3437'1759161 (3437'1759161,3437'175916 > > > > And this is where it crashes: > > > > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232> > > 2020-12-13T02:35:44.022+0000 7fafd02c1700 0 log_channel(cluster) log [DBG] : > > purged_snaps scrub starts > > Dec 12 21:38:57 ceph-tpa-server1 systemd[1]: > > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Main process exited, > > code=exited, status=1/FAILURE > > Dec 12 
21:38:59 ceph-tpa-server1 systemd[1]: > > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Failed with result > > 'exit-code'. > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: > > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Service hold-off time > > over, scheduling restart. > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: > > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Scheduled restart job, > > restart counter is at 1. > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph osd.1 for > > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142. > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph osd.1 for > > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142... > > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph osd.1 for > > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142. > > > > Hope that helps… > > > > > > Thanks, > > Stefan > > > > > > From: Stefan Wild <swild@xxxxxxxxxxxxx> > > Date: Saturday, December 12, 2020 at 9:35 PM > > To: "ceph-users@xxxxxxx" <ceph-users@xxxxxxx> > > Subject: OSD reboot loop after running out of memory > > > > Hi, > > > > We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one > > of the servers ran out of memory for unknown reasons (normally the machine uses > > about 60 out of 128 GB). Since then, some OSDs on that machine get caught in an > > endless restart loop. Logs will just mention systemd seeing the daemon fail and > > then restarting it. Since the out of memory incident, we’ve had 3 OSDs fail > > this way at separate times. We resorted to wiping the affected OSD and > > re-adding it to the cluster, but it seems as soon as all PGs have moved to the > > OSD, the next one fails. > > > > This is also keeping us from re-deploying RGW, which was affected by the same > > out of memory incident, since cephadm runs a check and won’t deploy the service > > unless the cluster is in HEALTH_OK status. > > > > Any help would be greatly appreciated. > > > > Thanks, > > Stefan
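
P.S. Since you'll be looking at the dump_mempools JSON anyway, here is a rough sketch of how one could pick out the biggest consumers (e.g. osd_pglog vs buffer_anon). It assumes the output keeps the usual mempool -> by_pool layout with per-pool items/bytes counters, that jq is installed on the OSD host, and it uses osd.1 and a /tmp path purely as examples; adjust to whichever daemon is actually ballooning.

  # on the host running the OSD, via its admin socket
  ceph daemon osd.1 dump_mempools > /tmp/osd.1.mempools.json

  # largest pools first (assumes .mempool.by_pool.<name>.bytes in the JSON)
  jq -r '.mempool.by_pool | to_entries | sort_by(.value.bytes) | reverse | .[] | "\(.value.bytes)\t\(.key)"' /tmp/osd.1.mempools.json | head

  # rough total tracked by the mempools, to compare against the ceph-osd RSS
  jq '[.mempool.by_pool[].bytes] | add' /tmp/osd.1.mempools.json

If osd_pglog dominates, that points at the pg_log growth issue we hit; if buffer_anon does, it's the buffer_anon one.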