Re: OSD reboot loop after running out of memory

Hi Kalle,

Memory usage has been back on track for the OSDs since the OOM crash. I don't know what caused it at the time, but until all OSDs were back up together, each of them (10 TiB capacity, 7 TiB used) ballooned to over 15 GB of memory used. I'm happy to dump the stats if they retain any history from two weeks ago, but ballooning and running out of memory is no longer the issue.
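
For reference, this is roughly how I would capture the current numbers if that is still useful (a sketch; osd.1 is just an example ID, and the heap stats only apply if the build uses tcmalloc):

    # per-pool memory accounting via the admin socket
    ceph daemon osd.1 dump_mempools
    # allocator-level view of the same daemon
    ceph tell osd.1 heap stats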

Thanks,
Stefan

________________________________
From: Kalle Happonen <kalle.happonen@xxxxxx>
Sent: Monday, December 14, 2020 5:00:17 AM
To: huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx>
Cc: Stefan Wild <swild@xxxxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re:  Re: OSD reboot loop after running out of memory

Hi Samuel,
I think we're hitting some niche cases. Most of our experience (and links to other posts) is here.

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/EWPPEMPAJQT6GGYSHM7GIM3BZWS2PSUY/

For the pg_log issue, the default of 3000 pg_log entries per PG might be too large for some installations, depending on your PG count. We have set it to 400.
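
If it helps, this is roughly how such a cap can be applied cluster-wide (a sketch; adjust the value to your PG count, and note that osd_min_pg_log_entries / osd_max_pg_log_entries are the option names I believe are relevant here):

    ceph config set osd osd_min_pg_log_entries 400
    ceph config set osd osd_max_pg_log_entries 400
    # confirm what a running OSD actually uses
    ceph config show osd.1 | grep pg_log_entries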

For the buffer_anon problem, there is some speculation that it started when buffer_anon trimming changed. I assume it will be fixed in a new version; these two PRs may be candidates for fixes.

https://github.com/ceph/ceph/pull/35171
https://github.com/ceph/ceph/pull/35584

Cheers,
Kalle

----- Original Message -----
> From: huxiaoyu@xxxxxxxxxxxx
> To: "Kalle Happonen" <kalle.happonen@xxxxxx>, "Stefan Wild" <swild@xxxxxxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxx>
> Sent: Monday, 14 December, 2020 10:27:57
> Subject: Re:  Re: OSD reboot loop after running out of memory

> Hello, Kalle,
>
> Your comments about the bugs with pg_log memory and buffer_anon memory growth
> worry me a lot, as I am planning to build a cluster with the latest Nautilus
> version.
>
> Could you please comment on how to safely deal with these bugs, or how to avoid
> them if they do occur?
>
> thanks a lot,
>
> samuel
>
>
>
> huxiaoyu@xxxxxxxxxxxx
>
> From: Kalle Happonen
> Date: 2020-12-14 08:28
> To: Stefan Wild
> CC: ceph-users
> Subject:  Re: OSD reboot loop after running out of memory
> Hi Stefan,
> we had been seeing OSDs OOMing on 14.2.13, but on a larger scale. In our case we
> hit some bugs with pg_log memory growth and buffer_anon memory growth. Can
> you check what's taking up the memory on the OSD with the following command?
>
> ceph daemon osd.123 dump_mempools
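>
> If the full output is noisy, the two pools of interest can be pulled out like this
> (a sketch, assuming jq is available and the JSON layout I remember, i.e.
> mempool.by_pool with osd_pglog and buffer_anon entries):
>
>     ceph daemon osd.123 dump_mempools | jq '.mempool.by_pool | {osd_pglog, buffer_anon}'
>
> Large "bytes" values for either pool, relative to osd_memory_target, would point at
> the bugs above.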
>
>
> Cheers,
> Kalle
>
> ----- Original Message -----
>> From: "Stefan Wild" <swild@xxxxxxxxxxxxx>
>> To: "Igor Fedotov" <ifedotov@xxxxxxx>, "ceph-users" <ceph-users@xxxxxxx>
>> Sent: Sunday, 13 December, 2020 14:46:44
>> Subject:  Re: OSD reboot loop after running out of memory
>
>> Hi Igor,
>>
>> Full osd logs from startup to failed exit:
>> https://tiltworks.com/osd.1.log
>>
>> In other news, can I expect osd.10 to go down next?
>>
>> Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug
>> 2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000
>> front 2020-12-13T12:39:43.311164+0000 (oldest deadline
>> 2020-12-13T12:40:06.810981+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000
>> front 2020-12-13T12:39:42.972702+0000 (oldest deadline
>> 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug
>> 2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000
>> front 2020-12-13T12:39:42.181856+0000 (oldest deadline
>> 2020-12-13T12:40:06.281648+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+0000 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.2
>> is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.171+0000 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513 : cluster
>> [DBG] osd.10 failure report canceled by osd.2
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000
>> front 2020-12-13T12:39:43.326666+0000 (oldest deadline
>> 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.423+0000 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.6
>> is reporting failure:0
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:15.423+0000 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.6
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug
>> 2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.770822+0000
>> front 2020-12-13T12:39:39.770700+0000 (oldest deadline
>> 2020-12-13T12:40:05.070662+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug
>> 2020-12-13T12:40:15.687+0000 7fa8e1800700 -1 osd.4 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:39.977106+0000
>> front 2020-12-13T12:39:39.977176+0000 (oldest deadline
>> 2020-12-13T12:40:04.677320+0000)
>> Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug
>> 2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.310905+0000
>> front 2020-12-13T12:39:43.311164+0000 (oldest deadline
>> 2020-12-13T12:40:06.810981+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000
>> front 2020-12-13T12:39:42.972702+0000 (oldest deadline
>> 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.179+0000 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.4
>> is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.179+0000 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.4
>> Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug
>> 2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.181904+0000
>> front 2020-12-13T12:39:42.181856+0000 (oldest deadline
>> 2020-12-13T12:40:06.281648+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514 : cluster
>> [DBG] osd.10 failure report canceled by osd.6
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515 : cluster
>> [DBG] osd.10 failure report canceled by osd.4
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000
>> front 2020-12-13T12:39:43.326666+0000 (oldest deadline
>> 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.371+0000 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.3
>> is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.371+0000 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.3
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.611+0000 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.7
>> is reporting failure:0
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:16.611+0000 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.7
>> Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug
>> 2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:42.972558+0000
>> front 2020-12-13T12:39:42.972702+0000 (oldest deadline
>> 2020-12-13T12:40:05.272435+0000)
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug
>> 2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375 heartbeat_check: no
>> reply from 172.18.189.20:6878 osd.10 since back 2020-12-13T12:39:43.326792+0000
>> front 2020-12-13T12:39:43.326666+0000 (oldest deadline
>> 2020-12-13T12:40:07.426786+0000)
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516 : cluster
>> [DBG] osd.10 failure report canceled by osd.3
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517 : cluster
>> [DBG] osd.10 failure report canceled by osd.7
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.727+0000 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.0
>> is reporting failure:0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.727+0000 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.839+0000 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.5
>> is reporting failure:0
>> Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:17.839+0000 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.5
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518 : cluster
>> [DBG] osd.10 failure report canceled by osd.0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519 : cluster
>> [DBG] osd.10 failure report canceled by osd.5
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.575+0000 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.11
>> is reporting failure:0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.575+0000 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.11
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.783+0000 7fe929be8700  1 mon.ceph-tpa-server1@0(leader).osd
>> e13375 prepare_failure osd.10
>> [v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] from osd.8
>> is reporting failure:0
>> Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug
>> 2020-12-13T12:40:18.783+0000 7fe929be8700  0 log_channel(cluster) log [DBG] :
>> osd.10 failure report canceled by osd.8
>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520 : cluster
>> [DBG] osd.10 failure report canceled by osd.11
>> Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster
>> 2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521 : cluster
>> [DBG] osd.10 failure report canceled by osd.8
>>
>>
>> Thanks,
>> Stefan
>>
>>
>> On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov@xxxxxxx> wrote:
>>
>>    Hi Stefan,
>>
>>    could you please share OSD startup log from /var/log/ceph?
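>>
>>    (If the OSDs are containerized under cephadm, /var/log/ceph on the host may be
>>    empty; in that case something like the following should get the same startup log.
>>    This is a sketch, with osd.1 as the example daemon and <fsid> standing in for the
>>    cluster fsid from the systemd unit names:
>>
>>        cephadm logs --name osd.1
>>        # or directly via journald
>>        journalctl -u ceph-<fsid>@osd.1.service
>>
>>    Alternatively, "ceph config set global log_to_file true" re-enables file logging.)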
>>
>>
>>    Thanks,
>>
>>    Igor
>>
>>    On 12/13/2020 5:44 AM, Stefan Wild wrote:
>>    > Just had another look at the logs and this is what I noticed after the
>>    > affected OSD starts up.
>>    >
>>    > Loads of entries of this sort:
>>    >
>>    > Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug
>>    > 2020-12-13T02:38:40.851+0000 7fafd32c7700  1 heartbeat_map is_healthy
>>    > 'OSD::osd_op_tp thread 0x7fafb721f700' had timed out after 15
>>    >
>>    > Then a few pages of this:
>>    >
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9249>
>>    > 2020-12-13T02:35:44.018+0000 7fafb621d700  5 osd.1 pg_epoch: 13024 pg[28.11(
>>    > empty local-lis/les=13015/13016 n=0 ec=1530
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9248>
>>    > 2020-12-13T02:35:44.018+0000 7fafb621d700  5 osd.1 pg_epoch: 13024 pg[28.11(
>>    > empty local-lis/les=13015/13016 n=0 ec=1530
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9247>
>>    > 2020-12-13T02:35:44.018+0000 7fafb621d700  5 osd.1 pg_epoch: 13024 pg[28.11(
>>    > empty local-lis/les=13015/13016 n=0 ec=1530
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9246>
>>    > 2020-12-13T02:35:44.018+0000 7fafb621d700  1 osd.1 pg_epoch: 13024 pg[28.11(
>>    > empty local-lis/les=13015/13016 n=0 ec=1530
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9245>
>>    > 2020-12-13T02:35:44.018+0000 7fafb621d700  1 osd.1 pg_epoch: 13026 pg[28.11(
>>    > empty local-lis/les=13015/13016 n=0 ec=1530
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9244>
>>    > 2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 pg[19.69s2(
>>    > v 3437'1753192 (3437'1753192,3437'1753192
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9243>
>>    > 2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 pg[19.69s2(
>>    > v 3437'1753192 (3437'1753192,3437'1753192
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9242>
>>    > 2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 pg[19.69s2(
>>    > v 3437'1753192 (3437'1753192,3437'1753192
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9241>
>>    > 2020-12-13T02:35:44.022+0000 7fafb721f700  1 osd.1 pg_epoch: 13143 pg[19.69s2(
>>    > v 3437'1753192 (3437'1753192,3437'1753192
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9240>
>>    > 2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 pg[19.69s2(
>>    > v 3437'1753192 (3437'1753192,3437'1753192
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9239>
>>    > 2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 pg[19.69s2(
>>    > v 3437'1753192 (3437'1753192,3437'1753192
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9238>
>>    > 2020-12-13T02:35:44.022+0000 7fafb521b700  5 osd.1 pg_epoch: 13143 pg[19.3bs10(
>>    > v 3437'1759161 (3437'1759161,3437'175916
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9237>
>>    > 2020-12-13T02:35:44.022+0000 7fafb521b700  5 osd.1 pg_epoch: 13143 pg[19.3bs10(
>>    > v 3437'1759161 (3437'1759161,3437'175916
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9236>
>>    > 2020-12-13T02:35:44.022+0000 7fafb521b700  5 osd.1 pg_epoch: 13143 pg[19.3bs10(
>>    > v 3437'1759161 (3437'1759161,3437'175916
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9235>
>>    > 2020-12-13T02:35:44.022+0000 7fafb521b700  1 osd.1 pg_epoch: 13143 pg[19.3bs10(
>>    > v 3437'1759161 (3437'1759161,3437'175916
>>    >
>>    > And this is where it crashes:
>>    >
>>    > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug  -9232>
>>    > 2020-12-13T02:35:44.022+0000 7fafd02c1700  0 log_channel(cluster) log [DBG] :
>>    > purged_snaps scrub starts
>>    > Dec 12 21:38:57 ceph-tpa-server1 systemd[1]:
>>    > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Main process exited,
>>    > code=exited, status=1/FAILURE
>>    > Dec 12 21:38:59 ceph-tpa-server1 systemd[1]:
>>    > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Failed with result
>>    > 'exit-code'.
>>    > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
>>    > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Service hold-off time
>>    > over, scheduling restart.
>>    > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]:
>>    > ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Scheduled restart job,
>>    > restart counter is at 1.
>>    > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph osd.1 for
>>    > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>>    > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph osd.1 for
>>    > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142...
>>    > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph osd.1 for
>>    > 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
>>    >
>>    > Hope that helps…
>>    >
>>    >
>>    > Thanks,
>>    > Stefan
>>    >
>>    >
>>    > From: Stefan Wild <swild@xxxxxxxxxxxxx>
>>    > Date: Saturday, December 12, 2020 at 9:35 PM
>>    > To: "ceph-users@xxxxxxx" <ceph-users@xxxxxxx>
>>    > Subject: OSD reboot loop after running out of memory
>>    >
>>    > Hi,
>>    >
>>    > We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one
>>    > of the servers ran out of memory for unknown reasons (normally the machine uses
>>    > about 60 out of 128 GB). Since then, some OSDs on that machine get caught in an
>>    > endless restart loop. The logs just mention the system seeing the daemon fail and
>>    > then restarting it. Since the out-of-memory incident, we've had 3 OSDs fail
>>    > this way at separate times. We resorted to wiping the affected OSD and
>>    > re-adding it to the cluster, but it seems as soon as all PGs have moved to the
>>    > OSD, the next one fails.
>>    >
>>    > This is also keeping us from re-deploying RGW, which was affected by the same
>>    > out of memory incident, since cephadm runs a check and won’t deploy the service
>>    > unless the cluster is in HEALTH_OK status.
>>    >
>>    > Any help would be greatly appreciated.
>>    >
>>    > Thanks,
>>    > Stefan
>>    >
>>    > _______________________________________________
>>    > ceph-users mailing list -- ceph-users@xxxxxxx
>>    > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>    _______________________________________________
>>    ceph-users mailing list -- ceph-users@xxxxxxx
>>    To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



