Re: OSD reboot loop after running out of memory

Igor Fedotov <ifedotov@xxxxxxx> · Mon, 14 Dec 2020 14:44:07 +0300

Just a note - all the below is almost completely unrelated to high RAM 
usage. The latter is a different issue which presumably just triggered 
PG removal one...

On 12/14/2020 2:39 PM, Igor Fedotov wrote:
Hi Stefan,

given the crash backtrace in your log I presume some data removal is 
in progress:

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  3: 
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long, 
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  4: 
(KernelDevice::read_random(unsigned long, unsigned long, char*, 
bool)+0x1b3) [0x5587b93653e3]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  5: 
(BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned 
long, char*)+0x674) [0x5587b9328cb4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  19: 
(BlueStore::_do_omap_clear(BlueStore::TransContext*, 
boost::intrusive_ptr<BlueStore::Onode>&)+0xa2) [0x5587b922f0e2]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  20: 
(BlueStore::_do_remove(BlueStore::TransContext*, 
boost::intrusive_ptr<BlueStore::Collection>&, 
boost::intrusive_ptr<BlueStore::Onode>)+0xc65) [0x5587b923b555]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  21: 
(BlueStore::_remove(BlueStore::TransContext*, 
boost::intrusive_ptr<BlueStore::Collection>&, 
boost::intrusive_ptr<BlueStore::Onode>&)+0x64) [0x5587b923c3b4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  24: 
(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, 
ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, 
ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  25: 
(PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  26: 
(PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e) 
[0x5587b8fd6ede]
...

Did you initiate some large pool removal recently? Or may be data 
rebalancing triggered PG migration (and hence source PG removal) for you?

Highly likely you're facing a well known issue with RocksDB/BlueFS 
performance issues caused by massive data removal.

So your OSDs are just processing I/O very slowly which triggers 
suicide timeout.

We've had multiple threads on the issue in this mailing list - the 
latest one is at 
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YBHNOSWW72ZVQ6PD5NABEEYRDMX7OZTT/

For now the good enough workaround is manual offline DB compaction for 
all the OSDs (this might have temporary effect though as the removal 
proceeds).

Additionally there are users' reports that recent default value's 
modification  for bluefs_buffered_io setting has negative impact (or 
just worsen existing issue with massive removal) as well. So you might 
want to switch it back to true.

As for OSD.10 - can't say for sure as I haven't seen its' logs but I 
think it's experiencing the same issue which might eventually lead it 
into unresponsive state as well. Just grep its log for "heartbeat_map 
is_healthy 'OSD::osd_op_tp thread" strings.

Thanks,

Igor

On 12/13/2020 3:46 PM, Stefan Wild wrote:
Hi Igor,

Full osd logs from startup to failed exit:
https://tiltworks.com/osd.1.log

In other news, can I expect osd.10 to go down next?

Dec 13 07:40:14 ceph-tpa-server1 bash[1825010]: debug 
2020-12-13T12:40:14.823+0000 7ff37c2e1700 -1 osd.7 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:43.310905+0000 front 2020-12-13T12:39:43.311164+0000 
(oldest deadline 2020-12-13T12:40:06.810981+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[1824817]: debug 
2020-12-13T12:40:15.055+0000 7f9220af3700 -1 osd.11 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:42.972558+0000 front 2020-12-13T12:39:42.972702+0000 
(oldest deadline 2020-12-13T12:40:05.272435+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[2060428]: debug 
2020-12-13T12:40:15.155+0000 7fb247eaf700 -1 osd.8 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:42.181904+0000 front 2020-12-13T12:39:42.181856+0000 
(oldest deadline 2020-12-13T12:40:06.281648+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:15.171+0000 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.2 is reporting failure:0
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:15.171+0000 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.2
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:15.176057+0000 mon.ceph-tpa-server1 (mon.0) 1172513 
: cluster [DBG] osd.10 failure report canceled by osd.2
Dec 13 07:40:15 ceph-tpa-server1 bash[1824779]: debug 
2020-12-13T12:40:15.295+0000 7fa60679a700 -1 osd.0 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:43.326792+0000 front 2020-12-13T12:39:43.326666+0000 
(oldest deadline 2020-12-13T12:40:07.426786+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:15.423+0000 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.6 is reporting failure:0
Dec 13 07:40:15 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:15.423+0000 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.6
Dec 13 07:40:15 ceph-tpa-server1 bash[1824845]: debug 
2020-12-13T12:40:15.447+0000 7f85048db700 -1 osd.3 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:39.770822+0000 front 2020-12-13T12:39:39.770700+0000 
(oldest deadline 2020-12-13T12:40:05.070662+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[231499]: debug 
2020-12-13T12:40:15.687+0000 7fa8e1800700 -1 osd.4 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:39.977106+0000 front 2020-12-13T12:39:39.977176+0000 
(oldest deadline 2020-12-13T12:40:04.677320+0000)
Dec 13 07:40:15 ceph-tpa-server1 bash[1825010]: debug 
2020-12-13T12:40:15.799+0000 7ff37c2e1700 -1 osd.7 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:43.310905+0000 front 2020-12-13T12:39:43.311164+0000 
(oldest deadline 2020-12-13T12:40:06.810981+0000)
Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug 
2020-12-13T12:40:16.019+0000 7f9220af3700 -1 osd.11 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:42.972558+0000 front 2020-12-13T12:39:42.972702+0000 
(oldest deadline 2020-12-13T12:40:05.272435+0000)
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:16.179+0000 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.4 is reporting failure:0
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:16.179+0000 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.4
Dec 13 07:40:16 ceph-tpa-server1 bash[2060428]: debug 
2020-12-13T12:40:16.191+0000 7fb247eaf700 -1 osd.8 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:42.181904+0000 front 2020-12-13T12:39:42.181856+0000 
(oldest deadline 2020-12-13T12:40:06.281648+0000)
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:15.429755+0000 mon.ceph-tpa-server1 (mon.0) 1172514 
: cluster [DBG] osd.10 failure report canceled by osd.6
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:16.183521+0000 mon.ceph-tpa-server1 (mon.0) 1172515 
: cluster [DBG] osd.10 failure report canceled by osd.4
Dec 13 07:40:16 ceph-tpa-server1 bash[1824779]: debug 
2020-12-13T12:40:16.303+0000 7fa60679a700 -1 osd.0 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:43.326792+0000 front 2020-12-13T12:39:43.326666+0000 
(oldest deadline 2020-12-13T12:40:07.426786+0000)
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:16.371+0000 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.3 is reporting failure:0
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:16.371+0000 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.3
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:16.611+0000 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.7 is reporting failure:0
Dec 13 07:40:16 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:16.611+0000 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.7
Dec 13 07:40:16 ceph-tpa-server1 bash[1824817]: debug 
2020-12-13T12:40:16.979+0000 7f9220af3700 -1 osd.11 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:42.972558+0000 front 2020-12-13T12:39:42.972702+0000 
(oldest deadline 2020-12-13T12:40:05.272435+0000)
Dec 13 07:40:17 ceph-tpa-server1 bash[1824779]: debug 
2020-12-13T12:40:17.271+0000 7fa60679a700 -1 osd.0 13375 
heartbeat_check: no reply from 172.18.189.20:6878 osd.10 since back 
2020-12-13T12:39:43.326792+0000 front 2020-12-13T12:39:43.326666+0000 
(oldest deadline 2020-12-13T12:40:07.426786+0000)
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:16.378213+0000 mon.ceph-tpa-server1 (mon.0) 1172516 
: cluster [DBG] osd.10 failure report canceled by osd.3
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:16.616685+0000 mon.ceph-tpa-server1 (mon.0) 1172517 
: cluster [DBG] osd.10 failure report canceled by osd.7
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:17.727+0000 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.0 is reporting failure:0
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:17.727+0000 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.0
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:17.839+0000 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.5 is reporting failure:0
Dec 13 07:40:17 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:17.839+0000 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.5
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:17.733200+0000 mon.ceph-tpa-server1 (mon.0) 1172518 
: cluster [DBG] osd.10 failure report canceled by osd.0
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:17.843775+0000 mon.ceph-tpa-server1 (mon.0) 1172519 
: cluster [DBG] osd.10 failure report canceled by osd.5
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:18.575+0000 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.11 is reporting failure:0
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:18.575+0000 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.11
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:18.783+0000 7fe929be8700  1 
mon.ceph-tpa-server1@0(leader).osd e13375 prepare_failure osd.10 
[v2:172.18.189.20:6872/2139598710,v1:172.18.189.20:6873/2139598710] 
from osd.8 is reporting failure:0
Dec 13 07:40:18 ceph-tpa-server1 bash[1822497]: debug 
2020-12-13T12:40:18.783+0000 7fe929be8700  0 log_channel(cluster) log 
[DBG] : osd.10 failure report canceled by osd.8
Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:18.578914+0000 mon.ceph-tpa-server1 (mon.0) 1172520 
: cluster [DBG] osd.10 failure report canceled by osd.11
Dec 13 07:40:19 ceph-tpa-server1 bash[1822497]: cluster 
2020-12-13T12:40:18.789301+0000 mon.ceph-tpa-server1 (mon.0) 1172521 
: cluster [DBG] osd.10 failure report canceled by osd.8

Thanks,
Stefan

On 12/13/20, 2:18 AM, "Igor Fedotov" <ifedotov@xxxxxxx> wrote:

     Hi Stefan,

     could you please share OSD startup log from /var/log/ceph?

     Thanks,

     Igor

     On 12/13/2020 5:44 AM, Stefan Wild wrote:
     > Just had another look at the logs and this is what I did 
notice after the affected OSD starts up.
     >
     > Loads of entries of this sort:
     >
     > Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug 
2020-12-13T02:38:40.851+0000 7fafd32c7700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fafb721f700' had timed out after 15
     >
     > Then a few pages of this:
     >
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9249> 
2020-12-13T02:35:44.018+0000 7fafb621d700  5 osd.1 pg_epoch: 13024 
pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9248> 
2020-12-13T02:35:44.018+0000 7fafb621d700  5 osd.1 pg_epoch: 13024 
pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9247> 
2020-12-13T02:35:44.018+0000 7fafb621d700  5 osd.1 pg_epoch: 13024 
pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9246> 
2020-12-13T02:35:44.018+0000 7fafb621d700  1 osd.1 pg_epoch: 13024 
pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9245> 
2020-12-13T02:35:44.018+0000 7fafb621d700  1 osd.1 pg_epoch: 13026 
pg[28.11( empty local-lis/les=13015/13016 n=0 ec=1530
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9244> 
2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 
pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9243> 
2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 
pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9242> 
2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 
pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9241> 
2020-12-13T02:35:44.022+0000 7fafb721f700  1 osd.1 pg_epoch: 13143 
pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9240> 
2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 
pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9239> 
2020-12-13T02:35:44.022+0000 7fafb721f700  5 osd.1 pg_epoch: 13143 
pg[19.69s2( v 3437'1753192 (3437'1753192,3437'1753192
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9238> 
2020-12-13T02:35:44.022+0000 7fafb521b700  5 osd.1 pg_epoch: 13143 
pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9237> 
2020-12-13T02:35:44.022+0000 7fafb521b700  5 osd.1 pg_epoch: 13143 
pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9236> 
2020-12-13T02:35:44.022+0000 7fafb521b700  5 osd.1 pg_epoch: 13143 
pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9235> 
2020-12-13T02:35:44.022+0000 7fafb521b700  1 osd.1 pg_epoch: 13143 
pg[19.3bs10( v 3437'1759161 (3437'1759161,3437'175916
     >
     > And this is where it crashes:
     >
     > Dec 12 21:38:56 ceph-tpa-server1 bash[780507]: debug -9232> 
2020-12-13T02:35:44.022+0000 7fafd02c1700  0 log_channel(cluster) log 
[DBG] : purged_snaps scrub starts
     > Dec 12 21:38:57 ceph-tpa-server1 systemd[1]: 
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Main process 
exited, code=exited, status=1/FAILURE
     > Dec 12 21:38:59 ceph-tpa-server1 systemd[1]: 
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Failed with 
result 'exit-code'.
     > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: 
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Service 
hold-off time over, scheduling restart.
     > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: 
ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service: Scheduled 
restart job, restart counter is at 1.
     > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Stopped Ceph 
osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
     > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Starting Ceph 
osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142...
     > Dec 12 21:39:09 ceph-tpa-server1 systemd[1]: Started Ceph 
osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.
     >
     > Hope that helps…
     >
     >
     > Thanks,
     > Stefan
     >
     >
     > From: Stefan Wild <swild@xxxxxxxxxxxxx>
     > Date: Saturday, December 12, 2020 at 9:35 PM
     > To: "ceph-users@xxxxxxx" <ceph-users@xxxxxxx>
     > Subject: OSD reboot loop after running out of memory
     >
     > Hi,
     >
     > We recently upgraded a cluster from 15.2.1 to 15.2.5. About 
two days later, one of the server ran out of memory for unknown 
reasons (normally the machine uses about 60 out of 128 GB). Since 
then, some OSDs on that machine get caught in an endless restart 
loop. Logs will just mention system seeing the daemon fail and then 
restarting it. Since the out of memory incident, we’ve have 3 OSDs 
fail this way at separate times. We resorted to wiping the affected 
OSD and re-adding it to the cluster, but it seems as soon as all PGs 
have moved to the OSD, the next one fails.
     >
     > This is also keeping us from re-deploying RGW, which was 
affected by the same out of memory incident, since cephadm runs a 
check and won’t deploy the service unless the cluster is in HEALTH_OK 
status.
     >
     > Any help would be greatly appreciated.
     >
     > Thanks,
     > Stefan
     >
     > _______________________________________________
     > ceph-users mailing list -- ceph-users@xxxxxxx
     > To unsubscribe send an email to ceph-users-leave@xxxxxxx
     _______________________________________________
     ceph-users mailing list -- ceph-users@xxxxxxx
     To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx