During normal operation the mon store size is under 1G. After the network
ordeal it was 65G. I gave the last mon all the disk space I could find under
/var/lib/ceph and started the mon again. It is now reaching 90G and
still growing.
Does anyone have an idea how much free disk space would be needed to get the
job done?
Any other strategies to get the cluster going again?
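
One thing I am considering, in case I simply cannot free up enough space in
place, is to stop the mon and bind-mount its data directory from a larger
volume so the sync can finish there. Roughly like this (paths and mon id are
only an example, adjust for your setup):

    systemctl stop ceph-mon@mon3
    mkdir -p /mnt/bigdisk/mon-data
    # copy the existing mon data to the larger volume
    rsync -a /var/lib/ceph/mon/ceph-mon3/ /mnt/bigdisk/mon-data/
    # make the larger volume appear at the usual mon data path
    mount --bind /mnt/bigdisk/mon-data /var/lib/ceph/mon/ceph-mon3
    systemctl start ceph-mon@mon3   # let it resume syncing with more headroom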
Marc wrote on 2021-08-31 20:02:
Could someone also explain the logic behind the decision to dump so
much data to disk? Especially in container environments with
resource limits this is not really nice.
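
As far as I understand it (so treat this as a guess, not an authoritative
answer), the mons keep every osdmap epoch and only trim the history once PGs
are back to active+clean, so a long network outage can make the store balloon
like this. Once there is quorum again, the untrimmed range should be visible
with something like:

    # a large gap between osdmap_first_committed and osdmap_last_committed
    # means a lot of untrimmed osdmaps are sitting in the mon store
    ceph report 2>/dev/null | grep committed

There is also a mon_compact_on_start option that at least compacts the store
every time the mon starts.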
-----Original Message-----
Sent: Tuesday, 31 August 2021 19:16
To: ceph-users@xxxxxxx
Subject: nautilus cluster down by loss of 2 mons
Hi
We have a Nautilus cluster that was plagued by a network failure. One of
the monitors fell out of quorum.
Once the network settled down and all OSDs were back online again, we got
that mon synchronizing.
However, the filesystem suddenly exploded in a minute or so from 63G
usage to 93G, resulting in 100% usage.
At that point we decided to remove that mon from the cluster, hoping to
compact the database on the remaining mons so that we could add a new mon
while there was less synchronizing to do because of the smaller database
size.
Unfortunately the tell osd compact command made the database on mon nr 2
grow very fast, resulting in another full filesystem and hence a dead
cluster.
Can anyone advise on the fastest recovery in this situation?
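
One thing I am wondering about (just an idea, not something I have verified):
since online compaction apparently needs extra headroom while it rewrites the
store, would an offline compaction with the mon stopped be safer on an almost
full filesystem? Something like this (mon id is only an example):

    systemctl stop ceph-mon@mon2
    # compact the mon's rocksdb store in place; this still needs some free
    # space for the rewritten sst files, so check df first
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-mon2/store.db compact
    systemctl start ceph-mon@mon2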
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx