Re: Ceph monitors 100% full filesystem, refusing start

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Thu, 21 Jan 2016 09:15:41 +0100

On Wed, Jan 20, 2016 at 8:01 PM, Zoltan Arnold Nagy
<zoltan@xxxxxxxxxxxxxxxxxx> wrote:
>
> Wouldn’t actually blowing away the other monitors then recreating them from scratch solve the issue?
>
> Never done this, just thinking out loud. It would grab the osdmap and everything from the other monitor and form a quorum, wouldn’t it?
>

Recreating monitors works as long as the others can form a quorum.
I've done this many many times.

In Wido's case he might have been able to solve this by removing the
broken mons from the monmap until the one remaining mon formed a quorum
with itself, then slowly adding the other mons back.
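
For what it's worth, here is a rough sketch of that sequence as a dry run.
It only builds and prints the commands, it does not execute anything. The
mon IDs ("a" through "e") and the /tmp/monmap path are made up for
illustration; substitute your own. As far as I know the surviving mon has
to be stopped while its monmap is extracted and injected.

```python
import shlex


def shrink_monmap_plan(survivor, broken, monmap_path="/tmp/monmap"):
    """Return the commands that cut the monmap down to a single mon.

    survivor and broken are hypothetical mon IDs. Nothing is executed here.
    """
    # Dump the current monmap from the (stopped) surviving mon's store.
    cmds = [["ceph-mon", "-i", survivor, "--extract-monmap", monmap_path]]
    # Drop every mon that cannot start from the extracted map.
    for mon_id in broken:
        cmds.append(["monmaptool", monmap_path, "--rm", mon_id])
    # Put the edited map back, so the survivor is the only member.
    cmds.append(["ceph-mon", "-i", survivor, "--inject-monmap", monmap_path])
    return cmds


if __name__ == "__main__":
    for cmd in shrink_monmap_plan("a", ["b", "c", "d", "e"]):
        print(shlex.join(cmd))
```

Once the survivor is up with a quorum of one, the other mons can be
recreated and added back one at a time.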

--
dan

>
> On 20 Jan 2016, at 16:26, Wido den Hollander <wido@xxxxxxxx> wrote:
>
> On 01/20/2016 04:22 PM, Zoltan Arnold Nagy wrote:
>
> Hi Wido,
>
> So one of the 5 monitors is running fine then? Did that one have more space for its leveldb?
>
>
> Yes. That was at 99% full and by cleaning some stuff in /var/cache and
> /var/log I was able to start it.
>
> It compacted the levelDB database and is now on 1% disk usage.
>
> Looking at the ceph_mon.cc code:
>
> if (stats.avail_percent <= g_conf->mon_data_avail_crit) {
>
> Setting mon_data_avail_crit to 0 does not work, since 100% full equals
> 0% free and the check uses <=, so it still trips.
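
To spell out why a threshold of 0 still blocks startup, here is a small
model of that comparison. It is only a sketch: I am assuming the available
percentage is an integer that rounds down (the "0% 238 MB" in the error
suggests so), and the 50 GiB filesystem size is a made-up number, since the
real size isn't given in this thread.

```python
MIB = 1024 ** 2


def avail_percent(byte_avail, byte_total):
    """Integer percentage of free space, rounded down (assumed behaviour)."""
    return byte_avail * 100 // byte_total


def refuses_to_start(avail_pct, mon_data_avail_crit):
    """Mirror of the quoted check: avail_percent <= mon_data_avail_crit."""
    return avail_pct <= mon_data_avail_crit


if __name__ == "__main__":
    # 238 MB free on a hypothetical 50 GiB filesystem rounds down to 0%.
    pct = avail_percent(238 * MIB, 50 * 1024 * MIB)
    print(pct, refuses_to_start(pct, 0))
```

With the comparison being <= rather than <, 0% available against a
threshold of 0 is still treated as critical.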
>
> There is ~300M free on the other 4 monitors. I just can't start the mon
> and tell it to compact.
>
> Lesson learned here though: always make sure you have some additional
> space you can clear when you need it.
>
> On 20 Jan 2016, at 16:15, Wido den Hollander <wido@xxxxxxxx> wrote:
>
> Hello,
>
> I have an issue with a (not in production!) Ceph cluster which I'm
> trying to resolve.
>
> On Friday the network links between the racks failed and this caused all
> monitors to lose connection.
>
> Their leveldb stores kept growing and they are currently 100% full. They
> all have a few hundred MB left.
>
> Starting with 'compact on start' doesn't work since the FS is 100%
> full:
>
> error: monitor data filesystem reached concerning levels of
> available storage space (available: 0% 238 MB)
> you may adjust 'mon data avail crit' to a lower value to make this go
> away (default: 0%)
>
> One of the 5 monitors is now running, but that's not enough.
>
> Any ideas how to compact this leveldb? I can't free up any more space
> right now on these systems. Getting bigger disks in is also going to
> take a lot of time.
>
> Any tools outside the monitors to use here?
>
> Keep in mind, this is a pre-production cluster. We would like to keep
> the cluster and fix this as a good exercise of stuff which could go
> wrong. Dangerous tools are allowed!
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com