Re: Ceph monitors 100% full filesystem, refusing start

On 01/21/2016 09:15 AM, Dan van der Ster wrote:
> On Wed, Jan 20, 2016 at 8:01 PM, Zoltan Arnold Nagy
> <zoltan@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> Wouldn’t actually blowing away the other monitors then recreating them from scratch solve the issue?
>>
>> Never done this, just thinking out loud. It would grab the osdmap and everything from the other monitor and form a quorum, wouldn’t it?
>>
> 
> Recreating monitors works as long as the others can form a quorum.
> I've done this many many times.
> 
> In Wido's case he might have been able to solve this by removing the
> broken mons from the cluster until the one remaining formed a quorum
> with itself, then slowly adding the other mons back.
> 

So, this was a long exercise, but it's alive again! It seems almost
impossible to kill Ceph :)

This is an 1800-OSD, 180-node cluster which suffered a major network outage.

The cluster is spread out over 6 racks and the core routing was hit,
causing all racks to lose connectivity to each other. Traffic kept working
inside each rack, but not between racks.

The monitors didn't trim any of their maps since they had lost quorum, so
their data filesystems all filled up to 100%.

Running this command cleared some space (just opening the store seems to
trigger a leveldb compaction as a side effect):

$ ceph-kvstore-tool /var/lib/ceph/mon/X/store.db list

You need roughly 30 MB of free space before it will run, but in my case
it then freed about 3 GB.
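
To see whether that actually reclaimed anything, the usual tools are
enough (the mon data path is just the placeholder from the command above):

$ df -h /var/lib/ceph/mon/X
$ du -sh /var/lib/ceph/mon/X/store.db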

The freed space was just enough to start the monitors. Of the 5 monitors
I was able to form a quorum with 3, so that was OK.
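
Checking which monitors actually made it into the quorum is just the
standard status commands:

$ ceph mon stat
$ ceph quorum_status --format json-pretty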

Many OSDs were also down, so I had to bring those up.
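
How to start them depends on the init system; roughly something like
this, with the OSD id just an example:

$ sudo start ceph-osd id=12           # upstart
$ sudo systemctl start ceph-osd@12    # systemd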

Recovery then started, but during recovery the maps on the monitors kept
growing and growing. My best guess is that this was because not all
monitors were in the quorum.

I eventually removed the 2 other monitors, so only the 3 which were able
to form a quorum were left.
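
With a quorum still present, removing a monitor is a single command (the
monitor names here are made up):

$ ceph mon remove mon4
$ ceph mon remove mon5

Without any quorum at all, the heavier route is extracting the monmap
with 'ceph-mon --extract-monmap', editing it with monmaptool and
injecting it back with '--inject-monmap'.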

To prevent even more changes in the cluster I also set the pause flag:

$ ceph osd set pause
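
The flag can be cleared again afterwards with:

$ ceph osd unset pause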

Things stabilized and I restarted the monitors one by one with 'compact
on start' enabled.
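
'Compact on start' is the mon_compact_on_start option; roughly, per
monitor node (the restart command depends on your init system and the
monitor id is just an example):

[mon]
    mon compact on start = true

$ sudo restart ceph-mon id=mon1          # upstart
$ sudo systemctl restart ceph-mon@mon1   # systemd

Once a monitor is up and in quorum you can also trigger a compaction on
the fly with 'ceph tell mon.mon1 compact'.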

Compacting on start shrank their data stores, which was good.

It took the cluster a few hours to become healthy again and for all PGs
to reach active+clean.
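
Watching the progress is just the usual:

$ ceph -s
$ ceph health detail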

In the meantime the cluster suffered a few disk failures, but in the end
it came back and no data was lost.

Wido

> --
> dan
> 
> 
> 
>>
>> On 20 Jan 2016, at 16:26, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> On 01/20/2016 04:22 PM, Zoltan Arnold Nagy wrote:
>>
>> Hi Wido,
>>
>> So one out of the 5 monitors is running fine then? Did that one have more space for its leveldb?
>>
>>
>> Yes. That was at 99% full and by cleaning some stuff in /var/cache and
>> /var/log I was able to start it.
>>
>> It compacted the leveldb database and is now at 1% disk usage.
>>
>> Looking at the ceph_mon.cc code:
>>
>> if (stats.avail_percent <= g_conf->mon_data_avail_crit) {
>>
>> Setting mon_data_avail_crit to 0 does not work, since 100% full equals
>> 0% free and the <= check above still triggers.
>>
>> There is ~300M free on the other 4 monitors. I just can't start the mon
>> and tell it to compact.
>>
>> Lesson learned here though: always make sure you have some additional
>> space you can clear when you need it.
>>
>> On 20 Jan 2016, at 16:15, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> Hello,
>>
>> I have an issue with a (not in production!) Ceph cluster which I'm
>> trying to resolve.
>>
>> On Friday the network links between the racks failed and this caused all
>> monitors to lose connectivity.
>>
>> Their leveldb stores kept growing and they are currently 100% full. They
>> all have only a few hundred MB left.
>>
>> Starting the monitors with 'compact on start' doesn't work since the FS
>> is 100% full:
>>
>> error: monitor data filesystem reached concerning levels of available
>> storage space (available: 0% 238 MB)
>> you may adjust 'mon data avail crit' to a lower value to make this go
>> away (default: 0%)
>>
>> One of the 5 monitors is now running, but that's not enough.
>>
>> Any ideas how to compact this leveldb? I can't free up any more space
>> right now on these systems. Getting bigger disks in is also going to
>> take a lot of time.
>>
>> Any tools outside the monitors to use here?
>>
>> Keep in mind, this is a pre-production cluster. We would like to keep
>> the cluster and fix this as a good exercise of stuff which could go
>> wrong. Dangerous tools are allowed!
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



