Re: Ceph monitors 100% full filesystem, refusing start

Wido den Hollander <wido@xxxxxxxx> · Wed, 20 Jan 2016 22:04:11 +0100

On 01/20/2016 08:01 PM, Zoltan Arnold Nagy wrote:
> Wouldn’t actually blowing away the other monitors then recreating them
> from scratch solve the issue?
> 
> Never done this, just thinking out loud. It would grab the osdmap and
> everything from the other monitor and form a quorum, wouldn’t it?
> 

Nope, those monitors will not have any historical OSDMaps which will be
required by OSDs which need to catch up with the cluster.

It might be possible technically by hacking a lot of stuff, but that
won't be easy.

I'm still busy with this btw. The monitors are in an electing state since
2 monitors are still synchronizing and one won't boot anymore :(
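
To illustrate why lowering mon_data_avail_crit to 0 does not help with the
check quoted below, here is a minimal sketch. It assumes avail_percent is an
integer truncated from avail/total, which is how I read the code but is not
confirmed here. The function names and the 40 GiB filesystem size are made up
for the example; only the <= comparison comes from ceph_mon.cc.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: integer percentage of free space, truncated toward
// zero. Assumption: the monitor derives avail_percent in a similar way.
int avail_percent(std::uint64_t avail_bytes, std::uint64_t total_bytes) {
    assert(total_bytes > 0);
    double ratio = static_cast<double>(avail_bytes) / static_cast<double>(total_bytes);
    return static_cast<int>(ratio * 100);
}

// Same comparison as the ceph_mon.cc line quoted below; the function name
// itself is hypothetical.
bool refuses_to_start(int avail_pct, int mon_data_avail_crit) {
    return avail_pct <= mon_data_avail_crit;
}
```

With 238 MB free on an assumed 40 GiB filesystem the percentage truncates to
0, and 0 <= 0 is true, so the monitor refuses to start even with the
threshold at its lowest value. A strict < comparison would let it through.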

>> On 20 Jan 2016, at 16:26, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> On 01/20/2016 04:22 PM, Zoltan Arnold Nagy wrote:
>>> Hi Wido,
>>>
>>> So one out of the 5 monitors is running fine then? Did that have
>>> more space for its leveldb?
>>>
>>
>> Yes. That was at 99% full and by cleaning some stuff in /var/cache and
>> /var/log I was able to start it.
>>
>> It compacted the leveldb database and is now at 1% disk usage.
>>
>> Looking at the ceph_mon.cc code:
>>
>> if (stats.avail_percent <= g_conf->mon_data_avail_crit) {
>>
>> Setting mon_data_avail_crit to 0 does not work since 100% full is equal
>> to 0% free, so the comparison 0 <= 0 still trips.
>>
>> There is ~300M free on the other 4 monitors. I just can't start the mon
>> and tell it to compact.
>>
>> Lesson learned here, though: always make sure you have some additional
>> space you can clear when you need it.
>>
>>>> On 20 Jan 2016, at 16:15, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have an issue with a (not in production!) Ceph cluster which I'm
>>>> trying to resolve.
>>>>
>>>> On Friday the network links between the racks failed and this caused all
>>>> monitors to lose connection.
>>>>
>>>> Their leveldb stores kept growing and the filesystems are now reported as
>>>> 100% full. They all have a few hundred MB left.
>>>>
>>>> Starting with 'compact on start' set doesn't work since the FS is 100%
>>>> full:
>>>>
>>>> error: monitor data filesystem reached concerning levels of
>>>> available storage space (available: 0% 238 MB)
>>>> you may adjust 'mon data avail crit' to a lower value to make this go
>>>> away (default: 0%)
>>>>
>>>> One of the 5 monitors is now running but that's not enough.
>>>>
>>>> Any ideas how to compact this leveldb? I can't free up any more space
>>>> right now on these systems. Getting bigger disks in is also going to
>>>> take a lot of time.
>>>>
>>>> Any tools outside the monitors to use here?
>>>>
>>>> Keep in mind, this is a pre-production cluster. We would like to keep
>>>> the cluster and fix this as a good exercise of stuff which could go
>>>> wrong. Dangerous tools are allowed!
>>>>
>>>> -- 
>>>> Wido den Hollander
>>>> 42on B.V.
>>>> Ceph trainer and consultant
>>>>
>>>> Phone: +31 (0)20 700 9902
>>>> Skype: contact42on
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>
>>
> 

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on