Antw: Re: reliable monitor restarts

"Steffen Weißgerber" <WeissgerberS@xxxxxxx> · Tue, 25 Oct 2016 18:24:49 +0200

Hi,

thank you for answering.

>>> Wes Dillingham <wes_dillingham@xxxxxxxxxxx> wrote on Monday,
24 October 2016 at 17:31:
> What do the logs of the monitor service say? Increase their verbosity
> and check the logs at the time of the crash. Are you doing any sort of
> monitoring on the nodes such that you can forensically check what the
> system was up to prior to the crash?
> 

I'll do this. At the normal log level, the monitors only log that a new
election has been initiated.
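
As a rough sketch of what raising the verbosity could look like, assuming
the usual Ceph debug options in ceph.conf on the monitor hosts (the levels
are illustrative, not values anyone in this thread recommended):

```ini
# ceph.conf on each monitor host (illustrative levels, an assumption)
[mon]
    # monitor subsystem logging
    debug mon = 10
    # paxos/election logging
    debug paxos = 10
    # messenger logging
    debug ms = 1
```

The monitors would need a restart to pick this up, and the log volume
grows noticeably, so it is probably best reverted once a crash has been
captured.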

At the moment we are in the situation that the system disk of one
monitor host is read-only due to a disk failure (a buggy SATA DOM,
which we will replace).

So the two remaining monitors are doing the job.

> As others have said, systemd can handle this via unit files; in fact,
> this is set up for you when installing Ceph (at least in version
> 10.x / jewel). Which version of Ceph are you running?
> 

Our installation started with Firefly about two years ago. At the moment
some default configuration should be active, because we never configured
anything like this ourselves. We have only installed system and Ceph
updates/upgrades.

> Also, as others have stated, the MON service is very reliable and
> should not be crashing; we have had zero crashes of the mon service in
> 1.5 years of running. Something is afoot.
> 

Yes, I fully agree. But the situation changed slightly with Hammer: the
monitors died sporadically when running ceph/rbd commands.

This was never a real problem, more of an annoyance.

> Also, configuration management platforms can ensure daemons remain
> running as well, but this is belt and suspenders with systemd.
> 

I'll check what's possible with those unit files and also increase the
log level to find the source of the problem.
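
For reference, a drop-in along the lines Ruben describes below might look
like the following. This is only a sketch: it assumes the host actually
runs systemd (Ubuntu 14.04 uses Upstart by default), that the unit is
named ceph-mon@.service, and the file name and values are made up for
illustration.

```ini
# /etc/systemd/system/ceph-mon@.service.d/restart.conf
# (hypothetical file name; unit name and values are assumptions)
[Service]
# restart the monitor whenever it exits abnormally
Restart=on-failure
# wait a little before each restart attempt
RestartSec=10s
# allow up to 5 restarts per 30 minutes before systemd gives up
StartLimitInterval=30min
StartLimitBurst=5
```

After adding such a file, "systemctl daemon-reload" makes systemd re-read
it, and "systemctl reset-failed" on the affected unit clears the start
limit once it has been hit.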

I was on vacation for the last few days and will be back at the office
tomorrow.

Thank you for your help.

Regards

Steffen

> On Sat, Oct 22, 2016 at 6:57 AM, Ruben Kerkhof <ruben@xxxxxxxxxxxxxxxx> wrote:
>> On Fri, Oct 21, 2016 at 9:31 PM, Steffen Weißgerber
>> <weissgerbers@xxxxxxx> wrote:
>>> Hello,
>>>
>>> we're running a 6-node Ceph cluster with 3 mons on Ubuntu (14.04.4).
>>>
>>> Sometimes it happens that the mon services die and have to be
>>> restarted manually.
>>>
>>> To have reliable service restarts I normally use D. J. Bernstein's
>>> daemontools on other Linux distributions. Until now I never did this
>>> on Ubuntu.
>>>
>>> Is there a comparable way to configure such a watcher for services
>>> on Ubuntu (i.e. under systemd)?
>>
>> Systemd handles this for you.
>> The ceph-mon unit file has:
>>
>> Restart=on-failure
>> StartLimitInterval=30min
>> StartLimitBurst=3
>>
>> Note that systemd only restarts it 3 times in 30 minutes. If it fails
>> more often, you'll have to reset the unit.
>>
>> You can fine-tune this with drop-ins, see systemd.service(5) for
>> details.
>>
>>>
>>> Regards and have a nice weekend.
>>>
>>> Steffen
>>
>> Kind regards,
>>
>> Ruben Kerkhof
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> 
> -- 
> Respectfully,
> 
> Wes Dillingham
> wes_dillingham@xxxxxxxxxxx 
> Research Computing | Infrastructure Engineer
> Harvard University | 38 Oxford Street, Cambridge, MA 02138 | Room 210

-- 
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich