Re: Antw: Re: reliable monitor restarts

Wido den Hollander <wido@xxxxxxxx> · Tue, 25 Oct 2016 18:55:24 +0200 (CEST)

> On 25 October 2016 at 18:24, Steffen Weißgerber <WeissgerberS@xxxxxxx> wrote:
> 
> 
> Hi,
> 
> thank you for answering.
> 
> 
> >>> Wes Dillingham <wes_dillingham@xxxxxxxxxxx> wrote on Monday, 24 October 2016 at 17:31:
> > What do the logs of the monitor service say? Increase their verbosity
> > and check the logs at the time of the crash. Are you doing any sort of
> > monitoring on the nodes such that you can forensically check what the
> > system was up to prior to the crash?
> > 
> 
> I'll do this. With normal logging, the only thing logged is that a new
> election is initiated.
> 
> At the moment, the system disk of one monitor host is read-only due to
> disk failure (a buggy SATA DOM, which we will replace).
> 

Warning! Although Monitors do not require much storage or performance, they DO require RELIABLE storage. A SATADOM is NOT reliable. Sorry for the caps, but I'm trying to prevent a disaster here.

Please buy a datacenter-grade SSD like the Intel S3710 or Samsung SM863 for your Monitors. If the storage underneath them starts to fail, you have a serious problem. If you lose all your monitors, you effectively lose your cluster.

Wido

> So the remaining two monitors do the job.
> 
> > As others have said, systemd can handle this via unit files; in fact,
> > this is set up for you when installing ceph (at least in version 10.x /
> > jewel). Which version of Ceph are you running?
> > 
> 
> Our installation started with Firefly some 2 years ago. At the moment
> some default configuration should be active, because we never configured
> anything like this. We only installed system and ceph updates/upgrades.
> 
> > Also, as others have stated, the MON service is very reliable and should
> > not be crashing; we have had zero crashes of the mon service in 1.5 years
> > of running. Something is afoot.
> > 
> 
> Yes, I fully agree. But the situation changed slightly with hammer. The
> monitors died sporadically
> when running ceph/rbd commands.
> 
> This was never really problematic (more annoying).
> 
> > Also, configuration management platforms can ensure daemons remain
> > running as well, but this is belt and suspenders with systemd.
> > 
> 
> I'll check what's possible with those unit files and also increase the
> log level to find the source of
> the problem.
> 
> I have been on vacation for the last few days and will be back at the office
> tomorrow.
> 
> Thank you for your help.
> 
> Regards
> 
> Steffen
> 
> > On Sat, Oct 22, 2016 at 6:57 AM, Ruben Kerkhof <ruben@xxxxxxxxxxxxxxxx> wrote:
> >> On Fri, Oct 21, 2016 at 9:31 PM, Steffen Weißgerber
> >> <weissgerbers@xxxxxxx> wrote:
> >>> Hello,
> >>>
> >>> we're running a 6 node ceph cluster with 3 mons on Ubuntu (14.04.4).
> >>>
> >>> Sometimes it happens that the mon services die and have to be restarted
> >>> manually.
> >>>
> >>> To have reliable service restarts I normally use D.J. Bernstein's daemontools
> >>> on other Linux distributions. Until now I never did this on Ubuntu.
> >>>
> >>> Is there a comparable way to configure such a watcher on services on Ubuntu
> >>> (i.e. under systemd)?
> >>
> >> Systemd handles this for you.
> >> The ceph-mon unit file has:
> >>
> >> Restart=on-failure
> >> StartLimitInterval=30min
> >> StartLimitBurst=3
> >>
> >> Note that systemd only restarts it 3 times in 30 minutes. If it fails
> >> more often, you'll have to reset the unit.
> >>
> >> You can fine-tune this with drop-ins, see systemd.service(5) for details.
> >>
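
A minimal sketch of the drop-in approach mentioned above, in case it helps: the fragment below overrides the restart settings of the ceph-mon unit. It assumes a systemd-based host with the templated ceph-mon@.service unit from the jewel packages. The file name restart.conf, the RestartSec line, and all values are illustrative assumptions, not recommendations. (Ubuntu 14.04 boots with upstart by default, so this would only apply on a systemd-based release.)

```ini
# /etc/systemd/system/ceph-mon@.service.d/restart.conf
# (hypothetical file name; any *.conf in this directory is read as a drop-in)
[Service]
# Restart the monitor whenever it exits uncleanly.
Restart=on-failure
# Wait a little before each restart attempt (illustrative value).
RestartSec=10
# Allow up to 5 starts within 30 minutes before systemd gives up
# (illustrative values; the shipped unit quoted above uses 3 in 30min).
# Newer systemd releases document this as StartLimitIntervalSec= in [Unit].
StartLimitInterval=30min
StartLimitBurst=5
```

After creating the file, `systemctl daemon-reload` makes systemd pick it up. Once the burst limit has been hit, `systemctl reset-failed ceph-mon@<id>` clears the failed state so the unit can be started again (<id> is a placeholder for the monitor ID).
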
> >>>
> >>> Regards and have a nice weekend.
> >>>
> >>> Steffen
> >>
> >> Kind regards,
> >>
> >> Ruben Kerkhof
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx 
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> > 
> > 
> > -- 
> > Respectfully,
> > 
> > Wes Dillingham
> > wes_dillingham@xxxxxxxxxxx 
> > Research Computing | Infrastructure Engineer
> > Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
> 
> -- 
> Klinik-Service Neubrandenburg GmbH
> Allendestr. 30, 17036 Neubrandenburg
> Amtsgericht Neubrandenburg, HRB 2457
> Geschaeftsfuehrerin: Gudrun Kappich