Hi Peter,

We fixed this bug: https://tracker.ceph.com/issues/47738 recently here:
https://github.com/ceph/ceph/commit/b4316d257e928b3789b818054927c2e98bb3c0d6
which should hopefully be in the next release(s).

David

On Thu, Jun 17, 2021 at 12:13 PM Peter Childs <pchilds@xxxxxxx> wrote:
>
> Found the issue in the end. I'd managed to kill the autoscaling feature by
> playing with pgp_num and pg_num, and it was getting confusing. I fixed it
> in the end by reducing pg_num on some of my test pools, and the manager
> woke up and started working again.
>
> It was not clear what I'd done to kill it, but once I'd figured out what
> was crashing, it was possible to work out what would help.
>
> So I've just learnt: don't play with pgp_num and pg_num; let the
> autoscaling feature just work. Setting the target size or ratio is
> probably better.
>
> I like Ceph. It's very different to Spectrum Scale, which I've used for
> years, but for now it's different tools to resolve different issues.
>
> Must get around to doing something with what I've learnt so far.
>
> Peter
>
> On Thu, 17 Jun 2021 at 17:53, Eugen Block <eblock@xxxxxx> wrote:
>
> > Hi,
> >
> > don't give up on Ceph. ;-)
> >
> > Did you try any of the steps from the troubleshooting section [1] to
> > gather some events and logs? Could you share them, and maybe also some
> > more details about that cluster? Did you enable any non-default mgr
> > modules? There have been a couple of reports related to mgr modules.
> >
> > Regards
> > Eugen
> >
> > [1] https://docs.ceph.com/en/latest/cephadm/troubleshooting/
> >
> >
> > Zitat von Peter Childs <pchilds@xxxxxxx>:
> >
> > > Let's try to stop this message turning into a mass moaning session
> > > about Ceph and try to get this newbie able to use it.
> > >
> > > I've got a Ceph Octopus cluster; it's relatively new and was deployed
> > > using cephadm.
> > >
> > > It was working fine, but now the managers start up, run for about 30
> > > seconds and then die, until systemctl gives up and I have to
> > > reset-fail them to get them to try again, only for them to fail once
> > > more.
> > >
> > > How do I work out why, and get them working again?
> > >
> > > I've got 21 nodes and was looking to take it up to 32 over the next
> > > few weeks, but that is going to be difficult if the managers are not
> > > working.
> > >
> > > I did try Pacific, and I'm happy to upgrade, but it failed to deploy
> > > more than 6 OSDs, so I gave up and went back to Octopus.
> > >
> > > I'm about to give up on Ceph because it looks really, really
> > > "fragile", and debugging what's going wrong is really difficult.
> > >
> > > I guess I could give up on cephadm and go with a different
> > > provisioning method, but I'm not sure where to start on that.
> > >
> > > Thanks in advance.
> > >
> > > Peter.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
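
For anyone landing on this thread with the same autoscaler confusion, the
commands Peter is alluding to look roughly like the sketch below (the pool
name "mypool" and the example sizes are placeholders, not from the thread;
check autoscale-status output before changing anything):

    # See what the pg_autoscaler thinks each pool should have
    ceph osd pool autoscale-status

    # Hand PG counts back to the autoscaler instead of setting
    # pg_num/pgp_num by hand
    ceph osd pool set mypool pg_autoscale_mode on

    # Give it a hint, as Peter suggests: either the expected eventual
    # size of the pool...
    ceph osd pool set mypool target_size_bytes 100T

    # ...or the fraction of the cluster it is expected to consume
    ceph osd pool set mypool target_size_ratio 0.2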
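
Likewise, the mgr debugging that Eugen's troubleshooting link describes
boils down to commands along these lines (the hostname "host1", the daemon
suffix and <fsid> are placeholders; run the cephadm and journalctl commands
on the node hosting the failing mgr):

    # From any node with a working ceph CLI
    ceph status
    ceph mgr module ls          # spot any non-default mgr modules
    ceph crash ls               # recent daemon crashes; then: ceph crash info <id>
    ceph log last cephadm       # recent cephadm events

    # On the host running the failing mgr
    cephadm ls                  # find the exact daemon name, e.g. mgr.host1.abcdef
    cephadm logs --name mgr.host1.abcdef
    journalctl -u ceph-<fsid>@mgr.host1.abcdef.service

    # After fixing the cause, clear systemd's failed state and retry
    # (what Peter calls "reset-fail")
    systemctl reset-failed ceph-<fsid>@mgr.host1.abcdef.service
    systemctl start ceph-<fsid>@mgr.host1.abcdef.service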