Hi Peter,

We fixed this bug: https://tracker.ceph.com/issues/47738 recently here:
https://github.com/ceph/ceph/commit/b4316d257e928b3789b818054927c2e98bb3c0d6
which should hopefully be in the next release(s).

David

On Thu, Jun 17, 2021 at 12:13 PM Peter Childs <pchilds@xxxxxxx> wrote:
>
> Found the issue in the end. I'd managed to kill the autoscaling feature by
> playing with pgp_num and pg_num, and it was getting confusing. I fixed it
> in the end by reducing pg_num on some of my test pools, and the manager
> woke up and started working again.
>
> It was not clear what I'd done to kill it, but once I'd figured out what
> was crashing, it was possible to work out what would help.
>
> So I've just learnt: don't play with pgp_num and pg_num; let the
> autoscaling feature just work. Setting the target size or ratio is
> probably better.
>
> I like Ceph. It's very different to Spectrum Scale, which I've used for
> years, but for now it's different tools to resolve different issues.
>
> Must get around to doing something with what I've learnt so far.
>
> Peter
>
> On Thu, 17 Jun 2021 at 17:53, Eugen Block <eblock@xxxxxx> wrote:
>
> > Hi,
> >
> > don't give up on Ceph. ;-)
> >
> > Did you try any of the steps from the troubleshooting section [1] to
> > gather some events and logs? Could you share them, and maybe also some
> > more details about that cluster? Did you enable any non-default mgr
> > modules? There have been a couple of reports related to mgr modules.
> >
> > Regards
> > Eugen
> >
> > [1] https://docs.ceph.com/en/latest/cephadm/troubleshooting/
> >
> >
> > Zitat von Peter Childs <pchilds@xxxxxxx>:
> >
> > > Let's try to stop this message turning into a mass moaning session
> > > about Ceph and try to get this newbie able to use it.
> > >
> > > I've got a Ceph Octopus cluster; it's relatively new and was deployed
> > > using cephadm.
> > >
> > > It was working fine, but now the managers start up, run for about 30
> > > seconds and then die, until systemctl gives up and I have to
> > > reset-fail them to get them to try again, only for them to fail once
> > > more.
> > >
> > > How do I work out why, and get them working again?
> > >
> > > I've got 21 nodes and was looking to take it up to 32 over the next
> > > few weeks, but that is going to be difficult if the managers are not
> > > working.
> > >
> > > I did try Pacific, and I'm happy to upgrade, but it failed to deploy
> > > more than 6 OSDs, so I gave up and went back to Octopus.
> > >
> > > I'm about to give up on Ceph because it looks really, really
> > > "fragile", and debugging what's going wrong is really difficult.
> > >
> > > I guess I could give up on cephadm and go with a different
> > > provisioning method, but I'm not sure where to start on that.
> > >
> > > Thanks in advance.
> > >
> > > Peter.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
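
For anyone landing on this thread with the same autoscaler confusion, the
commands Peter is alluding to look roughly like the sketch below (the pool
name "mypool" and the example sizes are placeholders, not from the thread;
check autoscale-status output before changing anything):

    # See what the pg_autoscaler thinks each pool should have
    ceph osd pool autoscale-status

    # Hand PG counts back to the autoscaler instead of setting
    # pg_num/pgp_num by hand
    ceph osd pool set mypool pg_autoscale_mode on

    # Give it a hint, as Peter suggests: either the expected eventual
    # size of the pool...
    ceph osd pool set mypool target_size_bytes 100T

    # ...or the fraction of the cluster it is expected to consume
    ceph osd pool set mypool target_size_ratio 0.2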
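
Likewise, the mgr debugging that Eugen's troubleshooting link describes
boils down to commands along these lines (the hostname "host1", the daemon
suffix and <fsid> are placeholders; run the cephadm and journalctl commands
on the node hosting the failing mgr):

    # From any node with a working ceph CLI
    ceph status
    ceph mgr module ls          # spot any non-default mgr modules
    ceph crash ls               # recent daemon crashes; then: ceph crash info <id>
    ceph log last cephadm       # recent cephadm events

    # On the host running the failing mgr
    cephadm ls                  # find the exact daemon name, e.g. mgr.host1.abcdef
    cephadm logs --name mgr.host1.abcdef
    journalctl -u ceph-<fsid>@mgr.host1.abcdef.service

    # After fixing the cause, clear systemd's failed state and retry
    # (what Peter calls "reset-fail")
    systemctl reset-failed ceph-<fsid>@mgr.host1.abcdef.service
    systemctl start ceph-<fsid>@mgr.host1.abcdef.service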