Re: Ceph Managers dying?


 



Changing pg_num and pgp_num manually can be a useful tool. Just remember that they need to be a power of 2, and don't increase or decrease by more than a couple of steps at a time, e.g. 64 to 128 or 256, but not straight to 1024.
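
For example (the pool name "mypool" here is just a placeholder), a stepped increase looks like this:

    # check the current values first
    ceph osd pool get mypool pg_num
    ceph osd pool get mypool pgp_num
    # then one doubling at a time
    ceph osd pool set mypool pg_num 128
    ceph osd pool set mypool pgp_num 128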

I had a situation where a couple of OSDs got quite full. I added more capacity, but the rebalance got stuck because there wasn't enough free space on one of the nearly full OSDs to place a PG.

I increased pg_num and pgp_num (doubled them); this effectively made the PGs smaller, so Ceph could finish the rebalance by squeezing a PG into the space left on the OSDs. Once that was done I just turned the autoscaler back on.
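
Turning the autoscaler back on afterwards is just (same placeholder pool name as above):

    ceph osd pool set mypool pg_autoscale_mode on
    # and check what it thinks the pools should look like
    ceph osd pool autoscale-status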

Sent from my iPhone

On 17 Jun 2021, at 18:13, Peter Childs <pchilds@xxxxxxx> wrote:

Found the issue in the end: I'd managed to break the autoscaling feature by
playing with pgp_num and pg_num, and it was getting confusing. I fixed it
by reducing pg_num on some of my test pools, and the manager woke up
and started working again.
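
Roughly what that looked like, for anyone hitting the same thing (the pool name below is made up):

    # spot the pools where I'd fiddled with pg_num/pgp_num
    ceph osd pool ls detail
    # shrink the over-grown test pool back down (still a power of 2)
    ceph osd pool set test-pool pg_num 32
    ceph osd pool set test-pool pgp_num 32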

It wasn't clear exactly what I'd done to kill it, but once I'd worked out
what was crashing, it was possible to figure out what would help.

So I've just learnt: don't play with pgp_num and pg_num; let the
autoscaling feature do its job. Setting the target size or ratio is probably
better.
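
i.e. something like this instead of touching pg_num directly (the ratio and size values here are only illustrative):

    # hint the autoscaler with an expected share of the cluster...
    ceph osd pool set mypool target_size_ratio 0.2
    # ...or an expected absolute size
    ceph osd pool set mypool target_size_bytes 100T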

I like Ceph; it's very different to Spectrum Scale, which I've used for years,
but for now they're different tools for resolving different issues.

Must get around to doing something with what I've learnt so far.

Peter

On Thu, 17 Jun 2021 at 17:53, Eugen Block <eblock@xxxxxx> wrote:

> Hi,
> 
> don't give up on Ceph. ;-)
> 
> Did you try any of the steps from the troubleshooting section [1] to
> gather some events and logs? Could you share them, and maybe also some
> more details about that cluster? Did you enable any non-default mgr
> modules? There have been a couple of reports related to mgr modules.
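> 
> For the mgr specifically, a few commands along those lines (mgr.<hostname>
> is a placeholder; the real daemon name shows up in 'ceph orch ps'):
> 
>    ceph log last cephadm
>    ceph crash ls
>    ceph mgr module ls
>    cephadm logs --name mgr.<hostname>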
> 
> Regards
> Eugen
> 
> [1] https://docs.ceph.com/en/latest/cephadm/troubleshooting/
> 
> 
> Quoting Peter Childs <pchilds@xxxxxxx>:
> 
>> Let's try to stop this message turning into a mass moaning session about
>> Ceph, and try to get this newbie able to use it.
>> 
>> I've got a Ceph Octopus cluster; it's relatively new and was deployed using
>> cephadm.
>> 
>> It was working fine, but now the managers start up, run for about 30
>> seconds and then die, until systemctl gives up and I have to reset-fail
>> them to get them to try again, at which point they fail once more.
>> 
>> How do I work out why and get them working again?
>> 
>> I've got 21 nodes and was looking to take it up to 32 over the next few
>> weeks, but that is going to be difficult if the managers are not working.
>> 
>> I did try Pacific and I'm happy to upgrade, but it failed to deploy more
>> than 6 OSDs, so I gave up and went back to Octopus.
>> 
>> I'm about to give up on Ceph because it looks like it's really, really
>> "fragile", and debugging what's going wrong is really difficult.
>> 
>> I guess I could give up on cephadm and go with a different provisioning
>> method but I'm not sure where to start on that.
>> 
>> Thanks in advance.
>> 
>> Peter.
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



