Hi,
can you tell us a bit more about what exactly is happening?
> Currently I'm having an issue where every time I add a new server it adds
> the OSDs on the node, and then a few random OSDs on the current hosts will
> all fall over and I'll only be able to get them up again by restarting the
> daemons.
What is the "current host"? The new you're adding or a different one?
I'm not sure if I understand this correctly, please correct me in case
I don't. Are you saying that adding a new host with new OSDs causes
other OSDs on other hosts to fail?
A couple of weeks ago you were struggling to bring OSDs up; I assume that is
not an issue anymore? But in that thread [1] you mentioned lots of disks per
node but only 48 GB of RAM per node. Is that still the setup you're referring
to? If not, please add more information.
If it is the same setup (which I assume) you could be hitting a memory issue.
The default osd_memory_target is 4 GB, so with 12 OSDs you already reach the
memory limit of your nodes, not even taking the host OS into account. When you
add new OSDs the cluster has to rebalance and remap PGs, which puts a higher
load on all OSDs and can lead to flapping OSDs.
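A quick sanity check (just a sketch, assuming the centralized config database
that cephadm uses and shell access to an affected host) would be to verify the
effective memory target and look for OOM kills around the time the OSDs fell
over:

  # current per-OSD memory target in bytes (default is 4294967296 = 4 GiB)
  ceph config get osd osd_memory_target

  # did the kernel kill OSD processes because the node ran out of memory?
  dmesg -T | grep -i 'out of memory'

If the OOM killer shows up there, that would confirm the memory theory.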
You could reduce the memory target to see if that helps as a temporary
workaround.
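For example (purely a sketch, the exact value is your call; with 12 OSDs on a
48 GB node something around 2 GiB per OSD leaves headroom for the OS and the
other daemons):

  # limit each OSD daemon to 2 GiB, cluster-wide
  ceph config set osd osd_memory_target 2147483648

You can revert to the default once the nodes have more RAM or fewer OSDs.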
Regards,
Eugen
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/DYYEQHSAW3US6ECWEKBCMSOWV4C7WROL/
Quoting Peter Childs <pchilds@xxxxxxx>:
I'm still attempting to build a Ceph cluster and I'm currently getting
nowhere very, very quickly. From what I can tell I have a slightly unstable
setup and I've yet to work out why.
I currently have 24 servers and I'm planning to increase this to around 48.
These servers are in three groups, with a different type (and number) of
disks in each group.
Currently I'm having an issue where every time I add a new server it adds
the OSDs on the node, and then a few random OSDs on the current hosts will all
fall over and I'll only be able to get them up again by restarting the daemons.
I'm using cephadm, and the network is a QDR-based InfiniBand network running
IP over IB, so it's meant to be 40G but currently behaves more like 10G (when
I've tested it). It's still faster than the 1G management network I've also
got.
The machines are mostly running Debian. There are a few machines running
CentOS 7 that I'm meaning to redeploy when I get the time (so I can upgrade
to Pacific).
I'm running Octopus 15.2.13. I'm more than happy to change things; I'm still
trying to learn, so there is no data that I care about quite yet, but I was
looking for more stability before I go there.
I really just want to know where to look for the problems rather than expect
exact answers; I've yet to see any clues that might help.
Thanks in advance
Peter Childs
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx