Hi,
can you tell us a bit more about what exactly is happening?
> Currently I'm having an issue where every time I add a new server it adds
> the OSDs on the node, and then a few random OSDs on the current hosts will
> all fall over and I'll only be able to get them up again by restarting the
> daemons.
What is the "current host"? The new you're adding or a different one?
I'm not sure if I understand this correctly, please correct me in case
I don't. Are you saying that adding a new host with new OSDs causes
other OSDs on other hosts to fail?
A couple of weeks ago you were struggling to bring OSDs up; I assume that is
not an issue anymore? But in that thread [1] you mentioned lots of disks per
node but only 48 GB of RAM per node. Is that still the setup you're referring
to? If not, please add more information.
If it is the same setup (which I assume) you could be hitting a memory issue.
The default osd_memory_target is 4 GB, so with 12 OSDs you already reach the
memory limit of your nodes, not even taking the host OS into account. When you
add new OSDs the cluster has to rebalance and remap PGs, which puts a higher
load on all OSDs and can lead to flapping OSDs.
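A quick sanity check (just a sketch, assuming the centralized config database
that cephadm uses and shell access to an affected host) would be to verify the
effective memory target and look for OOM kills around the time the OSDs fell
over:

  # current per-OSD memory target in bytes (default is 4294967296 = 4 GiB)
  ceph config get osd osd_memory_target

  # did the kernel kill OSD processes because the node ran out of memory?
  dmesg -T | grep -i 'out of memory'

If the OOM killer shows up there, that would confirm the memory theory.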
You could reduce the memory target to see if that helps as a temporary
workaround.
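For example (purely a sketch, the exact value is your call; with 12 OSDs on a
48 GB node something around 2 GiB per OSD leaves headroom for the OS and the
other daemons):

  # limit each OSD daemon to 2 GiB, cluster-wide
  ceph config set osd osd_memory_target 2147483648

You can revert to the default once the nodes have more RAM or fewer OSDs.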
Regards,
Eugen
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/DYYEQHSAW3US6ECWEKBCMSOWV4C7WROL/
Quoting Peter Childs <pchilds@xxxxxxx>:
I'm still attempting to build a Ceph cluster and I'm currently getting
nowhere very, very quickly. From what I can tell I have a slightly unstable
setup and I've yet to work out why.
I currently have 24 servers and I'm planning to increase this to around 48.
These servers are in three groups, with a different type (and number) of
disks in each group.
Currently I'm having an issue where every time I add a new server it adds
the OSDs on the node, and then a few random OSDs on the current hosts will all
fall over and I'll only be able to get them up again by restarting the daemons.
I'm using cephadm, and the network is a QDR-based InfiniBand network running
IP over IB, so it's meant to be 40G but currently behaves more like 10G (when
I've tested it). It's still faster than the 1G management network I've also
got.
The machines are mostly running Debian. There are a few machines running
CentOS 7 that I'm meaning to redeploy when I get the time (so I can upgrade
to Pacific).
I'm running Octopus 15.2.13. I'm more than happy to change things; I'm still
trying to learn, so there is no data that I care about quite yet, but I was
looking for more stability before I go there.
I really just want to know where to look for the problems rather than expect
exact answers; I've yet to see any clues that might help.
Thanks in advance
Peter Childs
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx