Hi,
On 30.08.21 15:36, Oliver Weinmann wrote:
Hi,
we had one failed osd in our cluster, which we have replaced. Since then
the cluster has been behaving very strangely, and some ceph commands,
such as ceph crash or ceph orch, hang.
Just two unrelated thoughts:
- never use two mons. If one of them fails for whatever reason, your
whole cluster will stop working. A quorum always requires _more_ than
half of the members. Use at least three mons for anything in production
(or five for the paranoid ones).
- this might be debatable... but do not use a cluster network in such a
tiny cluster. It makes deployment a lot more complex without any
significant advantage. Keep it simple. Use an LACP bond covering both
interfaces if possible.
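To illustrate the quorum point above, here is a quick arithmetic sketch (plain shell, not a Ceph command):

```shell
# A quorum needs strictly more than half the mons: floor(n/2) + 1.
# With 2 mons, quorum is 2, so a single failure stops the cluster;
# with 3 mons you can lose one, with 5 you can lose two.
for n in 2 3 5; do
    quorum=$((n / 2 + 1))
    tolerated=$((n - quorum))
    echo "$n mons: quorum size $quorum, tolerated failures $tolerated"
done
```

So two mons tolerate zero failures, which is why two is no better than one.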
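For the bonding suggestion, a minimal sketch using nmcli; the interface names (ens1f0/ens1f1) and the address are placeholders for your environment, and the switch side must be configured for LACP as well:

```shell
# Sketch: 802.3ad (LACP) bond over the two 10GbE ports.
# Interface names and IP address below are placeholders.
nmcli con add type bond ifname bond0 con-name bond0 \
    bond.options "mode=802.3ad,miimon=100"
nmcli con add type ethernet ifname ens1f0 master bond0
nmcli con add type ethernet ifname ens1f1 master bond0
nmcli con mod bond0 ipv4.addresses 192.168.30.200/24 ipv4.method manual
nmcli con up bond0
```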
And on topic:
- find out which daemons have crashed
- you can try to reduce the size of the mon stores by manual compaction
(don't know how to do this in a container setup...)
- consult the mon logs for hints why the store is growing
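For the compaction step, the usual non-containerized form is `ceph tell mon.<id> compact`; in a cephadm deployment it should also be reachable from the admin shell. Treat the invocation below as a sketch (the mon name is taken from this thread):

```shell
# Trigger a manual compaction of one mon's store:
ceph tell mon.gedasvl98 compact

# In a cephadm/container setup, run it from inside the admin shell:
cephadm shell -- ceph tell mon.gedasvl98 compact
```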
Regards,
Burkhard
Cluster health:
[root@gedasvl98 ~]# ceph -s
cluster:
  id:     ec9e031a-cd10-11eb-a3c3-005056b7db1f
  health: HEALTH_WARN
          mons gedaopl03,gedasvl98 are using a lot of disk space
          mon gedasvl98 is low on available space
          2 daemons have recently crashed
          911 slow ops, oldest one blocked for 62 sec, daemons
          [mon.gedaopl03,mon.gedasvl98] have slow ops.

services:
  mon: 2 daemons, quorum gedasvl98,gedaopl03 (age 27m)
  mgr: gedaopl01.fjpsnc(active, since 44m), standbys: gedaopl03.japugq
  mds: 1/1 daemons up, 1 standby
  osd: 9 osds: 9 up (since 27m), 9 in (since 2h)

data:
  volumes: 1/1 healthy
  pools:   10 pools, 289 pgs
  objects: 7.19k objects, 39 GiB
  usage:   118 GiB used, 7.7 TiB / 7.8 TiB avail
  pgs:     289 active+clean

io:
  client: 170 B/s rd, 170 B/s wr, 0 op/s rd, 0 op/s wr
If I understand correctly, the mon containers using a lot of disk
space could be due to the failed osd and unclean pgs. But the pgs are
clean now, so I would expect the mons to free up disk space again. I
have also restarted the active and standby mons, but no change. Then I
remembered that I recently changed the IPs of the ceph nodes using:
ceph orch host set-addr gedaopl01 192.168.30.200
ceph orch host set-addr gedaopl02 192.168.30.201
ceph orch host set-addr gedaopl03 192.168.30.202
This was mainly because I think I got it wrong in the first place when
deploying the cluster with cephadm. Our nodes have 3 network ports:
1 x 1GbE public network 172.28.4.x (used for OS deployment etc.)
1 x 10GbE ceph cluster network 192.168.41.x
1 x 10GbE ceph public network 192.168.30.x
If I understood correctly, the mons' IPs should be in the ceph public
network (192.168.30.x). Maybe the changes I made caused this trouble?
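For reference, the split described above would roughly correspond to the following cluster configuration. The subnets are taken from this message; verify against your own `ceph config dump` output before changing anything:

```shell
# Sketch: record the public/cluster network split in the cluster config.
ceph config set global public_network  192.168.30.0/24
ceph config set global cluster_network 192.168.41.0/24

# Mons must bind to addresses inside public_network; check with:
ceph config get mon public_network
ceph mon dump
```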
Best Regards,
Oliver
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx