Re: ceph orch commands stuck

Hi,

On 30.08.21 15:36, Oliver Weinmann wrote:
Hi,



we had one failed OSD in our cluster that we have since replaced. Since then the cluster has been behaving very strangely and some ceph commands like ceph crash or ceph orch are stuck.


Just two unrelated thoughts:


- never use two mons. If one of them fails for whatever reason, your whole cluster will stop working, because a quorum always requires _more_ than half of the members. Use at least three mons for any production cluster (or five for the paranoid). A rough cephadm sketch for getting to three mons follows below.


- this might be debatable... but do not use a separate cluster network in such a tiny cluster. It makes deployment a lot more complex without a significant advantage. Keep it simple and, if possible, use an LACP bond covering both 10G interfaces (see the bonding sketch below).
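
Untested from my side, but with cephadm something along these lines should get you from two to three mons, assuming gedaopl02 (which shows up in your set-addr commands below) is healthy enough to carry one:

ceph orch host ls                                                # check which hosts the orchestrator knows about
ceph orch apply mon --placement="gedasvl98 gedaopl03 gedaopl02"
ceph orch apply mon 3                                            # or simply request three mons and let cephadm pick the hosts

And if you go the bonding route, a rough NetworkManager sketch (interface names ens1f0/ens1f1 and the address are placeholders; the switch ports need to be configured for 802.3ad/LACP as well):

nmcli con add type bond con-name bond0 ifname bond0 bond.options "mode=802.3ad,miimon=100"
nmcli con add type ethernet con-name bond0-port1 ifname ens1f0 master bond0
nmcli con add type ethernet con-name bond0-port2 ifname ens1f1 master bond0
nmcli con mod bond0 ipv4.method manual ipv4.addresses 192.168.30.200/24
nmcli con up bond0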


And on topic:

- find out which daemons have crashed (the crash module commands below should tell you)

- you can try to reduce the size of the mon stores by manual compaction (I don't know the exact procedure in a container setup, but a possible approach is sketched below)

- consult the mon logs for hints on why the store is growing
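
Roughly, and untested on a containerized cluster, the crash module plus a tell command should cover the first two points (mon names taken from your ceph -s output):

ceph crash ls                         # list recent daemon crashes
ceph crash info <crash-id>            # details for a single crash
ceph crash archive-all                # clears the "recently crashed" warning once reviewed
ceph tell mon.gedasvl98 compact       # ask a mon to compact its store
ceph tell mon.gedaopl03 compact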


Regards,

Burkhard




Cluster health:



[root@gedasvl98 ~]# ceph -s
  cluster:
    id:     ec9e031a-cd10-11eb-a3c3-005056b7db1f
    health: HEALTH_WARN
            mons gedaopl03,gedasvl98 are using a lot of disk space
            mon gedasvl98 is low on available space
            2 daemons have recently crashed
            911 slow ops, oldest one blocked for 62 sec, daemons [mon.gedaopl03,mon.gedasvl98] have slow ops.

  services:
    mon: 2 daemons, quorum gedasvl98,gedaopl03 (age 27m)
    mgr: gedaopl01.fjpsnc(active, since 44m), standbys: gedaopl03.japugq
    mds: 1/1 daemons up, 1 standby
    osd: 9 osds: 9 up (since 27m), 9 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 289 pgs
    objects: 7.19k objects, 39 GiB
    usage:   118 GiB used, 7.7 TiB / 7.8 TiB avail
    pgs:     289 active+clean

  io:
    client:   170 B/s rd, 170 B/s wr, 0 op/s rd, 0 op/s wr



If I understand correctly, the mon containers using a lot of disk space could be due to the failed OSD and unclean PGs. But the PGs are clean now, so I would expect the mons to free up disk space again. I have also restarted both mons, but no change there. Then I remembered that I recently changed the IPs of the ceph nodes using:



ceph orch host set-addr gedaopl01 192.168.30.200
ceph orch host set-addr gedaopl02 192.168.30.201
ceph orch host set-addr gedaopl03 192.168.30.202
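
For reference, whether the orchestrator inventory and the monmap actually agree with the new addresses can be cross-checked with two standard commands that just list the current state:

ceph orch host ls
ceph mon dump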



This was mainly because I think I got it all wrong in the first place when deploying the cluster with cephadm. Our nodes have three network ports:



1 x 1 GbE public network 172.28.4.x (used for OS deployment etc.)

1 x 10 GbE ceph cluster network 192.168.41.x

1 x 10 GbE ceph public network 192.168.30.x



If I understood correctly, the IPs of the mons should be in the ceph public network (192.168.30.x). Maybe the changes I made have caused this trouble?
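
As a quick sanity check, the configured public network can be inspected and, if it really turns out to be wrong, corrected (the /24 below is only an assumption based on the addresses above):

ceph config get mon public_network
ceph config set mon public_network 192.168.30.0/24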



Best Regards,

Oliver






_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



