Hi,

we had one failed OSD in our cluster, which we have replaced. Since then the cluster has been behaving very strangely, and some ceph commands such as ceph crash or ceph orch hang.

Cluster health:

[root@gedasvl98 ~]# ceph -s
  cluster:
    id:     ec9e031a-cd10-11eb-a3c3-005056b7db1f
    health: HEALTH_WARN
            mons gedaopl03,gedasvl98 are using a lot of disk space
            mon gedasvl98 is low on available space
            2 daemons have recently crashed
            911 slow ops, oldest one blocked for 62 sec, daemons [mon.gedaopl03,mon.gedasvl98] have slow ops.

  services:
    mon: 2 daemons, quorum gedasvl98,gedaopl03 (age 27m)
    mgr: gedaopl01.fjpsnc(active, since 44m), standbys: gedaopl03.japugq
    mds: 1/1 daemons up, 1 standby
    osd: 9 osds: 9 up (since 27m), 9 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 289 pgs
    objects: 7.19k objects, 39 GiB
    usage:   118 GiB used, 7.7 TiB / 7.8 TiB avail
    pgs:     289 active+clean

  io:
    client:   170 B/s rd, 170 B/s wr, 0 op/s rd, 0 op/s wr

If I understand correctly, the mon containers using a lot of disk space could be a consequence of the failed OSD and unclean PGs. The PGs are all clean again, though, so I would expect the mons to free up that disk space. I have also restarted both mons, but no change.

Then I remembered that I recently changed the IPs of the Ceph nodes using:

ceph orch host set-addr gedaopl01 192.168.30.200
ceph orch host set-addr gedaopl02 192.168.30.201
ceph orch host set-addr gedaopl03 192.168.30.202

I did this mainly because I think I got the network layout wrong when I initially deployed the cluster with cephadm. Our nodes have 3 network ports:

1 x 1 GbE public network 172.28.4.x (used for OS deployment etc.)
1 x 10 GbE Ceph cluster network 192.168.41.x
1 x 10 GbE Ceph public network 192.168.30.x

If I understood correctly, the IPs of the mons should be in the Ceph public network (192.168.30.x). Could the changes I made have caused this trouble?

Best Regards,
Oliver
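
PS: To double-check whether the set-addr changes left the mons on the intended public network, I assume something along these lines would show it (standard Ceph commands, nothing here is specific to our cluster apart from the fact that ceph orch currently hangs for me):

# Addresses the monitors are actually registered with in the monmap
ceph mon dump

# Public/cluster networks the cluster is configured with
ceph config get mon public_network
ceph config get mon cluster_network

# Addresses cephadm has recorded for each host (one of the commands that hangs for me right now)
ceph orch host ls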
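
And regarding the mon disk space warning, a rough sketch of what I would look at (the store.db path is what I believe cephadm uses on the host, so it may differ, and I am not sure compaction is the right fix here):

# Size of the monitor's RocksDB store on this host (cephadm layout; exact path is my assumption)
du -sh /var/lib/ceph/ec9e031a-cd10-11eb-a3c3-005056b7db1f/mon.gedasvl98/store.db

# Ask the monitor to compact its store
ceph tell mon.gedasvl98 compact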