Re: ceph orch commands stuck

Hi,

On 30.08.21 15:36, Oliver Weinmann wrote:
Hi,



we had one failed OSD in our cluster that we have since replaced. Since then the cluster has been behaving very strangely and some ceph commands like ceph crash or ceph orch are stuck.


Just two unrelated thoughts:


- never use two mons. If one of them fails for whatever reason, your whole cluster will stop working, because a quorum always requires _more_ than half of the members. Use at least three mons for any production cluster (or five for the paranoid). A rough cephadm sketch for getting to three mons follows below.


- this might be debatable... but do not use a separate cluster network in such a tiny cluster. It makes deployment a lot more complex without a significant advantage. Keep it simple and, if possible, use an LACP bond covering both 10G interfaces (see the bonding sketch below).
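
Untested from my side, but with cephadm something along these lines should get you from two to three mons, assuming gedaopl02 (which shows up in your set-addr commands below) is healthy enough to carry one:

ceph orch host ls                                                # check which hosts the orchestrator knows about
ceph orch apply mon --placement="gedasvl98 gedaopl03 gedaopl02"
ceph orch apply mon 3                                            # or simply request three mons and let cephadm pick the hosts

And if you go the bonding route, a rough NetworkManager sketch (interface names ens1f0/ens1f1 and the address are placeholders; the switch ports need to be configured for 802.3ad/LACP as well):

nmcli con add type bond con-name bond0 ifname bond0 bond.options "mode=802.3ad,miimon=100"
nmcli con add type ethernet con-name bond0-port1 ifname ens1f0 master bond0
nmcli con add type ethernet con-name bond0-port2 ifname ens1f1 master bond0
nmcli con mod bond0 ipv4.method manual ipv4.addresses 192.168.30.200/24
nmcli con up bond0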


And on topic:

- find out which daemons have crashed (the crash module commands below should tell you)

- you can try to reduce the size of the mon stores by manual compaction (I don't know the exact procedure in a container setup, but a possible approach is sketched below)

- consult the mon logs for hints on why the store is growing
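
Roughly, and untested on a containerized cluster, the crash module plus a tell command should cover the first two points (mon names taken from your ceph -s output):

ceph crash ls                         # list recent daemon crashes
ceph crash info <crash-id>            # details for a single crash
ceph crash archive-all                # clears the "recently crashed" warning once reviewed
ceph tell mon.gedasvl98 compact       # ask a mon to compact its store
ceph tell mon.gedaopl03 compact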


Regards,

Burkhard




Cluster health:



[root@gedasvl98 ~]# ceph -s
  cluster:
    id:     ec9e031a-cd10-11eb-a3c3-005056b7db1f
    health: HEALTH_WARN
            mons gedaopl03,gedasvl98 are using a lot of disk space
            mon gedasvl98 is low on available space
            2 daemons have recently crashed
            911 slow ops, oldest one blocked for 62 sec, daemons [mon.gedaopl03,mon.gedasvl98] have slow ops.

  services:
    mon: 2 daemons, quorum gedasvl98,gedaopl03 (age 27m)
    mgr: gedaopl01.fjpsnc(active, since 44m), standbys: gedaopl03.japugq
    mds: 1/1 daemons up, 1 standby
    osd: 9 osds: 9 up (since 27m), 9 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 289 pgs
    objects: 7.19k objects, 39 GiB
    usage:   118 GiB used, 7.7 TiB / 7.8 TiB avail
    pgs:     289 active+clean

  io:
    client:   170 B/s rd, 170 B/s wr, 0 op/s rd, 0 op/s wr



If I understand correctly, the mon containers using a lot of disk space could be due to the failed OSD and unclean PGs. But the PGs are clean now, so I would expect the mons to free up disk space again. I have also restarted both mons, but no change there. Then I remembered that I recently changed the IPs of the ceph nodes using:



ceph orch host set-addr gedaopl01 192.168.30.200
ceph orch host set-addr gedaopl02 192.168.30.201
ceph orch host set-addr gedaopl03 192.168.30.202
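
For reference, whether the orchestrator inventory and the monmap actually agree with the new addresses can be cross-checked with two standard commands that just list the current state:

ceph orch host ls
ceph mon dump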



This was mainly because I think I got it all wrong in the first place when deploying the cluster with cephadm. Our nodes have three network ports:



1 x 1 GbE public network 172.28.4.x (used for OS deployment etc.)

1 x 10 GbE ceph cluster network 192.168.41.x

1 x 10 GbE ceph public network 192.168.30.x



If I understood correctly, the IPs of the mons should be in the ceph public network (192.168.30.x). Maybe the changes I made have caused this trouble?
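
As a quick sanity check, the configured public network can be inspected and, if it really turns out to be wrong, corrected (the /24 below is only an assumption based on the addresses above):

ceph config get mon public_network
ceph config set mon public_network 192.168.30.0/24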



Best Regards,

Oliver






_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



