Re: Any ceph constants available?


> Op 4 feb. 2023 om 00:03 heeft Thomas Cannon <thomas.cannon@xxxxxxxxx> het volgende geschreven:
> 
> 
> Hello Ceph community.
> 
> The company that recently hired me has a 3-node Ceph cluster that has been running stably. I am the new, lone administrator here; I do not know Ceph, and this is my first experience with it.
> 
> The issue is that the cluster is running out of space, which is why I built a 4th node and attempted to add it to the cluster. Along the way, things have begun to break. The manager daemon failed over from boreal-01 to boreal-02, and I was unable to get it to fail back to boreal-01. While working on it yesterday, I also realized that the nodes in the cluster are all running different versions of the software. I suspect that might be a large part of why things aren't working as expected.
> 
> Boreal-01 - the host - 17.2.5:
> 
> root@boreal-01:/home/kadmin# ceph -v
> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
> root@boreal-01:/home/kadmin# 
> 
> Boreal-01 - the admin docker instance running on the host 17.2.1:
> 
> root@boreal-01:/home/kadmin# cephadm shell
> Inferring fsid 951fa730-0228-11ed-b1ef-f925f77b75d3
> Inferring config /var/lib/ceph/951fa730-0228-11ed-b1ef-f925f77b75d3/mon.boreal-01/config
> Using ceph image with id 'e5af760fa1c1' and tag 'v17' created on 2022-06-23 19:49:45 +0000 UTC
> quay.io/ceph/ceph@sha256:d3f3e1b59a304a280a3a81641ca730982da141dad41e942631e4c5d88711a66b
> root@boreal-01:/# ceph -v
> ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
> root@boreal-01:/# 
> 
> Boreal-02 - 15.2.16:
> 
> root@boreal-02:/home/kadmin# ceph -v
> ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> root@boreal-02:/home/kadmin# 
> 
> 
> Boreal-03 - 15.2.18:
> 
> root@boreal-03:/home/kadmin# ceph -v
> ceph version 15.2.18 (f2877ae32a72fc25acadef57597f44988b805c38) octopus (stable)
> root@boreal-03:/home/kadmin# 
> 
> And the host I added - Boreal-04 - 17.2.5:
> 
> root@boreal-04:/home/kadmin# ceph -v
> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
> root@boreal-04:/home/kadmin# 
> 
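Before changing anything, it is probably worth confirming what every daemon is actually running, since the containerized daemons can differ from the host's ceph CLI (as your boreal-01 output already shows). These are read-only and should be safe:

ceph versions          # summary of running daemon versions, grouped by daemon type
ceph orch ps           # per-daemon detail, incl. image/version, once the orchestrator responds again

If I remember right, Octopus -> Quincy is a supported upgrade path, so the end goal of getting everything onto 17.2.x in one pass is realistic.
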
> The cluster isn’t rebalancing data, and drives are filling up unevenly, despite the automatic balancer being on. I can run a df and see that it isn’t working; however, the balancer says it is:
> 
> root@boreal-01:/# ceph balancer status 
> {
>    "active": true,
>    "last_optimize_duration": "0:00:00.011905",
>    "last_optimize_started": "Fri Feb  3 18:39:02 2023",
>    "mode": "upmap",
>    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
>    "plans": []
> }
> root@boreal-01:/# 
> 
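The balancer reporting "distribution is already perfect" while several OSDs are nearfull suggests looking at the per-OSD spread rather than the pool totals. These are read-only as well:

ceph osd df tree                          # utilization per OSD, laid out by CRUSH tree - look for a wide spread in %USE
ceph osd get-require-min-compat-client    # upmap mode needs this to be luminous or newer, if I recall correctly
ceph features                             # what the connected clients actually report
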
> root@boreal-01:/# ceph -s
>  cluster:
>    id:     951fa730-0228-11ed-b1ef-f925f77b75d3
>    health: HEALTH_WARN
>            There are daemons running an older version of ceph
>            6 nearfull osd(s)
>            3 pgs not deep-scrubbed in time
>            3 pgs not scrubbed in time
>            4 pool(s) nearfull
>            1 daemons have recently crashed
> 
>  services:
>    mon: 4 daemons, quorum boreal-01,boreal-02,boreal-03,boreal-04 (age 22h)
>    mgr: boreal-02.lqxcvk(active, since 19h), standbys: boreal-03.vxhpad, boreal-01.ejaggu
>    mds: 2/2 daemons up, 2 standby
>    osd: 89 osds: 89 up (since 5d), 89 in (since 45h)
> 
>  data:
>    volumes: 2/2 healthy
>    pools:   7 pools, 549 pgs
>    objects: 227.23M objects, 193 TiB
>    usage:   581 TiB used, 356 TiB / 937 TiB avail
>    pgs:     533 active+clean
>             16  active+clean+scrubbing+deep
> 
>  io:
>    client:   55 MiB/s rd, 330 KiB/s wr, 21 op/s rd, 45 op/s wr
> 
> root@boreal-01:/# 
> 
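For the nearfull warnings and the "1 daemons have recently crashed", these should narrow things down without changing anything:

ceph health detail            # names the specific nearfull OSDs and pools
ceph crash ls                 # lists recent crashes with their IDs
ceph crash info <crash-id>    # <crash-id> taken from the ls output above

Raising the warning threshold a little (e.g. ceph osd set-nearfull-ratio 0.87) only moves the warning, it doesn't create space, so I'd treat that as a last resort.
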
> Part of me suspects that I exacerbated the problems by monkeying with boreal-04 for several days, trying to get the drives inside that machine turned into OSDs so they would be used. One thing I did was attempt to upgrade the software on that machine, and I may have triggered a cluster-wide upgrade that failed everywhere except nodes 1 and 4. With nodes 2 and 3 not even on the same major release, I can see why, if I did make that mistake, things would have gotten worse rather than upgraded.
> 
> According to the documentation, I should be able to upgrade the entire cluster by running a single command on the admin node, but when I try to run commands I get errors that even Google can’t solve:
> 
> root@boreal-01:/# ceph orch host ls
> Error ENOENT: Module not found
> root@boreal-01:/# 
> 
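"Error ENOENT: Module not found" for ceph orch usually means the active mgr doesn't have the orchestrator/cephadm module enabled - and your active mgr is currently on boreal-02, which (judging by its ceph -v) may still be an Octopus daemon. Roughly, and with the caveat that I can't see your cluster:

ceph mgr module ls                 # check whether "cephadm" shows up as enabled
ceph mgr module enable cephadm     # only if it isn't enabled
ceph orch set backend cephadm
ceph mgr fail boreal-02.lqxcvk     # fail over and see whether a newer standby mgr takes over
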
> Consequently, I have very little faith that running the commands to upgrade everything to the same release will work. Upgrading each host individually might fix things, but I do not feel confident doing so and risking our data.
> 
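Once ceph orch answers again, the single-command upgrade the docs describe is the cephadm one; something like this, with 17.2.5 only because that is what boreal-01/04 already run:

ceph orch upgrade status                       # make sure nothing is already mid-upgrade
ceph orch upgrade start --ceph-version 17.2.5  # staggered, cluster-wide upgrade to one release

With mixed mon/mgr releases and nearfull OSDs I wouldn't run that unattended, though, so getting experienced eyes on it first (as you're already planning) seems wise.
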
> Hopefully that gives a better idea of the problems I am facing. I am hoping for some professional services hours with someone who is a true expert with this software

I’ve seen 42on.com being recommended before (no affiliation).

> , to get us to a stable and sane deployment that can be managed without it being a terrifying guessing game, trying to get it to work.
> 
> If that is you, or if you know someone who can help — please contact me!
> 
> Thank you!
> 
> Thomas
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



