Congrats landing a fun new job! That's quite the mess you have to untangle there.

I'd suggest, since all of those versions support orchestrator/cephadm, running through the cephadm conversion process here: https://docs.ceph.com/en/latest/cephadm/adoption/ That should get you to the point where you can use ceph orch commands to get your versions aligned.

As for why the balancer isn't working: first, is 89 the correct number of OSDs after you added the 4th host? I'd also wonder whether your new host is in the correct root of the CRUSH map. Check `ceph osd tree` to ensure that all storage hosts are equal and subordinate to the same root (probably "default"). At 62% raw utilization you should be OK to rebalance, but things get more challenging above 70% full, and downright painful above 80%.

You should also check your pools' pg_num with `ceph osd pool autoscale-status`. If the autoscaler isn't enabled, some pg_num adjustments might knock the balancer loose.

It's concerning that you have 4 pools warning nearfull but 7 pools in the cluster. This may imply that the pools are not distributed equally among the OSDs and buckets in your CRUSH map. Check `ceph osd pool ls detail` and see which crush_rule is assigned to each pool. If they're not all the same, you're going to need to do some digging into your CRUSH map to figure out why, and whether that's for a good reason or the result of poor design or implementation.
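In concrete terms, those checks are all read-only and look roughly like this (a sketch; the exact output will of course vary by cluster):

ceph osd tree                    # confirm boreal-04 and its OSDs sit under the same root as the other hosts
ceph osd df tree                 # per-OSD utilization grouped by CRUSH bucket, to see how uneven things are
ceph osd pool autoscale-status   # current and target pg_num for each pool, if the autoscaler is available
ceph osd pool ls detail          # size, pg_num, and crush_rule for every pool
ceph osd crush rule dump         # definitions of the CRUSH rules those pools reference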
Best of luck,
Josh

From: Thomas Cannon <thomas.cannon@xxxxxxxxx>
Date: Friday, February 3, 2023 at 5:02 PM
To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: [EXTERNAL] Any ceph constants available?

Hello Ceph community.

The company that recently hired me has a 3 node Ceph cluster that has been running and stable. I am the new lone administrator here, I do not know Ceph, and this is my first experience with it.

The issue was that it is/was running out of space, which is why I built a 4th node and attempted to add it into the cluster. Along the way, things have begun to break. The manager daemon failed over from boreal-01 to boreal-02 and I tried to get it to fail back to boreal-01, but was unable to. While working on it yesterday, I realized that the nodes in the cluster are all running different versions of the software. I suspect that might be a huge part of why things aren't working as expected.

Boreal-01 - the host - 17.2.5:

root@boreal-01:/home/kadmin# ceph -v
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
root@boreal-01:/home/kadmin#

Boreal-01 - the admin docker instance running on the host - 17.2.1:

root@boreal-01:/home/kadmin# cephadm shell
Inferring fsid 951fa730-0228-11ed-b1ef-f925f77b75d3
Inferring config /var/lib/ceph/951fa730-0228-11ed-b1ef-f925f77b75d3/mon.boreal-01/config
Using ceph image with id 'e5af760fa1c1' and tag 'v17' created on 2022-06-23 19:49:45 +0000 UTC
quay.io/ceph/ceph@sha256:d3f3e1b59a304a280a3a81641ca730982da141dad41e942631e4c5d88711a66b
root@boreal-01:/# ceph -v
ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
root@boreal-01:/#

Boreal-02 - 15.2.16:

root@boreal-02:/home/kadmin# ceph -v
ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
root@boreal-02:/home/kadmin#

Boreal-03 - 15.2.18:

root@boreal-03:/home/kadmin# ceph -v
ceph version 15.2.18 (f2877ae32a72fc25acadef57597f44988b805c38) octopus (stable)
root@boreal-03:/home/kadmin#

And the host I added - Boreal-04 - 17.2.5:

root@boreal-04:/home/kadmin# ceph -v
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
root@boreal-04:/home/kadmin#

The cluster isn't rebalancing data, and drives are filling up unevenly, despite auto balancing being on. I can run a df and see that it isn't working. However, it says it is:

root@boreal-01:/# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.011905",
    "last_optimize_started": "Fri Feb 3 18:39:02 2023",
    "mode": "upmap",
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}
root@boreal-01:/#

root@boreal-01:/# ceph -s
  cluster:
    id:     951fa730-0228-11ed-b1ef-f925f77b75d3
    health: HEALTH_WARN
            There are daemons running an older version of ceph
            6 nearfull osd(s)
            3 pgs not deep-scrubbed in time
            3 pgs not scrubbed in time
            4 pool(s) nearfull
            1 daemons have recently crashed

  services:
    mon: 4 daemons, quorum boreal-01,boreal-02,boreal-03,boreal-04 (age 22h)
    mgr: boreal-02.lqxcvk(active, since 19h), standbys: boreal-03.vxhpad, boreal-01.ejaggu
    mds: 2/2 daemons up, 2 standby
    osd: 89 osds: 89 up (since 5d), 89 in (since 45h)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 549 pgs
    objects: 227.23M objects, 193 TiB
    usage:   581 TiB used, 356 TiB / 937 TiB avail
    pgs:     533 active+clean
             16  active+clean+scrubbing+deep

  io:
    client: 55 MiB/s rd, 330 KiB/s wr, 21 op/s rd, 45 op/s wr

root@boreal-01:/#

Part of me suspects that I exacerbated the problems by monkeying with boreal-04 for several days, trying to get the drives inside the machine turned into OSDs so that they would be used. One thing I did was attempt to upgrade the code on that machine, and I could have triggered a cluster-wide upgrade that failed everywhere except on 1 and 4. With 2 and 3 not even running the same major release, if I did make that mistake, I can see why instead of an upgrade, things would be worse.
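If I really did kick off an orchestrated upgrade by accident, my understanding is that it should be visible (and stoppable) with something like the following, though the orch commands depend on the orchestrator module actually responding:

ceph orch upgrade status    # reports the target version/image and whether an upgrade is in progress
ceph orch upgrade stop      # halts an in-flight orchestrated upgrade
ceph crash ls               # lists recent daemon crashes, including the one behind the HEALTH_WARN
ceph crash info <crash-id>  # full details for a specific crash from that list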
According to the documentation, I should be able to upgrade the entire cluster by running a single command on the admin node, but when I go to run commands I get errors that even Google can't solve:

root@boreal-01:/# ceph orch host ls
Error ENOENT: Module not found
root@boreal-01:/#

Consequently, I have very little faith that running the commands to upgrade everything to the same code will work. I think upgrading each host individually could fix things, but I do not feel confident doing so and risking our data.

Hopefully that gives a better idea of the problems I am facing. I am hoping for some professional services hours with someone who is a true expert with this software, to get us to a stable and sane deployment that can be managed without it being a terrifying guessing game. If that is you, or if you know someone who can help — please contact me!

Thank you!

Thomas

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx