Re: Error ENOENT: Module not found - ceph orch commands stopped working

On 12-11-2024 09:55, Eugen Block wrote:
I think the Reef backport will be available in the next point release (18.2.5). Squid should already have it, if I'm not mistaken. But I'm not sure if you want to upgrade just to mitigate this issue.
You can extract the faulty key:

ceph config-key get mgr/cephadm/osd_remove_queue > osd_remove_queue.json

Then remove only the "original_weight" key from that JSON, save it as e.g. osd_remove_queue_modified.json, and upload it back to the config-key store:

ceph config-key set mgr/cephadm/osd_remove_queue -i osd_remove_queue_modified.json
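
For the editing step, if jq is available, something like this might do it (this assumes the stored value is a JSON list of per-OSD objects; double-check the extracted file before uploading anything back):

# assumes osd_remove_queue.json contains a JSON array of per-OSD objects
jq 'map(del(.original_weight))' osd_remove_queue.json > osd_remove_queue_modified.json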

Then fail the mgr:

ceph mgr fail

And then it hopefully works again.
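
To check whether the orchestrator is responding again, something like:

ceph orch status
ceph orch osd rm status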

Indeed it did, thanks! =)

Best regards,

Torkil


Quoting Torkil Svensgaard <torkil@xxxxxxxx>:

On 12-11-2024 09:29, Eugen Block wrote:
Hi Torkil,

Hi Eugen

this sounds suspiciously like https://tracker.ceph.com/issues/67329
Do you have the same (or similar) stack trace in the mgr log pointing to osd_remove_queue? You seem to have removed some OSDs, which would fit the description as well...
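
If you need to dig the trace out, the mgr log on a cephadm deployment can be read on the mgr host with something like (daemon name here is just taken from your ceph -s output, substitute your own):

# substitute your active mgr daemon name
cephadm logs --name mgr.ceph-flash2.utlhuz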

Indeed, I had just put a host into drain, and there's this in the log:

"
2024-11-12T08:10:48.390+0000 7f1b2e088640 -1 mgr load Failed to construct class in 'cephadm'
2024-11-12T08:10:48.390+0000 7f1b2e088640 -1 mgr load Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
    self.to_remove_osds.load_from_store()
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 924, in load_from_store
    osd_obj = OSD.from_json(osd, rm_util=self.rm_util)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 789, in from_json
    return cls(**inp)
TypeError: __init__() got an unexpected keyword argument 'original_weight'

2024-11-12T08:10:48.392+0000 7f1b2e088640 -1 mgr operator() Failed to run module in active mode ('cephadm')
"

It's not clear to me from the tracker how to recover, though. The tracker issue appears to be resolved, so should I just be able to pull new container images somehow?

Best regards,

Torkil

Regards,
Eugen

Quoting Torkil Svensgaard <torkil@xxxxxxxx>:

Hi

18.2.4.

After failing over the active manager, ceph orch commands seem to have stopped working. There's this in the mgr log:

"
2024-11-12T08:16:30.136+0000 7f1b2d887640  0 log_channel(audit) log [DBG] : from='client.2088861125 -' entity='client.admin' cmd=[{"prefix": "orch osd rm status", "target": ["mon-mgr", ""]}]: dispatch
2024-11-12T08:16:30.136+0000 7f1b23cf4640 -1 no module 'cephadm'
2024-11-12T08:16:30.136+0000 7f1b23cf4640 -1 no module 'cephadm'
2024-11-12T08:16:30.136+0000 7f1b23cf4640 -1 mgr.server reply reply (2) No such file or directory Module not found
"

The module is still enabled:

"
[root@ceph-flash1 ~]# ceph mgr module ls
MODULE
balancer              on (always on)
crash                 on (always on)
devicehealth          on (always on)
orchestrator          on (always on)
pg_autoscaler         on (always on)
progress              on (always on)
rbd_support           on (always on)
status                on (always on)
telemetry             on (always on)
volumes               on (always on)
alerts                on
cephadm               on
dashboard             on
insights              on
iostat                on
nfs                   on
prometheus            on
stats                 on
diskprediction_local  -
influx                -
k8sevents             -
localpool             -
mds_autoscaler        -
mirroring             -
osd_perf_query        -
osd_support           -
restful               -
rgw                   -
rook                  -
selftest              -
snap_schedule         -
telegraf              -
test_orchestrator     -
zabbix                -
"

Cluster is working:

"
[root@ceph-flash1 ~]# ceph -s
  cluster:
    id:     8ee2d228-ed21-4580-8bbf-0649f229e21d
    health: HEALTH_WARN
            noout flag(s) set
            5 nearfull osd(s)
            Degraded data redundancy: 22742466/3557557778 objects degraded (0.639%), 559 pgs degraded, 559 pgs undersized
            4 pool(s) nearfull

  services:
    mon: 5 daemons, quorum ceph-flash1,ceph-flash2,ceph-flash3,grouchy,klutzy (age 8d)
    mgr: ceph-flash2.utlhuz(active, since 10m), standbys: ceph-flash3.ciudre, ceph-flash1.erhakb
    mds: 1/1 daemons up, 2 standby
    osd: 567 osds: 555 up (since 14h), 555 in (since 4d); 2689 remapped pgs
         flags noout

  data:
    volumes: 1/1 healthy
    pools:   17 pools, 15521 pgs
    objects: 619.72M objects, 1.3 PiB
    usage:   2.3 PiB used, 2.0 PiB / 4.3 PiB avail
    pgs:     22742466/3557557778 objects degraded (0.639%)
             135987731/3557557778 objects misplaced (3.823%)
             12832 active+clean
             2111  active+remapped+backfill_wait
             479   active+undersized+degraded+remapped+backfill_wait
             80    active+undersized+degraded+remapped+backfilling
             19    active+remapped+backfilling

  io:
    client:   73 MiB/s rd, 5.4 MiB/s wr, 574 op/s rd, 169 op/s wr
    recovery: 2.9 GiB/s, 1.01k objects/s
"

Suggestions? I tried failing over the manager again, which didn't help.

Best regards,

Torkil
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



