Re: [ext] Re: snap_schedule MGR module not available after upgrade to Quincy

Hey Andreas,

thanks for the info.

We also had our MGR reporting crashes related to the module.

We have a second cluster acting as a mirror, which we also updated to
Quincy. There the MGR is able to use the snap_schedule module (so
"ceph fs snap-schedule status" etc. are not complaining), and I'm able
to schedule snapshots. But we didn't have any schedules there before
the upgrade (since it is only the mirror).

I also noticed that this particular part of the code you mentioned 
hasn't been touched in a year and a half:
https://github.com/ceph/ceph/blame/ec95624474b1871a821a912b8c3af68f8f8e7aa1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
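For anyone following along: if your diagnosis below is right, the fix
would presumably be a one-line change, roughly like this (my paraphrase
of the linked code, untested):

    # Paraphrased from schedule_client.py around the linked line.
    # Ioctx in the librados Python bindings has no remove() method,
    # so the current call raises AttributeError and crashes the module:
    #     ioctx.remove(SNAP_DB_OBJECT_NAME)
    # The documented call is remove_object():
    with open_ioctx(self, pool) as ioctx:
        ioctx.remove_object(SNAP_DB_OBJECT_NAME)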

So I'm wondering if my previous schedule entries somehow became 
incompatible with the new version.

Do you know if there is any way to reset or clean up the module's 
config / database? That is, remove all the previously scheduled 
snapshots, but without using "ceph fs snap-schedule remove"?
We only have a handful of schedules, which can easily be recreated,
so a clean start would at least be a workaround.
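In case there is no built-in way: from reading schedule_client.py, the
schedules seem to live in a single RADOS object. So I'd naively try
removing that object and letting the module start fresh, along these
lines (a sketch only; the object name "snap_db_v0" and the pool name
"cephfs_metadata" are my assumptions, please verify both on your
cluster, e.g. with "rados -p <metadata pool> ls", before removing
anything):

    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        # Pool and object name are assumptions, see above.
        ioctx = cluster.open_ioctx("cephfs_metadata")
        try:
            ioctx.remove_object("snap_db_v0")
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

But I'd be happy to hear if there is an official way to do this.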

Otherwise we will keep using simple cron jobs until these issues are 
fixed. After all, regularly executed mkdir and rmdir calls are all you 
need to get started.
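For reference, since CephFS creates a snapshot on mkdir inside the
special .snap directory and deletes it on rmdir, something as small as
this (run from cron; the mount path, prefix and retention count are
placeholders for our setup) already does the job:

    #!/usr/bin/env python3
    # Minimal cron-style snapshot rotation for one CephFS directory.
    import os
    import time

    TARGET = "/mnt/cephfs/data"  # placeholder: a mount of the FS
    KEEP = 7                     # number of snapshots to retain
    PREFIX = "cron-"

    snapdir = os.path.join(TARGET, ".snap")

    # mkdir in .snap creates a new CephFS snapshot.
    os.mkdir(os.path.join(snapdir,
                          PREFIX + time.strftime("%Y-%m-%dT%H-%M-%S")))

    # rmdir in .snap deletes a snapshot; prune the oldest ones.
    ours = sorted(e for e in os.listdir(snapdir) if e.startswith(PREFIX))
    for old in ours[:-KEEP]:
        os.rmdir(os.path.join(snapdir, old))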

Best Wishes,
Mathias


On 7/6/2022 5:05 PM, Andreas Teuchert wrote:
> Hello Mathias and others,
>
> I also ran into this problem after upgrading from 16.2.9 to 17.2.1.
>
> Additionally I observed a health warning: "3 mgr modules have recently 
> crashed".
>
> Those are actually two distinct crashes that are already in the tracker:
>
> https://tracker.ceph.com/issues/56269 and
> https://tracker.ceph.com/issues/56270
>
> Considering that the crashes are in the snap_schedule module, I assume 
> they are the reason why the module is not available.
>
> I can reproduce the crash in 56270 by failing over the mgr.
>
> I believe that the faulty code causing the error is this line: 
> https://github.com/ceph/ceph/blob/v17.2.1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>
> Instead of ioctx.remove(SNAP_DB_OBJECT_NAME) it should be 
> ioctx.remove_object(SNAP_DB_OBJECT_NAME).
>
> (According to my understanding of 
> https://docs.ceph.com/en/latest/rados/api/python/.)
>
> Best regards,
>
> Andreas
>
>
> On 01.07.22 18:05, Kuhring, Mathias wrote:
>> Dear Ceph community,
>>
>> After upgrading our cluster to Quincy with cephadm (ceph orch upgrade 
>> start --image quay.io/ceph/ceph:v17.2.1), I struggle to re-activate 
>> the snapshot schedule module:
>>
>> 0|0[root@osd-1 ~]# ceph mgr module enable snap_schedule
>> 0|1[root@osd-1 ~]# ceph mgr module ls | grep snap
>> snap_schedule         on
>>
>> 0|0[root@osd-1 ~]# ceph fs snap-schedule list / --recursive
>> Error ENOENT: Module 'snap_schedule' is not available
>>
>> I tried restarting the MGR daemons and failing over to a restarted 
>> one, but nothing changed.
>>
>> 0|0[root@osd-1 ~]# ceph orch restart mgr
>> Scheduled to restart mgr.osd-1 on host 'osd-1'
>> Scheduled to restart mgr.osd-2 on host 'osd-2'
>> Scheduled to restart mgr.osd-3 on host 'osd-3'
>> Scheduled to restart mgr.osd-4.oylrhe on host 'osd-4'
>> Scheduled to restart mgr.osd-5.jcfyqe on host 'osd-5'
>>
>> 0|0[root@osd-1 ~]# ceph orch ps --daemon_type mgr
>> NAME              HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>> mgr.osd-1         osd-1  *:8443,9283  running (61s)  35s ago    9M   402M     -        17.2.1   e5af760fa1c1  64f7ec70a6aa
>> mgr.osd-2         osd-2  *:8443,9283  running (47s)  36s ago    9M   103M     -        17.2.1   e5af760fa1c1  d25fdc793ff8
>> mgr.osd-3         osd-3  *:8443,9283  running (7h)   36s ago    9M   457M     -        17.2.1   e5af760fa1c1  46d5091e50d6
>> mgr.osd-4.oylrhe  osd-4  *:8443,9283  running (7h)   79s ago    9M   795M     -        17.2.1   e5af760fa1c1  efb2a7cc06c5
>> mgr.osd-5.jcfyqe  osd-5  *:8443,9283  running (8h)   37s ago    9M   448M     -        17.2.1   e5af760fa1c1  96dd03817f32
>>
>> 0|0[root@osd-1 ~]# ceph mgr fail
>>
>> The MGR confirms that the snap_schedule module is not available:
>>
>> 0|0[root@osd-1 ~]# journalctl -eu ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxx-1.service
>>
>> Jul 01 16:25:49 osd-1 bash[662895]: debug 
>> 2022-07-01T14:25:49.825+0000 7f0486408700  0 log_channel(audit) log 
>> [DBG] : from='client.90801080 -' entity='client.admin' 
>> cmd=[{"prefix": "fs snap-schedule list", "path": "/", "recursive": 
>> true, "target": ["mon-mgr", ""]}]: dispatch
>> Jul 01 16:25:49 osd-1 bash[662895]: debug 
>> 2022-07-01T14:25:49.825+0000 7f0486c09700 -1 mgr.server reply reply 
>> (2) No such file or directory Module 'snap_schedule' is not available
>>
>> But I'm not sure where the MGR is actually looking. The module path is:
>>
>> 0|22[root@osd-1 ~]# ceph config get mgr mgr_module_path
>> /usr/share/ceph/mgr
>>
>> And while it is not available on the host (I assume these files are 
>> just remnants from before our migration to cephadm/docker anyway):
>>
>> 0|0[root@osd-1 ~]# ll /usr/share/ceph/mgr
>> ...
>> drwxr-xr-x. 4 root root   144 22. Sep 2021  restful
>> drwxr-xr-x. 3 root root    61 22. Sep 2021  selftest
>> drwxr-xr-x. 3 root root    61 22. Sep 2021  status
>> drwxr-xr-x. 3 root root   117 22. Sep 2021  telegraf
>> ...
>>
>> The module is available in the MGR container (which I assume is where 
>> the MGR would look):
>>
>> 0|0[root@osd-1 ~]# docker exec -it 
>> ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f-mgr-osd-1 /bin/bash
>> [root@osd-1 /]# ls -l /usr/share/ceph/mgr
>> ...
>> drwxr-xr-x.  4 root root    65 Jun 23 19:48 snap_schedule
>> ...
>>
>> The module was available before on Pacific, which was also deployed 
>> with cephadm.
>> Does anybody have an idea how I can investigate this further?
>> Thanks again for all your help!
>>
>> Best Wishes,
>> Mathias
>>
>>
>>
>>

-- 
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail:  mathias.kuhring@xxxxxxxxxxxxxx
Mobile: +49 172 3475576

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



