Hello,

I had the same problem (after upgrading a Proxmox cluster from Pacific to Quincy).
I followed the instructions and I am happy to report that it worked for me too.
Just wanted to add another data point.

Thanks for the invaluable hints!

George

> On Jul 7, 2022, at 11:13 AM, Kuhring, Mathias <mathias.kuhring@xxxxxxxxxxxxxx> wrote:
>
> Hey Andreas,
>
> Indeed, we were also able to remove the legacy schedule DB,
> and the scheduler is now picking up the work again.
> Wouldn't have known where to look for it.
> Thanks for your help and all the details. I really appreciate it.
>
> Best, Mathias
>
> On 7/7/2022 11:46 AM, Andreas Teuchert wrote:
>> Hello Mathias,
>>
>> On 06.07.22 18:27, Kuhring, Mathias wrote:
>>> Hey Andreas,
>>>
>>> thanks for the info.
>>>
>>> We also had our MGR reporting crashes related to the module.
>>>
>>> We have a second cluster as a mirror, which we also updated to Quincy.
>>> But there the MGR is able to use the snap_schedule module (so "ceph fs
>>> snap-schedule status" etc. are not complaining).
>>> And I'm able to schedule snapshots. But we didn't have any schedules
>>> there before the upgrade (due to it being the mirror).
>>
>> I think in that case there is no RADOS object for the legacy schedule
>> DB, which is handled gracefully by the code.
>>
>>> I also noticed that this particular part of the code you mentioned
>>> hasn't been touched in a year and a half:
>>> https://github.com/ceph/ceph/blame/ec95624474b1871a821a912b8c3af68f8f8e7aa1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>>
>> The relevant change was made 17 months ago, but it was not backported
>> to Pacific and is only included in Quincy.
>>
>>> So I'm wondering if my previous schedule entries somehow became
>>> incompatible with the new version.
>>
>> The schedule entries are still the same. What changed is that the
>> sqlite DB they are stored in is no longer stored as a DB dump in a
>> RADOS object in the FS's metadata pool. Instead, the sqlite Ceph VFS
>> driver is now used to store the DB in the metadata pool.
>>
>>> Do you know if there is any way to reset or clean up the module's
>>> config and database?
>>> That is, remove all the previously scheduled snapshots, but without
>>> using "fs snap-schedule remove"?
>>> We only have a handful of schedules, which can easily be recreated.
>>> So maybe a clean start would at least be a workaround.
>>
>> We could solve the problem by simply deleting the legacy schedule DB
>> after the upgrade:
>>
>> rados -p <FS metadata pool name> -N cephfs-snap-schedule rm snap_db_v0
>>
>> Afterwards the mgr has to be restarted or failed over.
>>
>> The schedules are still there afterwards because they have already
>> been migrated to the new DB.
>>
>> Thanks to my colleague Chris Glaubitz for figuring out that the object
>> is in a separate namespace. :-)
>>
>>> Otherwise we will keep using simple cron jobs until these issues are
>>> fixed. After all, you just need regularly executed mkdir and rmdir
>>> calls to get started.
>>>
>>> Best wishes,
>>> Mathias
>>
>> Best regards,
>>
>> Andreas
>>
>>> On 7/6/2022 5:05 PM, Andreas Teuchert wrote:
>>>> Hello Mathias and others,
>>>>
>>>> I also ran into this problem after upgrading from 16.2.9 to 17.2.1.
>>>>
>>>> Additionally, I observed a health warning: "3 mgr modules have
>>>> recently crashed".
>>>>
>>>> Those are actually two distinct crashes that are already in the
>>>> tracker:
>>>>
>>>> https://tracker.ceph.com/issues/56269 and
>>>> https://tracker.ceph.com/issues/56270
>>>>
>>>> Considering that the crashes are in the snap_schedule module, I assume
>>>> that they are the reason why the module is not available.
>>>>
>>>> I can reproduce the crash in 56270 by failing over the mgr.
>>>>
>>>> I believe that the faulty code causing the error is this line:
>>>> https://github.com/ceph/ceph/blob/v17.2.1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>>>>
>>>> Instead of ioctx.remove(SNAP_DB_OBJECT_NAME) it should be
>>>> ioctx.remove_object(SNAP_DB_OBJECT_NAME).
>>>>
>>>> (According to my understanding of
>>>> https://docs.ceph.com/en/latest/rados/api/python/.)
>>>>
>>>> Best regards,
>>>>
>>>> Andreas
>>>>
>>>> On 01.07.22 18:05, Kuhring, Mathias wrote:
>>>>> Dear Ceph community,
>>>>>
>>>>> After upgrading our cluster to Quincy with cephadm (ceph orch upgrade
>>>>> start --image quay.io/ceph/ceph:v17.2.1), I'm struggling to re-activate
>>>>> the snapshot schedule module:
>>>>>
>>>>> 0|0[root@osd-1 ~]# ceph mgr module enable snap_schedule
>>>>> 0|1[root@osd-1 ~]# ceph mgr module ls | grep snap
>>>>> snap_schedule         on
>>>>>
>>>>> 0|0[root@osd-1 ~]# ceph fs snap-schedule list / --recursive
>>>>> Error ENOENT: Module 'snap_schedule' is not available
>>>>>
>>>>> I tried restarting the MGR daemons and failing over to a restarted
>>>>> one, but with no change.
>>>>>
>>>>> 0|0[root@osd-1 ~]# ceph orch restart mgr
>>>>> Scheduled to restart mgr.osd-1 on host 'osd-1'
>>>>> Scheduled to restart mgr.osd-2 on host 'osd-2'
>>>>> Scheduled to restart mgr.osd-3 on host 'osd-3'
>>>>> Scheduled to restart mgr.osd-4.oylrhe on host 'osd-4'
>>>>> Scheduled to restart mgr.osd-5.jcfyqe on host 'osd-5'
>>>>>
>>>>> 0|0[root@osd-1 ~]# ceph orch ps --daemon_type mgr
>>>>> NAME              HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>>>>> mgr.osd-1         osd-1  *:8443,9283  running (61s)  35s ago    9M   402M     -        17.2.1   e5af760fa1c1  64f7ec70a6aa
>>>>> mgr.osd-2         osd-2  *:8443,9283  running (47s)  36s ago    9M   103M     -        17.2.1   e5af760fa1c1  d25fdc793ff8
>>>>> mgr.osd-3         osd-3  *:8443,9283  running (7h)   36s ago    9M   457M     -        17.2.1   e5af760fa1c1  46d5091e50d6
>>>>> mgr.osd-4.oylrhe  osd-4  *:8443,9283  running (7h)   79s ago    9M   795M     -        17.2.1   e5af760fa1c1  efb2a7cc06c5
>>>>> mgr.osd-5.jcfyqe  osd-5  *:8443,9283  running (8h)   37s ago    9M   448M     -        17.2.1   e5af760fa1c1  96dd03817f32
>>>>>
>>>>> 0|0[root@osd-1 ~]# ceph mgr fail
>>>>>
>>>>> The MGR confirms that the snap_schedule module is not available:
>>>>>
>>>>> 0|0[root@osd-1 ~]# journalctl -eu
>>>>> ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxx-1.service
>>>>>
>>>>> Jul 01 16:25:49 osd-1 bash[662895]: debug
>>>>> 2022-07-01T14:25:49.825+0000 7f0486408700  0 log_channel(audit) log
>>>>> [DBG] : from='client.90801080 -' entity='client.admin'
>>>>> cmd=[{"prefix": "fs snap-schedule list", "path": "/", "recursive":
>>>>> true, "target": ["mon-mgr", ""]}]: dispatch
>>>>> Jul 01 16:25:49 osd-1 bash[662895]: debug
>>>>> 2022-07-01T14:25:49.825+0000 7f0486c09700 -1 mgr.server reply reply
>>>>> (2) No such file or directory Module 'snap_schedule' is not available
>>>>>
>>>>> But I'm not sure where the MGR is actually looking.
>>>>> The module path is:
>>>>>
>>>>> 0|22[root@osd-1 ~]# ceph config get mgr mgr_module_path
>>>>> /usr/share/ceph/mgr
>>>>>
>>>>> And while it is not available on the host (I assume these are just
>>>>> remnants from before our change to cephadm/docker, anyway):
>>>>>
>>>>> 0|0[root@osd-1 ~]# ll /usr/share/ceph/mgr
>>>>> ...
>>>>> drwxr-xr-x. 4 root root 144 22. Sep 2021 restful
>>>>> drwxr-xr-x. 3 root root  61 22. Sep 2021 selftest
>>>>> drwxr-xr-x. 3 root root  61 22. Sep 2021 status
>>>>> drwxr-xr-x. 3 root root 117 22. Sep 2021 telegraf
>>>>> ...
>>>>>
>>>>> The module is available in the MGR container (which I assume is where
>>>>> the MGR would look):
>>>>>
>>>>> 0|0[root@osd-1 ~]# docker exec -it
>>>>> ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f-mgr-osd-1 /bin/bash
>>>>> [root@osd-1 /]# ls -l /usr/share/ceph/mgr
>>>>> ...
>>>>> drwxr-xr-x. 4 root root  65 Jun 23 19:48 snap_schedule
>>>>> ...
>>>>>
>>>>> The module was available before on Pacific, which was also deployed
>>>>> with cephadm.
>>>>> Does anybody have an idea how I can investigate this further?
>>>>> Thanks again for all your help!
>>>>>
>>>>> Best wishes,
>>>>> Mathias
>
> --
> Mathias Kuhring
>
> Dr. rer. nat.
> Bioinformatician
> HPC & Core Unit Bioinformatics
> Berlin Institute of Health at Charité (BIH)
>
> E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
> Mobile: +49 172 3475576
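
[Aside: for anyone landing on this thread with the same symptom, the fix discussed above
condenses into a short sequence. This is a minimal sketch, assuming a single CephFS and
that "<FS metadata pool name>" is replaced with the metadata pool reported by "ceph fs ls";
the schedules themselves are expected to survive, since the Quincy mgr has already
migrated them to the new VFS-backed DB:

# note the metadata pool of the filesystem
ceph fs ls
# confirm the legacy schedule DB object exists in its separate namespace
rados -p <FS metadata pool name> -N cephfs-snap-schedule ls
# remove only the legacy DB dump; the migrated schedules live in the new DB
rados -p <FS metadata pool name> -N cephfs-snap-schedule rm snap_db_v0
# fail over the active mgr so the snap_schedule module loads cleanly
ceph mgr fail
# the module should answer again instead of "Module 'snap_schedule' is not available"
ceph fs snap-schedule status /
]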
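
[Aside: the "3 mgr modules have recently crashed" warning mentioned above persists until
the crash reports are archived. A minimal sketch of inspecting and clearing them with the
standard crash module commands, assuming admin access on a cluster node:

# list recent daemon/module crashes with their IDs
ceph crash ls
# show metadata and backtrace for a single crash
ceph crash info <crash-id>
# once the crashes are understood (e.g. matched to the tracker issues above),
# archive them so the health warning clears
ceph crash archive-all
]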
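
[Aside: once the module is healthy again, a handful of schedules such as the ones
mentioned above can be recreated with the snap-schedule CLI. The path, period and
retention values below are illustrative assumptions, not taken from the thread:

# snapshot the FS root every hour and keep 24 hourly snapshots
ceph fs snap-schedule add / 1h
ceph fs snap-schedule retention add / h 24
ceph fs snap-schedule status /

The cron fallback mentioned by Mathias works because CephFS snapshots are created and
removed through the special .snap directory. A rough crontab sketch, assuming a client
mount at the hypothetical path /mnt/cephfs (note that "%" must be escaped in crontab):

# hourly snapshot, plus clean-up of the snapshot taken 48 hours earlier
0 * * * * mkdir /mnt/cephfs/.snap/cron_$(date +\%Y\%m\%d\%H)
5 * * * * rmdir /mnt/cephfs/.snap/cron_$(date -d '48 hours ago' +\%Y\%m\%d\%H) 2>/dev/null
]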