Hey Andreas,

Indeed, we were also able to remove the legacy schedule DB, and the scheduler is now picking up the work again.
Wouldn't have known where to look for it.

Thanks for your help and all the details. I really appreciate it.

Best, Mathias

On 7/7/2022 11:46 AM, Andreas Teuchert wrote:
> Hello Mathias,
>
> On 06.07.22 18:27, Kuhring, Mathias wrote:
>> Hey Andreas,
>>
>> thanks for the info.
>>
>> We also had our MGR reporting crashes related to the module.
>>
>> We have a second cluster as mirror which we also updated to Quincy.
>> But there the MGR is able to use the snap_schedule module (so "ceph fs
>> snap-schedule status" etc. are not complaining).
>> And I'm able to schedule snapshots. But we didn't have any schedules
>> there before the upgrade (due to it being the mirror).
>
> I think in that case there is no RADOS object for the legacy schedule
> DB, which is handled gracefully by the code.
>
>> I also noticed that this particular part of the code you mentioned
>> hasn't been touched in a year and a half:
>> https://github.com/ceph/ceph/blame/ec95624474b1871a821a912b8c3af68f8f8e7aa1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>
> The relevant change was made 17 months ago, but it was not backported
> to Pacific and is only included in Quincy.
>
>> So I'm wondering if my previous schedule entries somehow became
>> incompatible with the new version.
>
> The schedule entries are still the same. What changed is that the
> sqlite DB they are stored in is no longer stored as a DB dump in a
> RADOS object in the FS's metadata pool. Instead, the sqlite Ceph VFS
> driver is now used to store the DB in the metadata pool.
>
>> Do you know if there is any way to reset/clean up the module's
>> config/database?
>> That is, remove all the previously scheduled snapshots, but without
>> using "fs snap-schedule remove"?
>> We only have a handful of schedules which can easily be recreated.
>> So maybe a clean start would at least be a workaround.
>
> We could just solve the problem by deleting the legacy schedule DB
> after the upgrade:
>
> rados -p <FS metadata pool name> -N cephfs-snap-schedule rm snap_db_v0
>
> Afterwards the mgr has to be restarted/failed over.
>
> The schedules are still there afterwards because they have already
> been migrated to the new DB.
>
> Thanks to my colleague Chris Glaubitz for figuring out that the object
> is in a separate namespace. :-)
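>
> In case the rados CLI is not at hand, the same removal should also work
> from Python. Below is a minimal, untested sketch using python-rados; the
> conf path is an assumption and the pool name is a placeholder, as above:
>
> import rados
>
> # Connect using an assumed default conf path (adjust as needed).
> cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
> cluster.connect()
> try:
>     # Replace the placeholder with the actual FS metadata pool name.
>     ioctx = cluster.open_ioctx("<FS metadata pool name>")
>     try:
>         # The legacy schedule DB object lives in its own namespace.
>         ioctx.set_namespace("cephfs-snap-schedule")
>         # remove_object() is the deletion call documented for python-rados.
>         ioctx.remove_object("snap_db_v0")
>     finally:
>         ioctx.close()
> finally:
>     cluster.shutdown()
>
> Either way, the mgr still has to be restarted/failed over afterwards.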
>
>> Otherwise we will keep simple cron jobs until these issues are fixed.
>> After all, you just need regularly executed mkdir and rmdir to get you
>> started.
>>
>> Best Wishes,
>> Mathias
>
> Best regards,
>
> Andreas
>
>> On 7/6/2022 5:05 PM, Andreas Teuchert wrote:
>>> Hello Mathias and others,
>>>
>>> I also ran into this problem after upgrading from 16.2.9 to 17.2.1.
>>>
>>> Additionally I observed a health warning: "3 mgr modules have
>>> recently crashed".
>>>
>>> Those are actually two distinct crashes that are already in the
>>> tracker:
>>>
>>> https://tracker.ceph.com/issues/56269 and
>>> https://tracker.ceph.com/issues/56270
>>>
>>> Considering that the crashes are in the snap_schedule module, I
>>> assume that they are the reason why the module is not available.
>>>
>>> I can reproduce the crash in 56270 by failing over the mgr.
>>>
>>> I believe that the faulty code causing the error is this line:
>>> https://github.com/ceph/ceph/blob/v17.2.1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>>>
>>> Instead of ioctx.remove(SNAP_DB_OBJECT_NAME) it should be
>>> ioctx.remove_object(SNAP_DB_OBJECT_NAME).
>>>
>>> (According to my understanding of
>>> https://docs.ceph.com/en/latest/rados/api/python/.)
>>>
>>> Best regards,
>>>
>>> Andreas
>>>
>>> On 01.07.22 18:05, Kuhring, Mathias wrote:
>>>> Dear Ceph community,
>>>>
>>>> After upgrading our cluster to Quincy with cephadm (ceph orch upgrade
>>>> start --image quay.io/ceph/ceph:v17.2.1), I'm struggling to
>>>> re-activate the snapshot schedule module:
>>>>
>>>> 0|0[root@osd-1 ~]# ceph mgr module enable snap_schedule
>>>> 0|1[root@osd-1 ~]# ceph mgr module ls | grep snap
>>>> snap_schedule    on
>>>>
>>>> 0|0[root@osd-1 ~]# ceph fs snap-schedule list / --recursive
>>>> Error ENOENT: Module 'snap_schedule' is not available
>>>>
>>>> I tried restarting the MGR daemons and failing over to a restarted
>>>> one, but nothing changed.
>>>>
>>>> 0|0[root@osd-1 ~]# ceph orch restart mgr
>>>> Scheduled to restart mgr.osd-1 on host 'osd-1'
>>>> Scheduled to restart mgr.osd-2 on host 'osd-2'
>>>> Scheduled to restart mgr.osd-3 on host 'osd-3'
>>>> Scheduled to restart mgr.osd-4.oylrhe on host 'osd-4'
>>>> Scheduled to restart mgr.osd-5.jcfyqe on host 'osd-5'
>>>>
>>>> 0|0[root@osd-1 ~]# ceph orch ps --daemon_type mgr
>>>> NAME              HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>>>> mgr.osd-1         osd-1  *:8443,9283  running (61s)  35s ago    9M      402M        -  17.2.1   e5af760fa1c1  64f7ec70a6aa
>>>> mgr.osd-2         osd-2  *:8443,9283  running (47s)  36s ago    9M      103M        -  17.2.1   e5af760fa1c1  d25fdc793ff8
>>>> mgr.osd-3         osd-3  *:8443,9283  running (7h)   36s ago    9M      457M        -  17.2.1   e5af760fa1c1  46d5091e50d6
>>>> mgr.osd-4.oylrhe  osd-4  *:8443,9283  running (7h)   79s ago    9M      795M        -  17.2.1   e5af760fa1c1  efb2a7cc06c5
>>>> mgr.osd-5.jcfyqe  osd-5  *:8443,9283  running (8h)   37s ago    9M      448M        -  17.2.1   e5af760fa1c1  96dd03817f32
>>>>
>>>> 0|0[root@osd-1 ~]# ceph mgr fail
>>>>
>>>> The MGR confirms that the snap_schedule module is not available:
>>>>
>>>> 0|0[root@osd-1 ~]# journalctl -eu ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxx-1.service
>>>>
>>>> Jul 01 16:25:49 osd-1 bash[662895]: debug 2022-07-01T14:25:49.825+0000 7f0486408700  0 log_channel(audit) log [DBG] : from='client.90801080 -' entity='client.admin' cmd=[{"prefix": "fs snap-schedule list", "path": "/", "recursive": true, "target": ["mon-mgr", ""]}]: dispatch
>>>> Jul 01 16:25:49 osd-1 bash[662895]: debug 2022-07-01T14:25:49.825+0000 7f0486c09700 -1 mgr.server reply reply (2) No such file or directory Module 'snap_schedule' is not available
>>>>
>>>> But I'm not sure where the MGR is actually looking. The module path is:
>>>>
>>>> 0|22[root@osd-1 ~]# ceph config get mgr mgr_module_path
>>>> /usr/share/ceph/mgr
>>>>
>>>> And while it is not available on the host (I assume these are just
>>>> remnants from before our switch to cephadm/docker anyway):
>>>>
>>>> 0|0[root@osd-1 ~]# ll /usr/share/ceph/mgr
>>>> ...
>>>> drwxr-xr-x. 4 root root 144 22. Sep 2021 restful
>>>> drwxr-xr-x. 3 root root  61 22. Sep 2021 selftest
>>>> drwxr-xr-x. 3 root root  61 22. Sep 2021 status
>>>> drwxr-xr-x. 3 root root 117 22. Sep 2021 telegraf
>>>> ...
>>>>
>>>> The module is available in the MGR container (which I assume is where
>>>> the MGR would look):
>>>>
>>>> 0|0[root@osd-1 ~]# docker exec -it ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f-mgr-osd-1 /bin/bash
>>>> [root@osd-1 /]# ls -l /usr/share/ceph/mgr
>>>> ...
>>>> drwxr-xr-x. 4 root root 65 Jun 23 19:48 snap_schedule
>>>> ...
>>>>
>>>> The module was available before on Pacific, which was also deployed
>>>> with cephadm.
>>>> Does anybody have an idea how I can investigate this further?
>>>> Thanks again for all your help!
>>>>
>>>> Best Wishes,
>>>> Mathias

--
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
Mobile: +49 172 3475576

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx