Re: [ext] Re: snap_schedule MGR module not available after upgrade to Quincy

"Kyriazis, George" <george.kyriazis@xxxxxxxxx> · Tue, 2 Aug 2022 00:56:03 +0000

Hello,

I had the same problem (after upgrading a Proxmox cluster from Pacific to Quincy).  Followed the instructions and I am happy to report that it worked for me too.  Just wanted to add another data point.

Thanks for the invaluable hints!

George

> On Jul 7, 2022, at 11:13 AM, Kuhring, Mathias <mathias.kuhring@xxxxxxxxxxxxxx> wrote:
> 
> Hey Andreas,
> 
> Indeed, we were also able to remove the legacy schedule DB
> and the scheduler is now picking up the work again.
> Wouldn't have known where to look for it.
> Thanks for your help and all the details. I really appreciate it.
> 
> Best, Mathias
> 
> On 7/7/2022 11:46 AM, Andreas Teuchert wrote:
>> Hello Mathias,
>> 
>> On 06.07.22 18:27, Kuhring, Mathias wrote:
>>> Hey Andreas,
>>> 
>>> thanks for the info.
>>> 
>>> We also had our MGR reporting crashes related to the module.
>>> 
>>> We have a second cluster as mirror which we also updated to Quincy.
>>> But there the MGR is able to use the snap_schedule module (so "ceph fs
>>> snap-schedule status" etc. are not complaining).
>>> And I'm able to schedule snapshots. But we didn't have any schedules
>>> there before the upgrade (due to it being the mirror).
>> 
>> I think in that case there is no RADOS object for the legacy schedule 
>> DB, which is handled gracefully by the code.
>> 
>>> 
>>> I also noticed that this particular part of the code you mentioned
>>> hasn't been touched in a year and a half:
>>> https://github.com/ceph/ceph/blame/ec95624474b1871a821a912b8c3af68f8f8e7aa1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193 
>>> 
>>> 
>> 
>> The relevant change was made 17 months ago but it was not backported 
>> to Pacific and is only included in Quincy.
>> 
>>> So I'm wondering if my previous schedule entries got somehow
>>> incompatible with the new version.
>> 
>> The schedule entries are still the same. What changed is that the 
>> sqlite DB they are stored in is no longer stored as a DB dump in 
>> a RADOS object in the FS's metadata pool. Instead, the sqlite Ceph 
>> VFS driver is now used to store the DB in the metadata pool.
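A minimal sketch of how one might check whether the legacy DB dump object is still present, assuming the Python `rados` bindings and the object name and namespace used in the `rados` command quoted in this thread. `legacy_db_exists` is a hypothetical helper, not part of the snap_schedule module.

```python
try:
    import rados  # Ceph's Python bindings; only present where Ceph is installed
    ObjectNotFound = rados.ObjectNotFound
except ImportError:
    class ObjectNotFound(Exception):
        """Stand-in so the sketch can be read and tested without Ceph."""

# Both values are taken from the rados command quoted in this thread.
LEGACY_DB_OBJECT = "snap_db_v0"
LEGACY_DB_NAMESPACE = "cephfs-snap-schedule"


def legacy_db_exists(ioctx) -> bool:
    """Return True if the legacy schedule DB object is still in the pool.

    `ioctx` is expected to behave like a rados.Ioctx opened on the
    file system's metadata pool.
    """
    ioctx.set_namespace(LEGACY_DB_NAMESPACE)
    try:
        ioctx.stat(LEGACY_DB_OBJECT)
    except ObjectNotFound:
        return False
    return True
```

On a cluster node this could be driven by `rados.Rados(conffile=...)`, `connect()`, and `open_ioctx("<FS metadata pool name>")`, followed by `ioctx.close()` and `shutdown()`; the config path and pool name depend on the deployment.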
>> 
>>> 
>>> Do you know if there is any way to reset/clean up the module's config /
>>> database?
>>> So remove all the previously scheduled snapshots but without using "fs
>>> snap-schedule remove"?
>>> We only have a handful of schedules which can easily be recreated.
>>> So maybe a clean start would at least be a workaround.
>> 
>> We were able to solve the problem simply by deleting the legacy 
>> schedule DB after the upgrade:
>> 
>> rados -p <FS metadata pool name> -N cephfs-snap-schedule rm snap_db_v0
>> 
>> Afterwards the mgr has to be restarted or failed over.
>> 
>> The schedules are still there afterwards because they have already 
>> been migrated to the new DB.
>> 
>> Thanks to my colleague Chris Glaubitz for figuring out that the object 
>> is in a separate namespace. :-)
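The two workaround steps above can be sketched as a small dry-run wrapper. This is only an illustration: `workaround_commands` and `run_workaround` are hypothetical names, the pool name must be supplied by the operator, and `ceph mgr fail` is used for the failover step because that is the form shown elsewhere in this thread.

```python
import shlex
import subprocess
from typing import List

# Object name and namespace as given in the rados command above.
LEGACY_DB_OBJECT = "snap_db_v0"
LEGACY_DB_NAMESPACE = "cephfs-snap-schedule"


def workaround_commands(metadata_pool: str) -> List[List[str]]:
    """Build the argv lists for the workaround, in execution order."""
    return [
        # 1. Remove the legacy schedule DB object from its namespace.
        ["rados", "-p", metadata_pool, "-N", LEGACY_DB_NAMESPACE,
         "rm", LEGACY_DB_OBJECT],
        # 2. Fail over the active mgr so the module is loaded afresh.
        ["ceph", "mgr", "fail"],
    ]


def run_workaround(metadata_pool: str, dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for argv in workaround_commands(metadata_pool):
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling `run_workaround("<FS metadata pool name>")` with the default `dry_run=True` only prints the commands, which makes it easy to review them before removing anything.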
>> 
>>> 
>>> Otherwise we will keep simple cron jobs until these issues are fixed.
>>> After all, you just need regularly executed mkdir and rmdir to get you
>>> started.
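As a rough sketch of that cron-job fallback: CephFS snapshots are created by `mkdir` inside a directory's `.snap` subdirectory and deleted by `rmdir`. The helper names, the `cron-` prefix, and the timestamp format below are assumptions chosen for illustration, and `.snap` is the default snapshot directory name.

```python
import os
import time


def take_snapshot(directory, prefix="cron", now=None):
    """Create <directory>/.snap/<prefix>-<UTC timestamp> and return its path."""
    stamp = time.strftime("%Y-%m-%d_%H%M%S", time.gmtime(now))
    path = os.path.join(directory, ".snap", f"{prefix}-{stamp}")
    os.mkdir(path)  # on CephFS this takes the snapshot
    return path


def prune_snapshots(directory, keep, prefix="cron"):
    """Remove all but the `keep` newest snapshots carrying our prefix.

    Returns the names that were removed. The timestamp format sorts
    lexicographically, so a plain sort orders snapshots oldest first.
    """
    snap_dir = os.path.join(directory, ".snap")
    names = sorted(n for n in os.listdir(snap_dir)
                   if n.startswith(prefix + "-"))
    doomed = names[:-keep] if keep > 0 else names
    for name in doomed:
        os.rmdir(os.path.join(snap_dir, name))  # on CephFS this deletes the snapshot
    return doomed
```

A cron entry would then call `take_snapshot("/mnt/cephfs/some/dir")` followed by `prune_snapshots(..., keep=N)`; the mount path here is a placeholder.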
>>> 
>>> Best Wishes,
>>> Mathias
>>> 
>>> 
>> 
>> 
>> Best regards,
>> 
>> Andreas
>> 
>>> 
>>> 
>>> 
>>> On 7/6/2022 5:05 PM, Andreas Teuchert wrote:
>>>> Hello Mathias and others,
>>>> 
>>>> I also ran into this problem after upgrading from 16.2.9 to 17.2.1.
>>>> 
>>>> Additionally I observed a health warning: "3 mgr modules have recently
>>>> crashed".
>>>> 
>>>> Those are actually two distinct crashes that are already in the 
>>>> tracker:
>>>> 
>>>> https://tracker.ceph.com/issues/56269 and
>>>> https://tracker.ceph.com/issues/56270
>>>> 
>>>> Considering that the crashes are in the snap_schedule module I assume
>>>> that they are the reason why the module is not available.
>>>> 
>>>> I can reproduce the crash in 56270 by failing over the mgr.
>>>> 
>>>> I believe that the faulty code causing the error is this line:
>>>> https://github.com/ceph/ceph/blob/v17.2.1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193 
>>>> 
>>>> 
>>>> Instead of ioctx.remove(SNAP_DB_OBJECT_NAME) it should be
>>>> ioctx.remove_object(SNAP_DB_OBJECT_NAME).
>>>> 
>>>> (According to my understanding of
>>>> https://docs.ceph.com/en/latest/rados/api/python/.)
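A minimal sketch of the call Andreas proposes, assuming the `rados` Python bindings' `Ioctx.remove_object` and the object name `snap_db_v0` seen in the `rados` command in this thread. `drop_legacy_db` is a hypothetical helper for illustration and is not the module's actual code.

```python
try:
    import rados  # Ceph's Python bindings; only present where Ceph is installed
    ObjectNotFound = rados.ObjectNotFound
except ImportError:
    class ObjectNotFound(Exception):
        """Stand-in so the sketch can be read and tested without Ceph."""

# Assumed value; the name matches the object removed with `rados rm` in this thread.
SNAP_DB_OBJECT_NAME = "snap_db_v0"


def drop_legacy_db(ioctx) -> bool:
    """Remove the legacy DB object via remove_object().

    Returns True if the object was removed and False if it did not
    exist. The caller is expected to have selected the right pool and
    namespace on `ioctx` beforehand.
    """
    try:
        ioctx.remove_object(SNAP_DB_OBJECT_NAME)
    except ObjectNotFound:
        return False
    return True
```

Treating a missing object as a non-error mirrors the observation in this thread that a cluster without a legacy DB object upgrades cleanly.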
>>>> 
>>>> Best regards,
>>>> 
>>>> Andreas
>>>> 
>>>> 
>>>> On 01.07.22 18:05, Kuhring, Mathias wrote:
>>>>> Dear Ceph community,
>>>>> 
>>>>> After upgrading our cluster to Quincy with cephadm (ceph orch upgrade
>>>>> start --image quay.io/ceph/ceph:v17.2.1), I struggle to re-activate
>>>>> the snapshot schedule module:
>>>>> 
>>>>> 0|0[root@osd-1 ~]# ceph mgr module enable snap_schedule
>>>>> 0|1[root@osd-1 ~]# ceph mgr module ls | grep snap
>>>>> snap_schedule         on
>>>>> 
>>>>> 0|0[root@osd-1 ~]# ceph fs snap-schedule list / --recursive
>>>>> Error ENOENT: Module 'snap_schedule' is not available
>>>>> 
>>>>> I tried restarting the MGR daemons and failing over to a restarted
>>>>> one, but with no change.
>>>>> 
>>>>> 0|0[root@osd-1 ~]# ceph orch restart mgr
>>>>> Scheduled to restart mgr.osd-1 on host 'osd-1'
>>>>> Scheduled to restart mgr.osd-2 on host 'osd-2'
>>>>> Scheduled to restart mgr.osd-3 on host 'osd-3'
>>>>> Scheduled to restart mgr.osd-4.oylrhe on host 'osd-4'
>>>>> Scheduled to restart mgr.osd-5.jcfyqe on host 'osd-5'
>>>>> 
>>>>> 0|0[root@osd-1 ~]# ceph orch ps --daemon_type mgr
>>>>> NAME              HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>>>>> mgr.osd-1         osd-1  *:8443,9283  running (61s)  35s ago    9M   402M     -        17.2.1   e5af760fa1c1  64f7ec70a6aa
>>>>> mgr.osd-2         osd-2  *:8443,9283  running (47s)  36s ago    9M   103M     -        17.2.1   e5af760fa1c1  d25fdc793ff8
>>>>> mgr.osd-3         osd-3  *:8443,9283  running (7h)   36s ago    9M   457M     -        17.2.1   e5af760fa1c1  46d5091e50d6
>>>>> mgr.osd-4.oylrhe  osd-4  *:8443,9283  running (7h)   79s ago    9M   795M     -        17.2.1   e5af760fa1c1  efb2a7cc06c5
>>>>> mgr.osd-5.jcfyqe  osd-5  *:8443,9283  running (8h)   37s ago    9M   448M     -        17.2.1   e5af760fa1c1  96dd03817f32
>>>>> 
>>>>> 0|0[root@osd-1 ~]# ceph mgr fail
>>>>> 
>>>>> The MGR confirms that the snap_schedule module is not available:
>>>>> 
>>>>> 0|0[root@osd-1 ~]# journalctl -eu ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxx-1.service
>>>>> 
>>>>> 
>>>>> Jul 01 16:25:49 osd-1 bash[662895]: debug
>>>>> 2022-07-01T14:25:49.825+0000 7f0486408700  0 log_channel(audit) log
>>>>> [DBG] : from='client.90801080 -' entity='client.admin'
>>>>> cmd=[{"prefix": "fs snap-schedule list", "path": "/", "recursive":
>>>>> true, "target": ["mon-mgr", ""]}]: dispatch
>>>>> Jul 01 16:25:49 osd-1 bash[662895]: debug
>>>>> 2022-07-01T14:25:49.825+0000 7f0486c09700 -1 mgr.server reply reply
>>>>> (2) No such file or directory Module 'snap_schedule' is not available
>>>>> 
>>>>> But I'm not sure where the MGR is actually looking. The module path 
>>>>> is:
>>>>> 
>>>>> 0|22[root@osd-1 ~]# ceph config get mgr mgr_module_path
>>>>> /usr/share/ceph/mgr
>>>>> 
>>>>> And while it is not available on the host (I assume these are just
>>>>> remnants from before our change to cephadm/docker, anyways):
>>>>> 
>>>>> 0|0[root@osd-1 ~]# ll /usr/share/ceph/mgr
>>>>> ...
>>>>> drwxr-xr-x. 4 root root   144 22. Sep 2021  restful
>>>>> drwxr-xr-x. 3 root root    61 22. Sep 2021  selftest
>>>>> drwxr-xr-x. 3 root root    61 22. Sep 2021  status
>>>>> drwxr-xr-x. 3 root root   117 22. Sep 2021  telegraf
>>>>> ...
>>>>> 
>>>>> The module is available in the MGR container (which I assume is where
>>>>> the MGR would look):
>>>>> 
>>>>> 0|0[root@osd-1 ~]# docker exec -it
>>>>> ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f-mgr-osd-1 /bin/bash
>>>>> [root@osd-1 /]# ls -l /usr/share/ceph/mgr
>>>>> ...
>>>>> drwxr-xr-x.  4 root root    65 Jun 23 19:48 snap_schedule
>>>>> ...
>>>>> 
>>>>> The module was available before on Pacific, which was also deployed
>>>>> with cephadm.
>>>>> Does anybody have an idea how I can investigate this further?
>>>>> Thanks again for all your help!
>>>>> 
>>>>> Best Wishes,
>>>>> Mathias
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>> 
>> 
> -- 
> Mathias Kuhring
> 
> Dr. rer. nat.
> Bioinformatician
> HPC & Core Unit Bioinformatics
> Berlin Institute of Health at Charité (BIH)
> 
> E-Mail:  mathias.kuhring@xxxxxxxxxxxxxx
> Mobile: +49 172 3475576
> 