Hey Andreas,

Indeed, we were also able to remove the legacy schedule DB, and the scheduler is now picking up the work again.
Wouldn't have known where to look for it.

Thanks for your help and all the details. I really appreciate it.

Best, Mathias

On 7/7/2022 11:46 AM, Andreas Teuchert wrote:
> Hello Mathias,
>
> On 06.07.22 18:27, Kuhring, Mathias wrote:
>> Hey Andreas,
>>
>> thanks for the info.
>>
>> We also had our MGR reporting crashes related to the module.
>>
>> We have a second cluster as mirror which we also updated to Quincy.
>> But there the MGR is able to use the snap_schedule module (so "ceph fs
>> snap-schedule status" etc. are not complaining).
>> And I'm able to schedule snapshots. But we didn't have any schedules
>> there before the upgrade (due to it being the mirror).
>
> I think in that case there is no RADOS object for the legacy schedule
> DB, which is handled gracefully by the code.
>
>> I also noticed that this particular part of the code you mentioned
>> hasn't been touched in a year and a half:
>> https://github.com/ceph/ceph/blame/ec95624474b1871a821a912b8c3af68f8f8e7aa1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>
> The relevant change was made 17 months ago, but it was not backported
> to Pacific and is only included in Quincy.
>
>> So I'm wondering if my previous schedule entries somehow became
>> incompatible with the new version.
>
> The schedule entries are still the same. What changed is that the
> sqlite DB they are stored in is no longer stored as a DB dump in a
> RADOS object in the FS's metadata pool. Instead, the sqlite Ceph VFS
> driver is now used to store the DB in the metadata pool.
>
>> Do you know if there is any way to reset/clean up the module's
>> config/database?
>> That is, remove all the previously scheduled snapshots, but without
>> using "fs snap-schedule remove"?
>> We only have a handful of schedules which can easily be recreated.
>> So maybe a clean start would at least be a workaround.
>
> We could just solve the problem by deleting the legacy schedule DB
> after the upgrade:
>
> rados -p <FS metadata pool name> -N cephfs-snap-schedule rm snap_db_v0
>
> Afterwards the mgr has to be restarted/failed over.
>
> The schedules are still there afterwards because they have already
> been migrated to the new DB.
>
> Thanks to my colleague Chris Glaubitz for figuring out that the object
> is in a separate namespace. :-)
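>
> In case the rados CLI is not at hand, the same removal should also work
> from Python. Below is a minimal, untested sketch using python-rados; the
> conf path is an assumption and the pool name is a placeholder, as above:
>
> import rados
>
> # Connect using an assumed default conf path (adjust as needed).
> cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
> cluster.connect()
> try:
>     # Replace the placeholder with the actual FS metadata pool name.
>     ioctx = cluster.open_ioctx("<FS metadata pool name>")
>     try:
>         # The legacy schedule DB object lives in its own namespace.
>         ioctx.set_namespace("cephfs-snap-schedule")
>         # remove_object() is the deletion call documented for python-rados.
>         ioctx.remove_object("snap_db_v0")
>     finally:
>         ioctx.close()
> finally:
>     cluster.shutdown()
>
> Either way, the mgr still has to be restarted/failed over afterwards.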
>
>> Otherwise we will keep simple cron jobs until these issues are fixed.
>> After all, you just need regularly executed mkdir and rmdir to get you
>> started.
>>
>> Best Wishes,
>> Mathias
>
> Best regards,
>
> Andreas
>
>> On 7/6/2022 5:05 PM, Andreas Teuchert wrote:
>>> Hello Mathias and others,
>>>
>>> I also ran into this problem after upgrading from 16.2.9 to 17.2.1.
>>>
>>> Additionally I observed a health warning: "3 mgr modules have
>>> recently crashed".
>>>
>>> Those are actually two distinct crashes that are already in the
>>> tracker:
>>>
>>> https://tracker.ceph.com/issues/56269 and
>>> https://tracker.ceph.com/issues/56270
>>>
>>> Considering that the crashes are in the snap_schedule module, I
>>> assume that they are the reason why the module is not available.
>>>
>>> I can reproduce the crash in 56270 by failing over the mgr.
>>>
>>> I believe that the faulty code causing the error is this line:
>>> https://github.com/ceph/ceph/blob/v17.2.1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>>>
>>> Instead of ioctx.remove(SNAP_DB_OBJECT_NAME) it should be
>>> ioctx.remove_object(SNAP_DB_OBJECT_NAME).
>>>
>>> (According to my understanding of
>>> https://docs.ceph.com/en/latest/rados/api/python/.)
>>>
>>> Best regards,
>>>
>>> Andreas
>>>
>>> On 01.07.22 18:05, Kuhring, Mathias wrote:
>>>> Dear Ceph community,
>>>>
>>>> After upgrading our cluster to Quincy with cephadm (ceph orch upgrade
>>>> start --image quay.io/ceph/ceph:v17.2.1), I'm struggling to
>>>> re-activate the snapshot schedule module:
>>>>
>>>> 0|0[root@osd-1 ~]# ceph mgr module enable snap_schedule
>>>> 0|1[root@osd-1 ~]# ceph mgr module ls | grep snap
>>>> snap_schedule    on
>>>>
>>>> 0|0[root@osd-1 ~]# ceph fs snap-schedule list / --recursive
>>>> Error ENOENT: Module 'snap_schedule' is not available
>>>>
>>>> I tried restarting the MGR daemons and failing over to a restarted
>>>> one, but nothing changed.
>>>>
>>>> 0|0[root@osd-1 ~]# ceph orch restart mgr
>>>> Scheduled to restart mgr.osd-1 on host 'osd-1'
>>>> Scheduled to restart mgr.osd-2 on host 'osd-2'
>>>> Scheduled to restart mgr.osd-3 on host 'osd-3'
>>>> Scheduled to restart mgr.osd-4.oylrhe on host 'osd-4'
>>>> Scheduled to restart mgr.osd-5.jcfyqe on host 'osd-5'
>>>>
>>>> 0|0[root@osd-1 ~]# ceph orch ps --daemon_type mgr
>>>> NAME              HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>>>> mgr.osd-1         osd-1  *:8443,9283  running (61s)  35s ago    9M      402M        -  17.2.1   e5af760fa1c1  64f7ec70a6aa
>>>> mgr.osd-2         osd-2  *:8443,9283  running (47s)  36s ago    9M      103M        -  17.2.1   e5af760fa1c1  d25fdc793ff8
>>>> mgr.osd-3         osd-3  *:8443,9283  running (7h)   36s ago    9M      457M        -  17.2.1   e5af760fa1c1  46d5091e50d6
>>>> mgr.osd-4.oylrhe  osd-4  *:8443,9283  running (7h)   79s ago    9M      795M        -  17.2.1   e5af760fa1c1  efb2a7cc06c5
>>>> mgr.osd-5.jcfyqe  osd-5  *:8443,9283  running (8h)   37s ago    9M      448M        -  17.2.1   e5af760fa1c1  96dd03817f32
>>>>
>>>> 0|0[root@osd-1 ~]# ceph mgr fail
>>>>
>>>> The MGR confirms that the snap_schedule module is not available:
>>>>
>>>> 0|0[root@osd-1 ~]# journalctl -eu ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxx-1.service
>>>>
>>>> Jul 01 16:25:49 osd-1 bash[662895]: debug 2022-07-01T14:25:49.825+0000 7f0486408700  0 log_channel(audit) log [DBG] : from='client.90801080 -' entity='client.admin' cmd=[{"prefix": "fs snap-schedule list", "path": "/", "recursive": true, "target": ["mon-mgr", ""]}]: dispatch
>>>> Jul 01 16:25:49 osd-1 bash[662895]: debug 2022-07-01T14:25:49.825+0000 7f0486c09700 -1 mgr.server reply reply (2) No such file or directory Module 'snap_schedule' is not available
>>>>
>>>> But I'm not sure where the MGR is actually looking. The module path is:
>>>>
>>>> 0|22[root@osd-1 ~]# ceph config get mgr mgr_module_path
>>>> /usr/share/ceph/mgr
>>>>
>>>> And while it is not available on the host (I assume these are just
>>>> remnants from before our switch to cephadm/docker anyway):
>>>>
>>>> 0|0[root@osd-1 ~]# ll /usr/share/ceph/mgr
>>>> ...
>>>> drwxr-xr-x. 4 root root 144 22. Sep 2021 restful
>>>> drwxr-xr-x. 3 root root  61 22. Sep 2021 selftest
>>>> drwxr-xr-x. 3 root root  61 22. Sep 2021 status
>>>> drwxr-xr-x. 3 root root 117 22. Sep 2021 telegraf
>>>> ...
>>>>
>>>> The module is available in the MGR container (which I assume is where
>>>> the MGR would look):
>>>>
>>>> 0|0[root@osd-1 ~]# docker exec -it ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f-mgr-osd-1 /bin/bash
>>>> [root@osd-1 /]# ls -l /usr/share/ceph/mgr
>>>> ...
>>>> drwxr-xr-x. 4 root root 65 Jun 23 19:48 snap_schedule
>>>> ...
>>>>
>>>> The module was available before on Pacific, which was also deployed
>>>> with cephadm.
>>>> Does anybody have an idea how I can investigate this further?
>>>> Thanks again for all your help!
>>>>
>>>> Best Wishes,
>>>> Mathias

--
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
Mobile: +49 172 3475576

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx