Hello Mathias,
On 06.07.22 18:27, Kuhring, Mathias wrote:
Hey Andreas,
thanks for the info.
We also had our MGR reporting crashes related to the module.
We have a second cluster as mirror which we also updated to Quincy.
But there the MGR is able to use the snap_schedule module (so "ceph fs
snap-schedule status" etc. are not complaining).
And I'm able to schedule snapshots. But we didn't have any schedules
there before the upgrade (due to it being the mirror).
I think in that case there is no RADOS object for the legacy schedule
DB, which is handled gracefully by the code.
I also noticed that this particular part of the code you mentioned
hasn't been touched in a year and a half:
https://github.com/ceph/ceph/blame/ec95624474b1871a821a912b8c3af68f8f8e7aa1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
The relevant change was made 17 months ago but it was not backported to
Pacific and is only included in Quincy.
So I'm wondering if my previous schedule entries somehow became
incompatible with the new version.
The schedule entries are still the same. What changed is that the SQLite
DB they are stored in is no longer kept as a DB dump in a RADOS object
in the FS's metadata pool. Instead, the SQLite Ceph VFS driver is now
used to store the DB directly in the metadata pool.
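Just to illustrate what that means in practice, here is a minimal sketch of
opening such a DB through the Ceph VFS, based on my reading of the
libcephsqlite documentation. The pool, namespace and DB file name below are
only placeholders, the exact URI format should be double-checked against those
docs, and it assumes ceph.conf plus a suitable keyring are discoverable in the
usual way:

import sqlite3

# Register the "ceph" VFS by loading libcephsqlite into a throwaway connection.
bootstrap = sqlite3.connect(':memory:')
bootstrap.enable_load_extension(True)
bootstrap.load_extension('libcephsqlite.so')
bootstrap.enable_load_extension(False)

# Open a DB stored in a RADOS pool/namespace via the Ceph VFS.
# URI format per the libcephsqlite docs: file:///<pool>:<namespace>/<dbname>?vfs=ceph
# (pool, namespace and DB name here are placeholders, not the module's real ones)
db = sqlite3.connect(
    'file:///cephfs.metadata:cephfs-snap-schedule/snap_db.db?vfs=ceph',
    uri=True)
print(db.execute('PRAGMA database_list').fetchall())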
Do you know if there is any way to reset/clean up the module's config /
database?
That is, remove all the previously scheduled snapshots, but without using
"fs snap-schedule remove"?
We only have a handful of schedules, which can easily be recreated.
So maybe a clean start would at least be a workaround.
We could just solve the problem by deleting the legacy schedule DB after
the upgrade:
rados -p <FS metadata pool name> -N cephfs-snap-schedule rm snap_db_v0
Afterwards the MGR has to be restarted or failed over.
The schedules are still there afterwards because they have already been
migrated to the new DB.
Thanks to my colleague Chris Glaubitz for figuring out that the object
is in a separate namespace. :-)
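For completeness, the same cleanup expressed with the librados Python
bindings (just an illustration, not something we actually ran; the pool name
is a placeholder, while the namespace and object name mirror the rados
command above):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # admin credentials assumed
cluster.connect()
try:
    ioctx = cluster.open_ioctx('cephfs.metadata')      # placeholder: FS metadata pool
    try:
        ioctx.set_namespace('cephfs-snap-schedule')    # the separate namespace
        ioctx.remove_object('snap_db_v0')              # the legacy schedule DB dump
    finally:
        ioctx.close()
finally:
    cluster.shutdown()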
Otherwise we will keep simple cron jobs until these issues are fixed.
After all, you just need regularly executed mkdir and rmdir to get you
started.
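Roughly, such a cron job boils down to something like the following sketch
(mount point, snapshot prefix and retention are made up for illustration; it
assumes the FS is mounted at /mnt/cephfs and that snapshots are created and
removed via mkdir/rmdir in the .snap directory):

import os
from datetime import datetime, timezone

SNAP_DIR = '/mnt/cephfs/.snap'   # assumption: CephFS mount point
PREFIX = 'cron-'                 # made-up snapshot name prefix
KEEP = 24                        # made-up retention: keep the 24 newest snapshots

# mkdir in .snap creates a snapshot of the directory above it ...
name = PREFIX + datetime.now(timezone.utc).strftime('%Y-%m-%dT%H-%M')
os.mkdir(os.path.join(SNAP_DIR, name))

# ... and rmdir removes a snapshot again, so pruning is just sorting by name.
snaps = sorted(s for s in os.listdir(SNAP_DIR) if s.startswith(PREFIX))
for old in snaps[:-KEEP]:
    os.rmdir(os.path.join(SNAP_DIR, old))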
Best Wishes,
Mathias
Best regards,
Andreas
On 7/6/2022 5:05 PM, Andreas Teuchert wrote:
Hello Mathias and others,
I also ran into this problem after upgrading from 16.2.9 to 17.2.1.
Additionally I observed a health warning: "3 mgr modules have recently
crashed".
Those are actually two distinct crashes that are already in the tracker:
https://tracker.ceph.com/issues/56269 and
https://tracker.ceph.com/issues/56270
Considering that the crashes are in the snap_schedule module I assume
that they are the reason why the module is not available.
I can reproduce the crash in 56270 by failing over the mgr.
I believe that the faulty code causing the error is this line:
https://github.com/ceph/ceph/blob/v17.2.1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
Instead of ioctx.remove(SNAP_DB_OBJECT_NAME) it should be
ioctx.remove_object(SNAP_DB_OBJECT_NAME).
(According to my understanding of
https://docs.ceph.com/en/latest/rados/api/python/.)
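As a quick sanity check (my own, not taken from the tracker issues), one can
inspect the binding directly, assuming python3-rados is installed:

import rados

print(hasattr(rados.Ioctx, 'remove_object'))  # True: this is the documented call
print(hasattr(rados.Ioctx, 'remove'))         # False on my reading, which would explain the AttributeError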
Best regards,
Andreas
On 01.07.22 18:05, Kuhring, Mathias wrote:
Dear Ceph community,
After upgrading our cluster to Quincy with cephadm (ceph orch upgrade
start --image quay.io/ceph/ceph:v17.2.1), I struggle to re-activate
the snapshot schedule module:
0|0[root@osd-1 ~]# ceph mgr module enable snap_schedule
0|1[root@osd-1 ~]# ceph mgr module ls | grep snap
snap_schedule on
0|0[root@osd-1 ~]# ceph fs snap-schedule list / --recursive
Error ENOENT: Module 'snap_schedule' is not available
I tried restarting the MGR daemons and failing over to a restarted one,
but with no change.
0|0[root@osd-1 ~]# ceph orch restart mgr
Scheduled to restart mgr.osd-1 on host 'osd-1'
Scheduled to restart mgr.osd-2 on host 'osd-2'
Scheduled to restart mgr.osd-3 on host 'osd-3'
Scheduled to restart mgr.osd-4.oylrhe on host 'osd-4'
Scheduled to restart mgr.osd-5.jcfyqe on host 'osd-5'
0|0[root@osd-1 ~]# ceph orch ps --daemon_type mgr
NAME              HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
mgr.osd-1         osd-1  *:8443,9283  running (61s)  35s ago    9M   402M     -        17.2.1   e5af760fa1c1  64f7ec70a6aa
mgr.osd-2         osd-2  *:8443,9283  running (47s)  36s ago    9M   103M     -        17.2.1   e5af760fa1c1  d25fdc793ff8
mgr.osd-3         osd-3  *:8443,9283  running (7h)   36s ago    9M   457M     -        17.2.1   e5af760fa1c1  46d5091e50d6
mgr.osd-4.oylrhe  osd-4  *:8443,9283  running (7h)   79s ago    9M   795M     -        17.2.1   e5af760fa1c1  efb2a7cc06c5
mgr.osd-5.jcfyqe  osd-5  *:8443,9283  running (8h)   37s ago    9M   448M     -        17.2.1   e5af760fa1c1  96dd03817f32
0|0[root@osd-1 ~]# ceph mgr fail
The MGR confirms that the snap_schedule module is not available:
0|0[root@osd-1 ~]# journalctl -eu ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxx-1.service
Jul 01 16:25:49 osd-1 bash[662895]: debug 2022-07-01T14:25:49.825+0000 7f0486408700 0 log_channel(audit) log [DBG] : from='client.90801080 -' entity='client.admin' cmd=[{"prefix": "fs snap-schedule list", "path": "/", "recursive": true, "target": ["mon-mgr", ""]}]: dispatch
Jul 01 16:25:49 osd-1 bash[662895]: debug 2022-07-01T14:25:49.825+0000 7f0486c09700 -1 mgr.server reply reply (2) No such file or directory Module 'snap_schedule' is not available
But I'm not sure where the MGR is actually looking. The module path is:
0|22[root@osd-1 ~]# ceph config get mgr mgr_module_path
/usr/share/ceph/mgr
And while it is not available on the host (I assume these are just
remnants from before our switch to cephadm/Docker anyway):
0|0[root@osd-1 ~]# ll /usr/share/ceph/mgr
...
drwxr-xr-x. 4 root root 144 22. Sep 2021 restful
drwxr-xr-x. 3 root root 61 22. Sep 2021 selftest
drwxr-xr-x. 3 root root 61 22. Sep 2021 status
drwxr-xr-x. 3 root root 117 22. Sep 2021 telegraf
...
The module is available in the MGR container (which I assume is where
the MGR would look):
0|0[root@osd-1 ~]# docker exec -it
ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f-mgr-osd-1 /bin/bash
[root@osd-1 /]# ls -l /usr/share/ceph/mgr
...
drwxr-xr-x. 4 root root 65 Jun 23 19:48 snap_schedule
...
The module was available before on Pacific, which was also deployed with
cephadm.
Does anybody have an idea how I can investigate this further?
Thanks again for all your help!
Best Wishes,
Mathias
--
Andreas Teuchert
Systems Engineer Linux
SysEleven GmbH
Boxhagener Str. 80
10245 Berlin
T +49 30 233 2012 171
F +49 30 616 7555 0
https://www.syseleven.de
https://www.linkedin.com/company/syseleven-gmbh/
https://www.twitter.com/SysEleven
Current system status is always available at:
https://www.syseleven-status.net/
Registered office: Berlin
Register court: AG Berlin Charlottenburg, HRB 108571 Berlin
Managing directors: Marc Korthaus, Jens Ihlenfeld, Andreas Hermann
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx