Hello Mathias,
On 06.07.22 18:27, Kuhring, Mathias wrote:
Hey Andreas,
thanks for the info.
We also had our MGR reporting crashes related to the module.
We have a second cluster as mirror which we also updated to Quincy.
But there the MGR is able to use the snap_schedule module (so "ceph fs
snap-schedule status" etc. are not complaining).
And I'm able to schedule snapshots. But we didn't have any schedules
there before the upgrade (due to it being the mirror).
I think in that case there is no RADOS object for the legacy schedule
DB, which is handled gracefully by the code.
I also noticed that this particular part of the code you mentioned
hasn't been touched in a year and a half:
https://github.com/ceph/ceph/blame/ec95624474b1871a821a912b8c3af68f8f8e7aa1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
The relevant change was made 17 months ago but it was not backported to
Pacific and is only included in Quincy.
So I'm wondering if my previous schedule entries somehow became
incompatible with the new version.
The schedule entries are still the same. What changed is that the SQLite
DB they are stored in is no longer kept as a DB dump in a RADOS object
in the FS's metadata pool. Instead, the SQLite Ceph VFS driver is now
used to store the DB directly in the metadata pool.
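Just to illustrate what that means in practice, here is a minimal sketch of
opening such a DB through the Ceph VFS, based on my reading of the
libcephsqlite documentation. The pool, namespace and DB file name below are
only placeholders, the exact URI format should be double-checked against those
docs, and it assumes ceph.conf plus a suitable keyring are discoverable in the
usual way:

import sqlite3

# Register the "ceph" VFS by loading libcephsqlite into a throwaway connection.
bootstrap = sqlite3.connect(':memory:')
bootstrap.enable_load_extension(True)
bootstrap.load_extension('libcephsqlite.so')
bootstrap.enable_load_extension(False)

# Open a DB stored in a RADOS pool/namespace via the Ceph VFS.
# URI format per the libcephsqlite docs: file:///<pool>:<namespace>/<dbname>?vfs=ceph
# (pool, namespace and DB name here are placeholders, not the module's real ones)
db = sqlite3.connect(
    'file:///cephfs.metadata:cephfs-snap-schedule/snap_db.db?vfs=ceph',
    uri=True)
print(db.execute('PRAGMA database_list').fetchall())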
Do you know if there is any way to reset/clean up the module's config /
database?
That is, remove all the previously scheduled snapshots, but without using
"fs snap-schedule remove"?
We only have a handful of schedules, which can easily be recreated.
So maybe a clean start would at least be a workaround.
We could just solve the problem by deleting the legacy schedule DB after
the upgrade:
rados -p <FS metadata pool name> -N cephfs-snap-schedule rm snap_db_v0
Afterwards the MGR has to be restarted or failed over.
The schedules are still there afterwards because they have already been
migrated to the new DB.
Thanks to my colleague Chris Glaubitz for figuring out that the object
is in a separate namespace. :-)
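For completeness, the same cleanup expressed with the librados Python
bindings (just an illustration, not something we actually ran; the pool name
is a placeholder, while the namespace and object name mirror the rados
command above):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # admin credentials assumed
cluster.connect()
try:
    ioctx = cluster.open_ioctx('cephfs.metadata')      # placeholder: FS metadata pool
    try:
        ioctx.set_namespace('cephfs-snap-schedule')    # the separate namespace
        ioctx.remove_object('snap_db_v0')              # the legacy schedule DB dump
    finally:
        ioctx.close()
finally:
    cluster.shutdown()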
Otherwise we will keep simple cron jobs until these issues are fixed.
After all, you just need regularly executed mkdir and rmdir to get you
started.
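Roughly, such a cron job boils down to something like the following sketch
(mount point, snapshot prefix and retention are made up for illustration; it
assumes the FS is mounted at /mnt/cephfs and that snapshots are created and
removed via mkdir/rmdir in the .snap directory):

import os
from datetime import datetime, timezone

SNAP_DIR = '/mnt/cephfs/.snap'   # assumption: CephFS mount point
PREFIX = 'cron-'                 # made-up snapshot name prefix
KEEP = 24                        # made-up retention: keep the 24 newest snapshots

# mkdir in .snap creates a snapshot of the directory above it ...
name = PREFIX + datetime.now(timezone.utc).strftime('%Y-%m-%dT%H-%M')
os.mkdir(os.path.join(SNAP_DIR, name))

# ... and rmdir removes a snapshot again, so pruning is just sorting by name.
snaps = sorted(s for s in os.listdir(SNAP_DIR) if s.startswith(PREFIX))
for old in snaps[:-KEEP]:
    os.rmdir(os.path.join(SNAP_DIR, old))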
Best Wishes,
Mathias
Best regards,
Andreas
On 7/6/2022 5:05 PM, Andreas Teuchert wrote:
Hello Mathias and others,
I also ran into this problem after upgrading from 16.2.9 to 17.2.1.
Additionally I observed a health warning: "3 mgr modules have recently
crashed".
Those are actually two distinct crashes that are already in the tracker:
https://tracker.ceph.com/issues/56269 and
https://tracker.ceph.com/issues/56270
Considering that the crashes are in the snap_schedule module I assume
that they are the reason why the module is not available.
I can reproduce the crash in 56270 by failing over the mgr.
I believe that the faulty code causing the error is this line:
https://github.com/ceph/ceph/blob/v17.2.1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
Instead of ioctx.remove(SNAP_DB_OBJECT_NAME) it should be
ioctx.remove_object(SNAP_DB_OBJECT_NAME).
(According to my understanding of
https://docs.ceph.com/en/latest/rados/api/python/.)
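As a quick sanity check (my own, not taken from the tracker issues), one can
inspect the binding directly, assuming python3-rados is installed:

import rados

print(hasattr(rados.Ioctx, 'remove_object'))  # True: this is the documented call
print(hasattr(rados.Ioctx, 'remove'))         # False on my reading, which would explain the AttributeError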
Best regards,
Andreas
On 01.07.22 18:05, Kuhring, Mathias wrote:
Dear Ceph community,
After upgrading our cluster to Quincy with cephadm (ceph orch upgrade
start --image quay.io/ceph/ceph:v17.2.1), I struggle to re-activate
the snapshot schedule module:
0|0[root@osd-1 ~]# ceph mgr module enable snap_schedule
0|1[root@osd-1 ~]# ceph mgr module ls | grep snap
snap_schedule on
0|0[root@osd-1 ~]# ceph fs snap-schedule list / --recursive
Error ENOENT: Module 'snap_schedule' is not available
I tried restarting the MGR daemons and failing over to a restarted one,
but with no change.
0|0[root@osd-1 ~]# ceph orch restart mgr
Scheduled to restart mgr.osd-1 on host 'osd-1'
Scheduled to restart mgr.osd-2 on host 'osd-2'
Scheduled to restart mgr.osd-3 on host 'osd-3'
Scheduled to restart mgr.osd-4.oylrhe on host 'osd-4'
Scheduled to restart mgr.osd-5.jcfyqe on host 'osd-5'
0|0[root@osd-1 ~]# ceph orch ps --daemon_type mgr
NAME              HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
mgr.osd-1         osd-1  *:8443,9283  running (61s)  35s ago    9M   402M     -        17.2.1   e5af760fa1c1  64f7ec70a6aa
mgr.osd-2         osd-2  *:8443,9283  running (47s)  36s ago    9M   103M     -        17.2.1   e5af760fa1c1  d25fdc793ff8
mgr.osd-3         osd-3  *:8443,9283  running (7h)   36s ago    9M   457M     -        17.2.1   e5af760fa1c1  46d5091e50d6
mgr.osd-4.oylrhe  osd-4  *:8443,9283  running (7h)   79s ago    9M   795M     -        17.2.1   e5af760fa1c1  efb2a7cc06c5
mgr.osd-5.jcfyqe  osd-5  *:8443,9283  running (8h)   37s ago    9M   448M     -        17.2.1   e5af760fa1c1  96dd03817f32
0|0[root@osd-1 ~]# ceph mgr fail
The MGR confirms that the snap_schedule module is not available:
0|0[root@osd-1 ~]# journalctl -eu ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxx-1.service
Jul 01 16:25:49 osd-1 bash[662895]: debug 2022-07-01T14:25:49.825+0000 7f0486408700 0 log_channel(audit) log [DBG] : from='client.90801080 -' entity='client.admin' cmd=[{"prefix": "fs snap-schedule list", "path": "/", "recursive": true, "target": ["mon-mgr", ""]}]: dispatch
Jul 01 16:25:49 osd-1 bash[662895]: debug 2022-07-01T14:25:49.825+0000 7f0486c09700 -1 mgr.server reply reply (2) No such file or directory Module 'snap_schedule' is not available
But I'm not sure where the MGR is actually looking. The module path is:
0|22[root@osd-1 ~]# ceph config get mgr mgr_module_path
/usr/share/ceph/mgr
And while it is not available on the host (I assume these are just
remnants from before our switch to cephadm/Docker anyway):
0|0[root@osd-1 ~]# ll /usr/share/ceph/mgr
...
drwxr-xr-x. 4 root root 144 22. Sep 2021 restful
drwxr-xr-x. 3 root root 61 22. Sep 2021 selftest
drwxr-xr-x. 3 root root 61 22. Sep 2021 status
drwxr-xr-x. 3 root root 117 22. Sep 2021 telegraf
...
The module is available in the MGR container (which I assume is where
the MGR would look):
0|0[root@osd-1 ~]# docker exec -it
ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f-mgr-osd-1 /bin/bash
[root@osd-1 /]# ls -l /usr/share/ceph/mgr
...
drwxr-xr-x. 4 root root 65 Jun 23 19:48 snap_schedule
...
The module was available before on Pacific, which was also deployed with
cephadm.
Does anybody have an idea how I can investigate this further?
Thanks again for all your help!
Best Wishes,
Mathias
--
Andreas Teuchert
Systems Engineer Linux
SysEleven GmbH
Boxhagener Str. 80
10245 Berlin
T +49 30 233 2012 171
F +49 30 616 7555 0
https://www.syseleven.de
https://www.linkedin.com/company/syseleven-gmbh/
https://www.twitter.com/SysEleven
Current system status is always available at:
https://www.syseleven-status.net/
Registered office: Berlin
Register court: AG Berlin Charlottenburg, HRB 108571 Berlin
Managing directors: Marc Korthaus, Jens Ihlenfeld, Andreas Hermann
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx