Hey Sebastian,

On Thu, Jan 27, 2022 at 6:06 AM Sebastian Mazza <sebastian@xxxxxxxxxxx> wrote:
>
> I have a problem with the snap_schedule MGR module. It seems to forget
> at least parts of the configuration after the active MGR is restarted.
> The following CLI commands (lines starting with ‘$’) and their stdout
> (lines starting with ‘>’) demonstrate the problem.
>
> $ ceph fs snap-schedule add /shares/users 1h 2021-10-31T18:00
> > Schedule set for path /shares/users
>
> $ ceph fs snap-schedule retention add /shares/users 14h10d12m
> > Retention added to path /shares/users
>
> Wait until the next complete hour.
>
> $ ceph fs snap-schedule status /shares/users
> > {"fs": "cephfs", "subvol": null, "path": "/shares/users", "rel_path": "/shares/users", "schedule": "1h", "retention": {"h": 14, "d": 10, "m": 12}, "start": "2021-10-31T18:00:00", "created": "2022-01-26T23:52:03", "first": "2022-01-27T00:00:00", "last": "2022-01-27T00:00:00", "last_pruned": "2022-01-27T00:00:00", "created_count": 1, "pruned_count": 1, "active": true}
>
> Now everything looks and works as expected. However, if I restart the
> active MGR, no new snapshots are created and the status command
> unexpectedly reports null for some of the properties.
>
> $ systemctl restart ceph-mgr@apollon.service
>
> $ ceph fs snap-schedule status /shares/users
> > {"fs": "cephfs", "subvol": null, "path": "/shares/users", "rel_path": "/shares/users", "schedule": "1h", "retention": {}, "start": "2021-10-31T18:00:00", "created": "2022-01-26T23:52:03", "first": null, "last": null, "last_pruned": null, "created_count": 0, "pruned_count": 0, "active": true}

That looks like a bug. Another similar issue is reported here:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/7K4T2HI72NJPB6UWEMZAYEUN4MORBL6O/

Could you please file a tracker here:
https://tracker.ceph.com/projects/cephfs/issues/new

It would help if you could enable debug logging for ceph-mgr, repeat the
steps you mention above, and upload the log in the tracker.
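Something along these lines should do it (a sketch; level 20 is very
verbose, so remember to revert it afterwards):

$ ceph config set mgr debug_mgr 20

Then restart the active MGR, wait past the next full hour, collect the
log (typically under /var/log/ceph/ceph-mgr.<name>.log), and revert:

$ ceph config rm mgr debug_mgr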
> I did look into the source file mgr/snap_schedule/fs/schedule.py. Since
> I have never used Python I do not understand much, but I do understand
> the SQL code that is given.
> Therefore, I saved the SQLite DB dump before and after a MGR restart
> with the following commands:
>
> List the RADOS objects in order to find the SQLite DB dump:
> $ rados --pool fs.metadata-root-pool --namespace cephfs-snap-schedule ls
> > snap_db_v0
>
> Copy the SQLite DB dump into a regular file:
> $ rados --pool fs.metadata-root-pool --namespace cephfs-snap-schedule get snap_db_v0 /tmp/snap_db_v0
>
> To my surprise, the SQLite DB dump never contains the information for
> retention, first, last, and last_pruned.
> The SQLite DB dump always looks like this:
> ————————————————
> BEGIN TRANSACTION;
> CREATE TABLE schedules(
> id INTEGER PRIMARY KEY ASC,
> path TEXT NOT NULL UNIQUE,
> subvol TEXT,
> retention TEXT DEFAULT '{}',
> rel_path TEXT NOT NULL
> );
> INSERT INTO "schedules" VALUES(2,'/shares/groups',NULL,'{}','/shares/groups');
> INSERT INTO "schedules" VALUES(3,'/shares/backup-clients',NULL,'{}','/shares/backup-clients');
> INSERT INTO "schedules" VALUES(4,'/shares/users',NULL,'{}','/shares/users');
> CREATE TABLE schedules_meta(
> id INTEGER PRIMARY KEY ASC,
> schedule_id INT,
> start TEXT NOT NULL,
> first TEXT,
> last TEXT,
> last_pruned TEXT,
> created TEXT NOT NULL,
> repeat INT NOT NULL,
> schedule TEXT NOT NULL,
> created_count INT DEFAULT 0,
> pruned_count INT DEFAULT 0,
> active INT NOT NULL,
> FOREIGN KEY(schedule_id) REFERENCES schedules(id) ON DELETE CASCADE,
> UNIQUE (schedule_id, start, repeat)
> );
> INSERT INTO "schedules_meta" VALUES(2,2,'2021-10-31T18:00:00',NULL,NULL,NULL,'2022-01-21T11:41:35',3600,'1h',0,0,1);
> INSERT INTO "schedules_meta" VALUES(3,3,'2021-10-31T13:30:00',NULL,NULL,NULL,'2022-01-21T11:41:41',21600,'6h',0,0,1);
> INSERT INTO "schedules_meta" VALUES(4,4,'2021-10-31T18:00:00',NULL,NULL,NULL,'2022-01-26T23:52:03',3600,'1h',0,0,1);
> COMMIT;
> ————————————————
>
> Why is the information about retention, first, last, and last_pruned
> not part of the SQLite dump?

I expect it to be a part of the above query. Most likely it's a bug.

> Is this the reason why my snapshot scheduling stops working after the
> active MGR is restarted?
>
> My ceph version is: 16.2.6
>
> Thanks in advance,
> Sebastian

--
Cheers,
Venky
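PS: since snap_db_v0 is a plain SQL text dump, you can also load it into
a local SQLite database and query what actually got persisted. A quick
sketch (assuming the sqlite3 CLI is available; /tmp/snap.db is just a
scratch file, and the table/column names are taken from your dump):

$ sqlite3 /tmp/snap.db < /tmp/snap_db_v0
$ sqlite3 /tmp/snap.db "SELECT path, retention FROM schedules;"
$ sqlite3 /tmp/snap.db "SELECT schedule_id, schedule, first, last, last_pruned FROM schedules_meta;"

If retention is '{}' and first/last/last_pruned are NULL there as well,
then the runtime state is never being written back into the RADOS
object, which would match what you see after a MGR restart.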