On 25/03/2024 12:27, Max Gautier wrote:
On Mon, Mar 25, 2024 at 10:06:29AM +0000, phillip.wood123@xxxxxxxxx wrote:
Note that even if we really need more specific OnCalendar= settings for
each timer, we should still do it that way, but instead distribute an
override alongside the template, for instance for daily:
/usr/lib/systemd/user/git-maintenance@daily.timer.d/override.conf:
[Timer]
OnCalendar=<daily specific calendar spec>
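For context, the template timer being overridden might look something like
this (a sketch only; the Description= and the default OnCalendar= value are
assumptions, not necessarily what the series ships):

```
# /usr/lib/systemd/user/git-maintenance@.timer (hypothetical sketch)
[Unit]
Description=Run git maintenance (%i schedule)

[Timer]
# Generic default; a per-instance drop-in like the one above overrides it.
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
```

Instantiating it as git-maintenance@daily.timer would then pick up the
drop-in from git-maintenance@daily.timer.d/ automatically.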
We should definitely do that. Using systemd's random delay does not prevent
the different maintenance jobs from running concurrently, as one job may be
started before a previous job has finished. It is important to only have one
job running at a time because the first thing "git maintenance run" does is
try to acquire a lock file, so if the hourly job is still running when the
daily job tries to start, the daily job will not be run.
Thinking about that, it occurs to me that the current scheme does not
prevent concurrent execution either: the timers all use Persistent=true,
which means they can fire concurrently on machine boot, if two or more
would have been triggered during the time the machine was powered off
(or just the user logged out, since it's a user unit).
Interesting, I wonder if the other schedulers suffer from the same problem.
From what I can find (didn't dig much):
Thanks for looking at this
- cron does not have the problem, because it will just miss the timers
if the machine was powered off. Not really better ^
Yes, skipping the jobs is not great. On Debian at least the job will be
run if it is less than three hours since it should have been run. See
https://manpages.debian.org/bookworm/cron/cron.8.en.html
- anacron, though, is another implementation of cron which apparently
supports those semantics and is the default on ubuntu [1]
I can't find whether there is something to avoid the same problem that
Persistent=true implies
- same goes for launchctl (Effect of Sleeping and Powering Off at the
bottom of the page [2])
As I read it the job is rescheduled if the computer was asleep when it
should have run, but not if it was powered off.
- for schtasks it's apparently possible to have a similar mechanism to
Persistent=true [3]. There is apparently a policy to handle multiple
instances [4], but I'm not completely sure whether or not these
instances can have different parameters.
It's currently defined that way for the schtasks scheduler:
"<MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\n". I
don't think it would prevent parallel execution between the different
schedules though, as it seems to me they are different tasks.
[1]: https://serverfault.com/a/52338
[2]: https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/ScheduledJobs.html
[3]: https://learn.microsoft.com/en-us/troubleshoot/windows-server/system-management-components/scheduled-task-not-run-upon-reboot-machine-off
[4]: https://learn.microsoft.com/en-us/windows/win32/taskschd/tasksettings-multipleinstances
We should have a think about what to do about this once your patches to
move to system-wide unit files are merged. We'll need to come up with
something that works for all the schedulers so that we don't miss the
daily and weekly jobs when the computer is powered off and ensures we
don't run concurrent jobs.
Best Wishes
Phillip
So maybe there should be a more robust mechanism to avoid concurrent
execution? I assume from what you say above that the lock is acquired in a
non-blocking way. Could moving to a blocking one be a solution?
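The non-blocking vs. blocking distinction can be illustrated with flock(1);
this is only a sketch of the two locking modes, not what git does internally,
and the lock path here is made up:

```shell
#!/bin/sh
# Illustration only: serializing hypothetical maintenance jobs with flock(1).
LOCK="${XDG_RUNTIME_DIR:-/tmp}/git-maintenance.lock"

# Non-blocking, like "git maintenance run" today: if another job holds
# the lock, fail immediately instead of queueing up behind it.
if flock -n "$LOCK" -c 'echo running'; then
    echo "lock acquired"
else
    echo "skipped: lock held elsewhere"
fi

# Blocking, but with a timeout, so a crashed holder cannot build up an
# endless queue of waiters.
flock -w 60 "$LOCK" -c 'echo running' || echo "gave up after 60s"
```

A timeout on the blocking variant is the key point: plain blocking
acquisition has exactly the endless-queue problem described above.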
It is possible to wait on a lock file but I'd be worried about building up
an endless queue of processes if the process holding the lock file crashed,
leaving it in place without any way to automatically remove it.
At least with systemd we have some mechanisms to deal with that.
- systemd timers don't touch an already running unit, so that won't
trigger a new hourly or daily if the previous one is still running.
- for the automatically removing it, we could:
- use XDG_RUNTIME_DIR ("%t" in systemd units) which is removed on
logout
- optionally add a tmpfiles fragment to delete locks which are really
too old (tmpfiles won't delete files on which a lock is held)
- I thought about using a common RuntimeDirectory (see systemd.exec),
but this is not possible due to [5]
[5]: https://github.com/systemd/systemd/issues/5394
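Concretely, the first two points could look something like this (hypothetical
fragments; the lock file name, the environment variable, and the exact
fragment paths are all assumptions):

```
# Drop-in for the service: keep the lock under the user runtime
# directory, which systemd removes on logout. "%t" expands to
# $XDG_RUNTIME_DIR for user units.
# ~/.config/systemd/user/git-maintenance@.service.d/lock.conf
[Service]
Environment=GIT_MAINTENANCE_LOCK=%t/git-maintenance.lock

# user-tmpfiles.d fragment: age out stale lock files after a day.
# tmpfiles skips files on which a BSD lock is currently held, so a
# running job's lock is safe.
# /usr/lib/user-tmpfiles.d/git-maintenance.conf
e %t/git-maintenance.lock - - - 1d
```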
I don't think we need to solve that problem as part of this patch series but
we should take care not to make it worse. Long term we may be better off
scheduling a single job and having "git maintenance run" decide which tasks to
run based on the last time it ran, rather than trying to schedule different
jobs with the OS scheduler.
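That single-job approach could be sketched as one hourly timer whose service
lets git pick what is due (hypothetical unit names, and note that the
"--auto-schedule" option is made up here to stand in for the future
decide-from-last-run logic; no such option exists today):

```
# git-maintenance.timer (hypothetical, replacing the three schedules)
[Timer]
OnCalendar=hourly
Persistent=true

# git-maintenance.service (hypothetical)
[Service]
Type=oneshot
# git itself would compare "now" against the recorded time of the last
# run and choose the hourly/daily/weekly task set accordingly.
ExecStart=/usr/bin/git maintenance run --auto-schedule
```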
The daily job is a superset of the hourly job and the weekly job is a
superset of the daily job, so it does not make sense to run more than one
job per hour.
Is that set in stone, or could they perform disjoint sets of tasks
instead?
All of the schedulers are set up to run a single job each hour; I don't see
why we'd start running disjoint sets of tasks in the different jobs.
I was wondering if running distinct tasks would allow overlapping
execution, or if the different tasks are not safe to run concurrently
anyway. I'm not familiar enough with them and the git internals to tell.
Another option, if the task sets were distinct for each service instance,
would be to use dependencies and ordering directives like this:
weekly.service
```
[Unit]
Requires=daily.service
After=daily.service
[Service]
ExecStart=<run only weekly stuff>
```
daily.service
```
[Unit]
Requires=hourly.service
After=hourly.service
[Service]
ExecStart=<run only daily stuff>
```
hourly.service
```
[Service]
ExecStart=<run only hourly stuff>
```
That would avoid concurrent execution I think.