[PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization

This is a second attempt at redesigning Git's repository maintenance
patterns. The first attempt [1] included a way to run jobs in the background
using a long-lived process; that idea was rejected and is not included in
this series. A future series will use the OS to handle scheduling tasks.

[1] https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@xxxxxxxxx/

As mentioned before, git gc already plays the role of maintaining Git
repositories. It has accumulated several smaller pieces in its long history,
including:

 1. Repacking all reachable objects into one pack-file (and deleting
    unreachable objects).
 2. Packing refs.
 3. Expiring reflogs.
 4. Clearing rerere logs.
 5. Updating the commit-graph file.

While expiring reflogs, clearing rerere logs, and deleting unreachable
objects are suitable under the guise of "garbage collection", packing refs
and updating the commit-graph file are not as obviously fitting. Further,
these operations are "all or nothing" in that they rewrite almost all
repository data, which does not perform well at extremely large scales.
These operations can also be disruptive to foreground Git commands when git
gc --auto triggers during routine use.

This series does not intend to change what git gc does. Instead, it creates
new choices for automatic maintenance activities, of which git gc remains
the only one enabled by default.

The new maintenance tasks are:

 * 'commit-graph' : write and verify a single layer of an incremental
   commit-graph.
 * 'loose-objects' : prune packed loose objects, then create a new pack from
   a batch of loose objects.
 * 'pack-files' : expire redundant packs from the multi-pack-index, then
   repack using the multi-pack-index's incremental repack strategy.
 * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.

These tasks are all disabled by default, but can be enabled with config
options or run explicitly using "git maintenance run --task=". There are
additional config options to customize the conditions under which the
tasks run with the '--auto' option. ('fetch' will never run with the
'--auto' option.)
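
As a rough sketch of what opting in to one task might look like, assuming
the maintenance.<task>.enabled config key that a later patch in this series
creates, and using 'commit-graph' as the example task. The scratch
repository is made up for illustration only:

```shell
#!/bin/sh
# Sketch only: opt in to a single maintenance task in a scratch repository.
# The key follows the maintenance.<task>.enabled pattern from this series;
# the temporary repository is hypothetical and exists only for illustration.
repo=$(mktemp -d)
git init --quiet "$repo"

# Enable the 'commit-graph' task; 'gc' is left at its default.
git -C "$repo" config maintenance.commit-graph.enabled true

# Read the value back to confirm what was stored.
enabled=$(git -C "$repo" config --get --bool maintenance.commit-graph.enabled)
echo "commit-graph enabled: $enabled"

# With this series applied, the task could then be run explicitly:
#   git -C "$repo" maintenance run --task=commit-graph
```

The explicit run is left as a comment because it only works with a Git
build that contains this series.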

Because 'gc' is implemented as a maintenance task, the most dramatic change
of this series is to convert the 'git gc --auto' calls into 'git maintenance
run --auto' calls at the end of some Git commands. By default, the only
change is that 'git gc --auto' will run as a child of an additional 'git
maintenance' process.

The 'git maintenance' builtin has a 'run' subcommand so it can be extended
later with subcommands that manage background maintenance, such as 'start',
'stop', 'pause', or 'schedule'. These are not the subject of this series, as
it is important to focus on the maintenance activities themselves.

An expert user could set up scheduled background maintenance themselves with
the current series. I have the following crontab data set up to run
maintenance on an hourly basis:

0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log

My config includes all tasks except the 'gc' task. The hourly run is
over-aggressive, but is sufficient for testing. I'll switch to a daily run
once I am satisfied.
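
One way such a config might be expressed, as a sketch only: it assumes the
maintenance.<task>.enabled key from this series and the task names listed
above, and it uses a throwaway repository rather than my real setup:

```shell
#!/bin/sh
# Sketch only: enable every new task and disable 'gc' in a scratch repo.
# Task names are the ones listed in this cover letter; the repository
# path is a hypothetical temporary directory.
repo=$(mktemp -d)
git init --quiet "$repo"

for task in commit-graph loose-objects pack-files fetch
do
	git -C "$repo" config "maintenance.$task.enabled" true
done
git -C "$repo" config maintenance.gc.enabled false

# List the resulting maintenance settings, one "key value" pair per line.
settings=$(git -C "$repo" config --get-regexp '^maintenance\.')
echo "$settings"
```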

Hopefully this direction is seen as a positive one. My goal was to add more
options for expert users, along with the flexibility to create background
maintenance via the OS in a later series.

OUTLINE
=======

Patches 1-4 remove some references to the_repository in builtin/gc.c before
we start depending on code in that builtin.

Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
simple shim over 'git gc' and replace calls to 'git gc --auto' from other
commands.

Patches 8-15 create new maintenance tasks. These are the same tasks sent in
the previous RFC.

Patches 16-21 create more customization through config and perform other
polish items.

FUTURE WORK
===========

 * Add 'start', 'stop', and 'schedule' subcommands to initialize the
   commands run in the background.
 * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
   default, but might have different '--auto' conditions and more config
   options.
 * Replace config like 'gc.writeCommitGraph' and 'fetch.writeCommitGraph'
   with use of the 'commit-graph' task.

Thanks, -Stolee

Derrick Stolee (21):
  gc: use the_repository less often
  gc: use repository in too_many_loose_objects()
  gc: use repo config
  gc: drop the_repository in log location
  maintenance: create basic maintenance runner
  maintenance: add --quiet option
  maintenance: replace run_auto_gc()
  maintenance: initialize task array and hashmap
  maintenance: add commit-graph task
  maintenance: add --task option
  maintenance: take a lock on the objects directory
  maintenance: add fetch task
  maintenance: add loose-objects task
  maintenance: add pack-files task
  maintenance: auto-size pack-files batch
  maintenance: create maintenance.<task>.enabled config
  maintenance: use pointers to check --auto
  maintenance: add auto condition for commit-graph task
  maintenance: create auto condition for loose-objects
  maintenance: add pack-files auto condition
  midx: use start_delayed_progress()

 .gitignore                           |   1 +
 Documentation/config.txt             |   2 +
 Documentation/config/maintenance.txt |  32 +
 Documentation/fetch-options.txt      |   5 +-
 Documentation/git-clone.txt          |   7 +-
 Documentation/git-maintenance.txt    | 124 ++++
 builtin.h                            |   1 +
 builtin/am.c                         |   2 +-
 builtin/commit.c                     |   2 +-
 builtin/fetch.c                      |   6 +-
 builtin/gc.c                         | 881 +++++++++++++++++++++++++--
 builtin/merge.c                      |   2 +-
 builtin/rebase.c                     |   4 +-
 commit-graph.c                       |   8 +-
 commit-graph.h                       |   1 +
 config.c                             |  24 +-
 config.h                             |   2 +
 git.c                                |   1 +
 midx.c                               |  12 +-
 midx.h                               |   1 +
 object.h                             |   1 +
 run-command.c                        |   7 +-
 run-command.h                        |   2 +-
 t/t5319-multi-pack-index.sh          |  14 +-
 t/t5510-fetch.sh                     |   2 +-
 t/t5514-fetch-multiple.sh            |   2 +-
 t/t7900-maintenance.sh               | 211 +++++++
 27 files changed, 1265 insertions(+), 92 deletions(-)
 create mode 100644 Documentation/config/maintenance.txt
 create mode 100644 Documentation/git-maintenance.txt
 create mode 100755 t/t7900-maintenance.sh


base-commit: 4a0fcf9f760c9774be77f51e1e88a7499b53d2e2
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-671%2Fderrickstolee%2Fmaintenance%2Fgc-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-671/derrickstolee/maintenance/gc-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/671
-- 
gitgitgadget


