Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 7/8/2020 7:57 PM, Emily Shaffer wrote:
> On Tue, Jul 07, 2020 at 02:21:14PM +0000, Derrick Stolee via GitGitGadget wrote:
>>
>> This is a second attempt at redesigning Git's repository maintenance
>> patterns. The first attempt [1] included a way to run jobs in the background
>> using a long-lived process; that idea was rejected and is not included in
>> this series. A future series will use the OS to handle scheduling tasks.
>>
>> [1] 
>> https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@xxxxxxxxx/
>>
>> As mentioned before, git gc already plays the role of maintaining Git
>> repositories. It has accumulated several smaller pieces in its long history,
>> including:
>>
>>  1. Repacking all reachable objects into one pack-file (and deleting
>>     unreachable objects).
>>  2. Packing refs.
>>  3. Expiring reflogs.
>>  4. Clearing rerere logs.
>>  5. Updating the commit-graph file.
>>
>> While expiring reflogs, clearing rererelogs, and deleting unreachable
>> objects are suitable under the guise of "garbage collection", packing refs
>> and updating the commit-graph file are not as obviously fitting. Further,
>> these operations are "all or nothing" in that they rewrite almost all
>> repository data, which does not perform well at extremely large scales.
>> These operations can also be disruptive to foreground Git commands when git
>> gc --auto triggers during routine use.
>>
>> This series does not intend to change what git gc does, but instead create
>> new choices for automatic maintenance activities, of which git gc remains
>> the only one enabled by default.
>>
>> The new maintenance tasks are:
>>
>>  * 'commit-graph' : write and verify a single layer of an incremental
>>    commit-graph.
>>  * 'loose-objects' : prune packed loose objects, then create a new pack from
>>    a batch of loose objects.
>>  * 'pack-files' : expire redundant packs from the multi-pack-index, then
>>    repack using the multi-pack-index's incremental repack strategy.
>>  * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.
>>
>> These tasks are all disabled by default, but can be enabled with config
>> options or run explicitly using "git maintenance run --task=". There are
>> additional config options to allow customizing the conditions for which the
>> tasks run during the '--auto' option. ('fetch' will never run with the
>> '--auto' option.)
>>
>>  Because 'gc' is implemented as a maintenance task, the most dramatic change
>> of this series is to convert the 'git gc --auto' calls into 'git maintenance
>> run --auto' calls at the end of some Git commands. By default, the only
>> change is that 'git gc --auto' will be run below an additional 'git
>> maintenance' process.
>>
>> The 'git maintenance' builtin has a 'run' subcommand so it can be extended
>> later with subcommands that manage background maintenance, such as 'start',
>> 'stop', 'pause', or 'schedule'. These are not the subject of this series, as
>> it is important to focus on the maintenance activities themselves.
>>
>> An expert user could set up scheduled background maintenance themselves with
>> the current series. I have the following crontab data set up to run
>> maintenance on an hourly basis:
>>
>> 0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log
> 
> One thing I wonder about - now I have to go and make a new crontab
> (which is easy) or Task Scheduler task (which is a pain) for every repo,
> right?
> 
> Is it infeasible to ask for 'git maintenance' to learn something like
> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
> config like "maintenance.targetRepo = /<path-to-repo>"?
> 
>>
>> My config includes all tasks except the 'gc' task. The hourly run is
>> over-aggressive, but is sufficient for testing. I'll replace it with daily
>> when I feel satisfied.
>>
>> Hopefully this direction is seen as a positive one. My goal was to add more
>> options for expert users, along with the flexibility to create background
>> maintenance via the OS in a later series.
>>
>> OUTLINE
>> =======
>>
>> Patches 1-4 remove some references to the_repository in builtin/gc.c before
>> we start depending on code in that builtin.
>>
>> Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
>> simple shim over 'git gc' and replaces calls to 'git gc --auto' from other
>> commands.
> 
> For me, I'd prefer to see 'git maintenance run' get bigger and 'git gc
> --auto' get smaller or disappear. Is there a plan towards that
> direction, or is that out of scope for 'git maintenance run'? Similar
> examples I can think of include 'git annotate' and 'git pickaxe'.

Thanks for these examples of prior work. I'll keep them in mind.

>>  * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
>>    default, but might have different '--auto' conditions and more config
>>    options.
> 
> Like I mentioned above, for me, I'd rather just see the 'gc' builtin go
> away :)

My hope is that we can absolutely do that. I didn't want to start that
exercise yet, as I don't want to disrupt existing workflows more than
I already am.

It is important to recognize that there are already several "tasks" that
run inside 'gc' including:

1. Expiring reflogs.
2. Repacking all reachable objects.
3. Deleting unreachable objects.
4. Packing refs.

Before trying to "remove" the gc builtin, we would want these to be
represented in the 'git maintenance run' as tasks.

In that direction, I realized after submitting that I should rename
the 'pack-files' task in this submission to 'incremental-repack'
instead, allowing a later 'full-repack' task to represent the role
of that step in the 'gc' task. Some users will prefer one over the
other. Perhaps this incremental/full distinction makes it clear that
there are trade-offs in both directions.

Thanks,
-Stolee



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux