Re: [PATCH 05/15] run-job: implement pack-files job

On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> 
> The previous change cleaned up loose objects using the
> 'loose-objects' job, which can be run safely in the background.
> Add a similar job that performs similar cleanups for pack-files.
> 
> One issue with running 'git repack' is that it is designed to
> repack all pack-files into a single pack-file. While this is the
> most space-efficient way to store object data, it is not time or
> memory efficient. This becomes extremely important if the repo is
> so large that a user struggles to store two copies of the pack on
> their disk.
> 
> Instead, perform an "incremental" repack by collecting a few small
> pack-files into a new pack-file. The multi-pack-index facilitates
> this process ever since 'git multi-pack-index expire' was added in
> 19575c7 (multi-pack-index: implement 'expire' subcommand,
> 2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
> (midx: implement midx_repack(), 2019-06-10).
> 
> The 'pack-files' job runs the following steps:
> 
> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>    one did not exist, and otherwise will update the multi-pack-index
>    with any new pack-files that appeared since the last write. This
>    is particularly relevant with the background fetch job.
> 
>    When the multi-pack-index sees two copies of the same object, it
>    stores the offset data into the newer pack-file. This means that
>    some old pack-files could become "unreferenced" which I will use
>    to mean "a pack-file that is in the pack-file list of the
>    multi-pack-index but none of the objects in the multi-pack-index
>    reference a location inside that pack-file."
> 
> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>    and updaes the multi-pack-index to drop those pack-files from the

Typo: updaes -> updates


>    list. This is safe to do as concurrent Git processes will see the
>    multi-pack-index and not open those packs when looking for object
>    contents. (Similar to the 'loose-objects' job, there are some Git

Is it still safe for concurrent processes if the repo did not have a
multi-pack-index when the first process started?


>    commands that open pack-files regardless of the multi-pack-index,
>    but they are rarely used. Further, a user that self-selects to
>    use background operations would likely refrain from using those
>    commands.)
> 
> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set

Typo: bacth-size -> batch-size


>    of pack-files that are listed in the multi-pack-index and creates
>    a new pack-file containing the objects whose offsets are listed
>    by the multi-pack-index to be in those pack-files. The set of
>    pack-files is selected greedily by sorting the pack-files by modified
>    time and adding a pack-file to the set if its "expected size" is
>    smaller than the batch size until the total expected size of the
>    selected pack-files is at least the batch size. The "expected
>    size" is calculated by taking the size of the pack-file divided
>    by the number of objects in the pack-file and multiplied by the
>    number of objects from the multi-pack-index with offset in that
>    pack-file. The expected size approximats how much data from that

Typo: approximats -> approximates


>    pack-file will contribute to the resulting pack-file size. The
>    intention is that the resulting pack-file will be close in size
>    to the provided batch size.
> 
>    The next run of the pack-files job will delete these repacked
>    pack-files during the 'expire' step.
> 
>    In this version, the batch size is set to "0" which ignores the
>    size restrictions when selecting the pack-files. It instead
>    selects all pack-files and repacks all packed objects into a
>    single pack-file. This will be updated in the next change, but
>    it requires doing some calculations that are better isolated to
>    a separate change.
> 
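
To check my understanding of the selection in step 3, here is a minimal
sketch in C of the "expected size" estimate and the greedy pick as the
message describes them. It is not the code in midx.c: struct pack_summary,
expected_size() and select_batch() are names I made up for illustration,
the input is assumed to be sorted by modified time already, and a batch
size of 0 is treated as "select everything" per the last paragraph of
step 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pack summary; not a real Git struct. */
struct pack_summary {
	uint64_t pack_size;   /* size of the pack-file in bytes */
	uint32_t num_objects; /* objects stored in the pack-file */
	uint32_t referenced;  /* objects the multi-pack-index locates in this pack */
};

/*
 * size / num_objects * referenced, in the order the message gives.
 * Integer division rounds down, so this is only an approximation.
 */
static uint64_t expected_size(const struct pack_summary *p)
{
	assert(p->referenced <= p->num_objects);
	if (!p->num_objects)
		return 0;
	return p->pack_size / p->num_objects * p->referenced;
}

/*
 * packs[] must already be sorted by modified time. Sets include[i] to 1
 * for each selected pack and returns the total expected size selected.
 */
static uint64_t select_batch(const struct pack_summary *packs, size_t nr,
			     uint64_t batch_size, unsigned char *include)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		include[i] = 0;

	if (!batch_size) {
		/* Special case from the message: ignore sizes, take all packs. */
		for (i = 0; i < nr; i++) {
			include[i] = 1;
			total += expected_size(&packs[i]);
		}
		return total;
	}

	/* Stop as soon as the running total reaches the batch size. */
	for (i = 0; i < nr && total < batch_size; i++) {
		uint64_t e = expected_size(&packs[i]);

		if (e >= batch_size)
			continue; /* too large to join this batch */
		include[i] = 1;
		total += e;
	}
	return total;
}
```

With a batch size of 500 and packs whose expected sizes are 1000, 200,
300 and 100 (in mtime order), this picks the second and third packs: the
first is too large on its own, and the loop stops once the total reaches
500.
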
> Each of the above steps updates the multi-pack-index file. After
> each step, we verify the new multi-pack-index. If the new
> multi-pack-index is corrupt, then delete the multi-pack-index,
> rewrite it from scratch, and stop doing the later steps of the
> job. This is intended to be an extra-safe check without leaving
> a repo with many pack-files without a multi-pack-index.
> 
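
The flow in this paragraph, as I read it, sketched in C so the early
exits are explicit. The names (step_fn, run_verified_steps() and the
fake_* stand-ins) are hypothetical and nothing here calls Git; the return
convention (0 on success) is assumed to match how the patch uses
run_command_v_opt().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical step signature: returns 0 on success. */
typedef int (*step_fn)(void);

/*
 * Run each step, then verify. A failed step aborts with 1. A failed
 * verify triggers a rewrite from scratch and skips the remaining steps;
 * the result of the rewrite becomes the return value.
 */
static int run_verified_steps(const step_fn *steps, size_t nr,
			      step_fn verify, step_fn rewrite)
{
	size_t i;

	assert(verify && rewrite);

	for (i = 0; i < nr; i++) {
		if (steps[i]()) {
			fprintf(stderr, "step %zu failed\n", i);
			return 1;
		}
		if (verify()) {
			fprintf(stderr, "verify failed after step %zu\n", i);
			return rewrite();
		}
	}
	return 0;
}

/* Stand-ins for trying the flow without a repository. */
static int steps_run, verifies_run, rewrites_run;
static int corrupt_after = -1; /* verify fails once this many steps ran */

static int fake_step(void)
{
	steps_run++;
	return 0;
}

static int fake_verify(void)
{
	verifies_run++;
	return steps_run == corrupt_after;
}

static int fake_rewrite(void)
{
	rewrites_run++;
	return 0;
}
```

Passing three fake_step entries with corrupt_after set to 2 runs two
steps, rewrites once, and never reaches the third step.
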
> These steps are based on a similar background maintenance step in
> Scalar (and VFS for Git) [1]. This was incredibly effective for
> users of the Windows OS repository. After using the same VFS for Git
> repository for over a year, some users had _thousands_ of pack-files
> that combined for up to 250 GB of data. We noticed a few users were
> running into the open file descriptor limits (due in part to a bug
> in the multi-pack-index fixed by af96fe3392 (midx: add packs to
> packed_git linked list, 2019-04-29)).
> 
> These pack-files were mostly small since they contained the commits
> and trees that were pushed to the origin in a given hour. The GVFS
> protocol includes a "prefetch" step that asks for pre-computed pack-
> files containing commits and trees by timestamp. These pack-files
> were grouped into "daily" pack-files once a day for up to 30 days.
> If a user did not request prefetch packs for over 30 days, then they
> would get the entire history of commits and trees in a new, large
> pack-file. This led to a large number of pack-files that had poor
> delta compression.
> 
> By running this pack-file maintenance step once per day, these repos
> with thousands of packs spanning 200+ GB dropped to dozens of pack-
> files spanning 30-50 GB. This was all done without removing objects
> from the system and using a constant batch size of two gigabytes.
> Once the work was done to reduce the pack-files to small sizes, the
> batch size of two gigabytes means that not every run triggers a
> repack operation, so the following run will not expire a pack-file.
> This has kept these repos in a "clean" state.
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs
> 
> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> ---
>  Documentation/git-run-job.txt | 18 ++++++-
>  builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++++-
>  midx.c                        |  2 +-
>  midx.h                        |  1 +
>  t/t7900-run-job.sh            | 39 +++++++++++++++
>  5 files changed, 147 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 43ca1160b5a..108ed25b8bd 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job (commit-graph|fetch|loose-objects)'
> +'git run-job (commit-graph|fetch|loose-objects|pack-files)'
>  
>  
>  DESCRIPTION
> @@ -71,6 +71,22 @@ a batch of loose objects. The batch size is limited to 50 thousand
>  objects to prevent the job from taking too long on a repository with
>  many loose objects.
>  
> +'pack-files'::
> +
> +The `pack-files` job incrementally repacks the object directory using
> +the `multi-pack-index` feature. In order to prevent race conditions with
> +concurrent Git commands, it follows a two-step process. First, it
> +deletes any pack-files included in the `multi-pack-index` where none of
> +the objects in the `multi-pack-index` reference those pack-files; this
> +only happens if all objects in the pack-file are also stored in a newer
> +pack-file. Second, it selects a group of pack-files whose "expected
> +size" is below the batch size until the group has total expected size at
> +least the batch size; see the `--batch-size` option for the `repack`
> +subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
> +zero, which is a special case that attempts to repack all pack-files
> +into a single pack-file.
> +
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index cecf9058c51..d3543f7ccb9 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,13 +1,14 @@
>  #include "builtin.h"
>  #include "config.h"
>  #include "commit-graph.h"
> +#include "midx.h"
>  #include "object-store.h"
>  #include "parse-options.h"
>  #include "repository.h"
>  #include "run-command.h"
>  
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job (commit-graph|fetch|loose-objects)"),
> +	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
>  	NULL
>  };
>  
> @@ -238,6 +239,91 @@ static int run_loose_objects_job(void)
>  	return prune_packed() || pack_loose();
>  }
>  
> +static int multi_pack_index_write(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "write",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int rewrite_multi_pack_index(void)
> +{
> +	char *midx_name = get_midx_filename(the_repository->objects->odb->path);
> +
> +	unlink(midx_name);
> +	free(midx_name);
> +
> +	if (multi_pack_index_write()) {
> +		error(_("failed to rewrite multi-pack-index"));
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int multi_pack_index_verify(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "verify",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_expire(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "expire",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_repack(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "repack",
> +			 "--no-progress", "--batch-size=0", NULL);
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	if (result && multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after repack"));
> +		result = rewrite_multi_pack_index();
> +	}
> +
> +	return result;
> +}
> +
> +static int run_pack_files_job(void)
> +{
> +	if (multi_pack_index_write()) {
> +		error(_("failed to write multi-pack-index"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after initial write"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_expire()) {
> +		error(_("multi-pack-index expire failed"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after expire"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_repack()) {
> +		error(_("multi-pack-index repack failed"));
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -261,6 +347,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  			return run_fetch_job();
>  		if (!strcmp(argv[0], "loose-objects"))
>  			return run_loose_objects_job();
> +		if (!strcmp(argv[0], "pack-files"))
> +			return run_pack_files_job();
>  	}
>  
>  	usage_with_options(builtin_run_job_usage,
> diff --git a/midx.c b/midx.c
> index 1527e464a7b..0f0d0a38812 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -36,7 +36,7 @@
>  
>  #define PACK_EXPIRED UINT_MAX
>  
> -static char *get_midx_filename(const char *object_dir)
> +char *get_midx_filename(const char *object_dir)
>  {
>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>  }
> diff --git a/midx.h b/midx.h
> index e6fa356b5ca..cf2c09dffc2 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -39,6 +39,7 @@ struct multi_pack_index {
>  
>  #define MIDX_PROGRESS     (1 << 0)
>  
> +char *get_midx_filename(const char *object_dir);
>  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
>  int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
>  int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 41da083257b..416ba04989d 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -6,6 +6,7 @@ Testing the background jobs, in the foreground
>  '
>  
>  GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0
>  
>  . ./test-lib.sh
>  
> @@ -93,4 +94,42 @@ test_expect_success 'loose-objects job' '
>  	test_cmp packs-between packs-after
>  '
>  
> +test_expect_success 'pack-files job' '
> +	packDir=.git/objects/pack &&
> +
> +	# Create three disjoint pack-files with size BIG, small, small.
> +
> +	echo HEAD~2 | git -C client pack-objects --revs $packDir/test-1 &&
> +
> +	test_tick &&
> +	git -C client pack-objects --revs $packDir/test-2 <<-\EOF &&
> +	HEAD~1
> +	^HEAD~2
> +	EOF
> +
> +	test_tick &&
> +	git -C client pack-objects --revs $packDir/test-3 <<-\EOF &&
> +	HEAD
> +	^HEAD~1
> +	EOF
> +
> +	rm -f client/$packDir/pack-* &&
> +	rm -f client/$packDir/loose-* &&
> +
> +	ls client/$packDir/*.pack >packs-before &&
> +	test_line_count = 3 packs-before &&
> +
> +	# the job repacks the two into a new pack, but does not
> +	# delete the old ones.
> +	git -C client run-job pack-files &&
> +	ls client/$packDir/*.pack >packs-between &&
> +	test_line_count = 4 packs-between &&
> +
> +	# the job deletes the two old packs, and does not write
> +	# a new one because only one pack remains.
> +	git -C client run-job pack-files &&
> +	ls client/.git/objects/pack/*.pack >packs-after &&
> +	test_line_count = 1 packs-after
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 


