[PATCH] pack-objects: introduce --exclude-delta=<pattern> option

From: ZheNing Hu <adlternative@xxxxxxxxx>

The server uses delta compression during git clone to reduce
the amount of data transferred over the network, but delta
compression of large binary blobs often does not reduce their
size significantly and wastes a lot of CPU. Git currently
disables delta compression for objects that meet either of
these conditions:

1. files that have -delta set in .gitattributes
2. files whose size exceeds big_file_threshold

However, for 1, .gitattributes has to be set manually by the user;
in most cases the user does not set it, and it is not something
that can be adjusted on the server side. For 2, big_file_threshold
defaults to 512MB, so many binary files smaller than that are
delta-compressed to no benefit, and this gets worse if the server
raises big_file_threshold.

Therefore, we need a way to skip delta compression of some files
on the server. Introduce the `--exclude-delta=<pattern>` option,
which disables delta compression for objects whose paths match
the pattern.

Signed-off-by: ZheNing Hu <adlternative@xxxxxxxxx>
---
    pack-objects: introduce --exclude-delta= option
    
    While analyzing some repositories using git filter-repo --analyze, I
    noticed that many huge binaries in the repositories were
    delta-compressed without much reduction in size.
    
    $ cat .git/filter-repo/analysis/path-all-sizes.txt | more
    === All paths by reverse accumulated size ===
    Format: unpacked size, packed size, date deleted, path name
        23816778   23765921 2022-08-22 managed/src/universal/ybc/ybc-1.0.0-b1-linux-x86_64.tar.gz
        22504398   22445676 2022-08-22 managed/src/universal/ybc/ybc-1.0.0-b1-el8-aarch64.tar.gz
        11726471    6424233 2022-08-09 managed/yba-installer/yba-installer_linux_amd64
       294644800    5794201            src/yb/master/catalog_manager.cc
         2912780    2872186            docs/static/images/yp/tables-view-ycql.png
         2992192    2634232            docs/static/images/yb-cloud/cloud-clusters-backups.png
         2757095    2501915            docs/static/images/deploy/aws/aws-cf-configure-options.png
    ...
    
    The current ways to avoid delta compression are not well suited to
    git servers. First, files that exceed big_file_threshold are not
    delta-compressed, but the above analysis indicates that many big binary
    files do not exceed big_file_threshold (512MB by default).
    Second, there is no .gitattributes entry to disable delta compression
    for them, and we cannot realistically ask repository administrators to
    add one manually.
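
The two existing rules can be illustrated with a minimal sketch. This is
not Git's actual code: the helper name `skips_delta_today` and the macro
`BIG_FILE_THRESHOLD_DEFAULT` are hypothetical, and the 512MB value is
taken from the text above. It only shows why a binary of roughly 23MB
without a -delta attribute still goes through delta compression.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed default, as stated in the text above: 512MB. */
#define BIG_FILE_THRESHOLD_DEFAULT (512ULL * 1024 * 1024)

/*
 * Hypothetical helper (illustration only): returns true when delta
 * compression would be skipped under the two existing rules, i.e. the
 * path has -delta in .gitattributes or the blob exceeds the threshold.
 */
static bool skips_delta_today(bool has_minus_delta_attr, uint64_t size,
			      uint64_t big_file_threshold)
{
	if (has_minus_delta_attr)
		return true;
	return size > big_file_threshold;
}
```

With the default threshold, the 23816778-byte tarball from the listing
is not skipped unless someone has set -delta for it.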
    
    But we can also see that the large files in these repositories often
    share some characteristics: their names end in ".tar.gz" or ".png". So
    perhaps we can take advantage of this and disable delta compression
    on the server for some common types of binary files.
    
    This is currently implemented as the command-line option
    --exclude-delta=<pattern>. But maybe we can also try passing it through
    git config.
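
The suffix-based idea above can be sketched in isolation. This is an
approximation under stated assumptions, not the patch's implementation:
it uses POSIX fnmatch(3) on the last path component rather than Git's
gitignore-style pattern lists, and the helper name
`path_excluded_from_delta` is hypothetical.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): returns true if the last
 * component of `path` matches any of the `nr` glob patterns. Plain
 * fnmatch(3) is used here; gitignore matching has additional rules
 * (negation, directory-only patterns, anchoring) that are not modeled.
 */
static bool path_excluded_from_delta(const char *path,
				     const char *const *patterns,
				     size_t nr)
{
	const char *slash = strrchr(path, '/');
	const char *base = slash ? slash + 1 : path;
	size_t i;

	for (i = 0; i < nr; i++)
		if (fnmatch(patterns[i], base, 0) == 0)
			return true;
	return false;
}
```

Given the patterns "*.tar.gz" and "*.png", the tarballs and images from
the listing would be excluded, while src/yb/master/catalog_manager.cc
would still be considered for delta compression.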

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1392%2Fadlternative%2Fadl%2Fpack-object-no-try-delta-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1392/adlternative/adl/pack-object-no-try-delta-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1392

 Documentation/git-pack-objects.txt |  6 +++++-
 builtin/pack-objects.c             | 28 +++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index a9995a932ca..92cfee83df5 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,7 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
-	[--cruft] [--cruft-expiration=<time>]
+	[--cruft] [--cruft-expiration=<time>] [--exclude-delta=<pattern>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -221,6 +221,10 @@ depth is 4095.
 	This flag tells the command not to reuse existing deltas
 	but compute them from scratch.
 
+--exclude-delta=<pattern>::
+	Do not attempt delta compression for blobs whose paths match
+	<pattern>. See linkgit:gitignore[5] for pattern details.
+
 --no-reuse-object::
 	This flag tells the command not to reuse existing object data at all,
 	including non deltified object, forcing recompression of everything.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3658c05cafc..ab9cff98e3a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -272,6 +272,8 @@ static struct commit **indexed_commits;
 static unsigned int indexed_commits_nr;
 static unsigned int indexed_commits_alloc;
 
+static struct pattern_list *exclude_delta_patterns;
+
 static void index_commit_for_bitmap(struct commit *commit)
 {
 	if (indexed_commits_nr >= indexed_commits_alloc) {
@@ -1315,13 +1317,20 @@ static void write_pack_file(void)
 static int no_try_delta(const char *path)
 {
 	static struct attr_check *check;
+	int dtype = DT_UNKNOWN;
 
 	if (!check)
 		check = attr_check_initl("delta", NULL);
 	git_check_attr(the_repository->index, path, check);
 	if (ATTR_FALSE(check->items[0].value))
 		return 1;
-	return 0;
+
+	return exclude_delta_patterns &&
+		path_matches_pattern_list(path,
+					  strlen(path),
+					  path, &dtype,
+					  exclude_delta_patterns,
+					  the_repository->index) == MATCHED;
 }
 
 /*
@@ -4149,6 +4158,19 @@ static int option_parse_cruft_expiration(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_exclude_delta(const struct option *opt,
+					 const char *arg, int unset)
+{
+	BUG_ON_OPT_NEG(unset);
+
+	if (!exclude_delta_patterns)
+		exclude_delta_patterns = xcalloc(1, sizeof(*exclude_delta_patterns));
+
+	if (arg)
+		add_pattern(arg, "", 0, exclude_delta_patterns, 0);
+	return 0;
+}
+
 struct po_filter_data {
 	unsigned have_revs:1;
 	struct rev_info revs;
@@ -4242,6 +4264,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
 		  N_("expire cruft objects older than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
+		OPT_CALLBACK_F(0, "exclude-delta", NULL, N_("pattern"),
+		  N_("disable delta compression for files matching pattern"),
+		  PARSE_OPT_NONEG, option_parse_exclude_delta),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4514,6 +4539,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 cleanup:
 	strvec_clear(&rp);
+	FREE_AND_NULL(exclude_delta_patterns);
 
 	return 0;
 }

base-commit: 1fc3c0ad407008c2f71dd9ae1241d8b75f8ef886
-- 
gitgitgadget


