From: ZheNing Hu <adlternative@xxxxxxxxx>

The server uses delta compression during git clone to reduce the amount
of data transferred over the network, but delta compression of large
binary blobs often does not reduce their size significantly and wastes
a lot of CPU.

Git currently disables delta compression for objects that meet either
of these conditions:

1. files that have -delta set in .gitattributes
2. files whose size exceeds big_file_threshold

However, for 1, .gitattributes has to be set manually by the user; in
most cases users do not set it, and it cannot be adjusted from the
server side. For 2, big_file_threshold defaults to 512MB, so many
binary files smaller than that are delta-compressed to no benefit, and
this gets worse if the server raises big_file_threshold.

Therefore, we need a way to skip delta compression for some files on
the server.

Introduce the `--exclude-delta=<pattern>` option, which disables delta
compression for objects whose paths match the pattern.

Signed-off-by: ZheNing Hu <adlternative@xxxxxxxxx>
---
pack-objects: introduce --exclude-delta= option

While analyzing some repositories using git filter-repo --analyze, I
noticed that many huge binaries in the repositories were
delta-compressed without much reduction in size.

$ cat .git/filter-repo/analysis/path-all-sizes.txt | more
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
    23816778   23765921 2022-08-22 managed/src/universal/ybc/ybc-1.0.0-b1-linux-x86_64.tar.gz
    22504398   22445676 2022-08-22 managed/src/universal/ybc/ybc-1.0.0-b1-el8-aarch64.tar.gz
    11726471    6424233 2022-08-09 managed/yba-installer/yba-installer_linux_amd64
   294644800    5794201            src/yb/master/catalog_manager.cc
     2912780    2872186            docs/static/images/yp/tables-view-ycql.png
     2992192    2634232            docs/static/images/yb-cloud/cloud-clusters-backups.png
     2757095    2501915            docs/static/images/deploy/aws/aws-cf-configure-options.png
...

The existing ways of avoiding delta compression are not well suited to
git servers. First, only files that exceed big_file_threshold are
exempt from delta compression, but the analysis above shows that many
big binary files do not exceed big_file_threshold (which defaults to
512MB). Second, these repositories have no .gitattributes entries that
disable delta compression for such files, and we cannot realistically
ask repository administrators to add them manually.

However, the large files in these repositories often share a common
characteristic: their names end in ".tar.gz" or ".png". So perhaps we
can take advantage of this property and disable delta compression on
the server for some common types of binary files.

This is currently implemented as the command-line option
--exclude-delta=<pattern>, but we could perhaps also pass it through
git config.

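To illustrate the suffix matching described above, here is a minimal
standalone sketch. It is not the code from this patch: it assumes POSIX
fnmatch(3) as a stand-in for git's gitignore-style pattern matching, and
the names path_basename() and skip_delta_for_path() are hypothetical.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the last component of a '/'-separated path. */
static const char *path_basename(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}

/*
 * Hypothetical stand-in for the check this patch adds to no_try_delta().
 * Returns 1 if the basename of `path` matches any of the `nr` glob
 * patterns, 0 otherwise. This uses POSIX fnmatch(3); git itself uses its
 * own pattern_list machinery, whose semantics differ in the details.
 */
static int skip_delta_for_path(const char *path,
			       const char *const *patterns, size_t nr)
{
	const char *base;
	size_t i;

	assert(path && patterns);

	base = path_basename(path);
	for (i = 0; i < nr; i++)
		if (!fnmatch(patterns[i], base, 0))
			return 1;
	return 0;
}
```

With the patterns "*.tar.gz" and "*.png", this would skip delta
compression for the tarballs and screenshots listed in the analysis
above while still attempting it for src/yb/master/catalog_manager.cc,
which is the file that actually benefits from it.
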
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1392%2Fadlternative%2Fadl%2Fpack-object-no-try-delta-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1392/adlternative/adl/pack-object-no-try-delta-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1392

 Documentation/git-pack-objects.txt |  6 +++++-
 builtin/pack-objects.c             | 28 +++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index a9995a932ca..92cfee83df5 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -13,7 +13,7 @@ SYNOPSIS
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
-	[--cruft] [--cruft-expiration=<time>]
+	[--cruft] [--cruft-expiration=<time>] [--exclude-delta=<pattern>]
 	[--stdout [--filter=<filter-spec>] | <base-name>]
 	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
 
@@ -221,6 +221,10 @@ depth is 4095.
 	This flag tells the command not to reuse existing deltas
 	but compute them from scratch.
 
+--exclude-delta=<pattern>::
+	Delta compression will not be attempted for blobs for paths
+	matching pattern. See linkgit:gitignore[5] for pattern details.
+
 --no-reuse-object::
 	This flag tells the command not to reuse existing object data at all,
 	including non deltified object, forcing recompression of everything.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3658c05cafc..ab9cff98e3a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -272,6 +272,8 @@ static struct commit **indexed_commits;
 static unsigned int indexed_commits_nr;
 static unsigned int indexed_commits_alloc;
 
+static struct pattern_list *exclude_delta_patterns;
+
 static void index_commit_for_bitmap(struct commit *commit)
 {
 	if (indexed_commits_nr >= indexed_commits_alloc) {
@@ -1315,13 +1317,20 @@ static void write_pack_file(void)
 static int no_try_delta(const char *path)
 {
 	static struct attr_check *check;
+	int dtype;
 
 	if (!check)
 		check = attr_check_initl("delta", NULL);
 	git_check_attr(the_repository->index, path, check);
 	if (ATTR_FALSE(check->items[0].value))
 		return 1;
-	return 0;
+
+	return exclude_delta_patterns &&
+		path_matches_pattern_list(path,
+					  strlen(path),
+					  path, &dtype,
+					  exclude_delta_patterns,
+					  the_repository->index) == MATCHED;
 }
 
 /*
@@ -4149,6 +4158,19 @@ static int option_parse_cruft_expiration(const struct option *opt,
 	return 0;
 }
 
+static int option_parse_exclude_delta(const struct option *opt,
+				      const char *arg, int unset)
+{
+	BUG_ON_OPT_NEG(unset);
+
+	if (!exclude_delta_patterns)
+		exclude_delta_patterns = xcalloc(1, sizeof(*exclude_delta_patterns));
+
+	if (arg)
+		add_pattern(arg, "", 0, exclude_delta_patterns, 0);
+	return 0;
+}
+
 struct po_filter_data {
 	unsigned have_revs:1;
 	struct rev_info revs;
@@ -4242,6 +4264,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
 		  N_("expire cruft objects older than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
+		OPT_CALLBACK_F(0, "exclude-delta", NULL, N_("pattern"),
+		  N_("disable delta compression for files matching pattern"),
+		  PARSE_OPT_NONEG, option_parse_exclude_delta),
 		OPT_BOOL(0, "sparse", &sparse,
 			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
@@ -4514,6 +4539,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 cleanup:
 	strvec_clear(&rp);
+	FREE_AND_NULL(exclude_delta_patterns);
 
 	return 0;
 }

base-commit: 1fc3c0ad407008c2f71dd9ae1241d8b75f8ef886
-- 
gitgitgadget