[PATCH 0/6] PATH WALK I: The path-walk API

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Introduction and relation to prior series
=========================================

This is a new series that rerolls the initial "path-walk API" patches of my
RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
path-walk.h) presents a new way to walk objects such that trees and blobs
are walked in batches according to their path.

This also replaces the previous version of ds/path-walk that was being
reviewed in [2]. The consensus was that the series was too long/dense and
could use some reduction in size. This series takes the first few patches,
but also makes some updates (which will be described later).

[1]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@xxxxxxxxx/
[RFC] Path-walk API and applications

[2]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@xxxxxxxxx/
[PATCH v2 00/17] pack-objects: add --path-walk option for better deltas

This series only introduces the path-walk API, but does so to the full
complexity required to later add the integration with git pack-objects to
improve packing compression in both time and space for repositories with
many name hash collisions. The compression also at least improves for other
repositories, but may not always have an improvement in time.

Some of the changes that are present in this series that differ from the
previous version are motivated directly by discoveries made by testing the
feature in Git for Windows and microsoft/git forks that shipped these
features for fast delivery of these improvements to users who needed them.
That testing across many environments informed some things that needed to be
changed, and in this series those changes are checked by tests in the
t6601-path-walk.sh test script and the test-tool path-walk test helper.
Thus, the code being introduced in this series is covered by tests even
though it is not integrated into the git executable.


Discussion of follow-up applications
====================================

By splitting this series out into its own, I was able to reorganize the
patches such that each application can be build independently off of this
series. These are available as pending PRs in gitgitgadget/git:

 * Better delta compression with 'git pack-objects' [3]: This application
   allows an option in 'git pack-objects' to change how objects are walked
   in order to group objects with the same path for early delta compression
   before using the name hash sort to look for cross-path deltas. This helps
   significantly in repositories with many name-hash collisions. This
   reduces the size of 'git push' pacifies via a config option and reduces
   the total repo size in 'git repack'.

 * The 'git backfill' command [4]: This command downloads missing blobs in a
   bloodless partial clone. In order to save space and network bandwidth, it
   assumes that objects at a common path are likely to delta well with each
   other, so it downloads missing blobs in batches via the path-walk API.
   This presents a way to use blobless clones as a pseudo-resumable clone,
   since the initial clone of commits and trees is a smaller initial
   download and the batch size allows downloading blobs incrementally. When
   pairing this command with the sparse-checkout feature, the path-walk API
   is adjusted to focus on the paths within the sparse-checkout. This allows
   the user to only download the files they are likely to need when
   inspecting history within their scope without downloading the entire
   repository history.

 * The 'git survey' command [5]. This application begins the work to mimic
   the behavior of git-sizer, but to use internal data structures for better
   performance and careful understanding of how objects are stored. Using
   the path-walk API, paths with many versions can be considered in a batch
   and sorted into a list to report the paths that contribute most to the
   size of the repository. A version of this command was used to help
   confirm the issues with the name hash collisions. It was also used to
   diagnose why some repacks using the --path-walk option were taking more
   space than without for some repositories. (More on this later.)

Question for reviewers: I am prepped to send these three applications to the
mailing list, but I'll refrain for now to avoid causing too much noise for
folks. Would you like to see them on-list while this series is under review?
Or would you prefer to explore the PRs ([3] [4] and [5])?

[3] https://github.com/gitgitgadget/git/pull/1819
PATH WALK II: Add --path-walk option to 'git pack-objects'

[4] https://github.com/gitgitgadget/git/pull/1820
PATH WALK III: Add 'git backfill' command

[5] https://github.com/gitgitgadget/git/pull/1821
PATH WALK IV: Add 'git survey' command


Structure of the Patch Series
=============================

This patch series attempts to create the simplest version of the API in
patch 1, then build functionality incrementally. During the process, each
change will introduce an update to:

 * The path-walk API itself in path-walk.c and path-walk.h.
 * The documentation of the API in
   Documentation/technical/api-path-walk.txt.
 * The test script t/t6601-path-walk.sh.

The core of the API relies on using a 'struct rev_info' to define an initial
set of objects and some form of a commit walk to define what range of
objects to visit. Initially, only a subset of 'struct rev_info' options work
as expected. For example:

 * Patch 1 assumes that only commit objects are starting positions, but the
   focus is on exploring trees and blobs.
 * Patch 3 allows users to specify object types, which includes outputting
   the visited commits in a batch.
 * Annotated tags and indexed objects are considered in Patch 4. These are
   grouped because they both exist within the 'pending' object list.
 * UNINTERESTING objects are not considered until Patch 5.


Changes since previous version
==============================

There are a few hard-won learnings from previous versions of this series due
to testing this in the wild with many different repositories.

 * Initially, the 'git pack-objects --path-walk' feature was not tested with
   the '--shallow' option because it was expected that this option was for
   servers creating a pack containing shallow commits. However, this option
   is also used when pushing from a shallow clone, and this was a critical
   feature that we needed to reduce the size of commits pushed from
   automated environments that were bootstrapped by shallow clones. The crux
   of the change is in Patch 5 and how UNINTERESTING objects are handled. We
   no longer need to push the UNINTERESTING flag around the objects
   ourselves and can use existing logic in list-objects.c to do so. This
   allows using the --objects-edge-aggressive option when necessary to
   reduce the object count when pushing from a shallow clone. (The
   pack-objects series expands on tests to cover this integration point.)

 * When looking into cases where 'git repack -adf' outperformed 'git repack
   -adf --path-walk', I discovered that the issue did not reproduce in a
   bare repository. This is due to 'git repack' iterating over all indexed
   objects before walking commits. I had inadvertently put all indexed
   objects in their own category, leading to no good deltas with previous
   versions of those files; I had also not used the 'path' option from the
   pending list, so these objects had invalid name hash values. You will see
   in patch 4 that the pending list is handled quite differently and the
   '--indexed-objects' option is tested directly within t6601.

 * I added a new 'test_cmp_sorted' helper because I wanted to simplify some
   repeated sections of t6601.

 * Patch 1 has significantly more context than it did before.

 * Annotated tags are given a name of "/tags" to differentiate them slightly
   from root trees and commits.

I will include a full range diff relative to the previous versions of these
patches in [2] in a reply to this cover letter.

Thanks, -Stolee

Derrick Stolee (6):
  path-walk: introduce an object walk by path
  test-lib-functions: add test_cmp_sorted
  t6601: add helper for testing path-walk API
  path-walk: allow consumer to specify object types
  path-walk: visit tags and cached objects
  path-walk: mark trees and blobs as UNINTERESTING

 Documentation/technical/api-path-walk.txt |  63 +++
 Makefile                                  |   2 +
 path-walk.c                               | 529 ++++++++++++++++++++++
 path-walk.h                               |  65 +++
 t/helper/test-path-walk.c                 | 114 +++++
 t/helper/test-tool.c                      |   1 +
 t/helper/test-tool.h                      |   1 +
 t/t6601-path-walk.sh                      | 366 +++++++++++++++
 t/test-lib-functions.sh                   |  10 +
 9 files changed, 1151 insertions(+)
 create mode 100644 Documentation/technical/api-path-walk.txt
 create mode 100644 path-walk.c
 create mode 100644 path-walk.h
 create mode 100644 t/helper/test-path-walk.c
 create mode 100755 t/t6601-path-walk.sh


base-commit: e9356ba3ea2a6754281ff7697b3e5a1697b21e24
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1818%2Fderrickstolee%2Fapi-upstream-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1818/derrickstolee/api-upstream-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1818
-- 
gitgitgadget




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux