[PATCH/RFC 0/7] Add possibility to clone specific subdirectories

This patch series adds a `--sparse-prefix=` option to multiple commands,
allowing fetching repository contents from only a subdirectory of a remote.

This works along with sparse-checkout, and is especially useful for repositories
where a subdirectory has meaning when standing alone.

* Motivation (example use cases)

1. Git repositories used for managing large/binary files
  My university has a repository containing lecture slides etc. as PDFs,
  with a subdirectory for each lecture. Fetching the whole repository
  (even with --depth=1) transfers 4GiB and takes significant processing
  time, whereas fetching the complete history of a single lecture
  transfers 25MiB and completes instantly.
2. Package-manager-like repositories. Examples:
  a) Arch Linux package build files repository [1]
  b) Rust crates.io packages [2]
  c) TypeScript type definitions [3]
3. Excluding a specific directory containing e.g. large binary assets
  Not currently possible with this patch set, but could be added
  (see problem 2 below).
4. Getting the history of a single file
5. Other uses
  As someone who is not a kernel developer, I wanted to quickly search
  through the code of only the btrfs filesystem using the git tools, but I
  do not have a local clone of the complete repository. Using `--depth=100`
  in combination with `--sparse-prefix=/fs/btrfs` keeps bandwidth usage low
  while still retaining some history.
6. This is trivial in SVN, and an internet search turns up multiple
questions asking for this feature in Git [4-7].

* Example usage:

Getting the source of the btrfs filesystem with a bit of history:

    $ git clone git@server:linux --depth=100 # shallow, not sparse
    Receiving objects: 100% (814945/814945), 438.55 MiB | 35.21 MiB/s, done.
    ...
    $ git clone git@server:linux --depth=100 --sparse-prefix=/fs/btrfs # sparse and shallow
    Receiving objects: 100% (503747/503747), 121.45 MiB | 59.75 MiB/s, done.
    ...
    $ cd linux && ls ./
    fs
    $ ls fs/
    btrfs
    $ git log --oneline
    (repo behaves the same as a full clone with sparse-checkout /fs/btrfs)



* Open problems:

1. Currently all trees are still included. It would be possible to include
only the trees relevant to the sparse files, which would significantly reduce
pack sizes for repositories containing many small, frequently changing files
(for example, package managers using git). I am not sure in how many places
all trees are presumed to be present.

2. This patch set implements the filter as a simple single-prefix check given
as a command line option.
Using the exclude_list format (the same as in sparse-checkout) might be useful.
However, the server needs to check these patterns against all files in history,
so I'm not sure whether allowing multiple/complex patterns is a good idea.

3. This patch set assumes the sparse-prefix and sparse-checkout do not change.
Both clone and every later fetch need to be run with the --sparse-prefix=
option, otherwise complete packs will be fetched. I am not sure what the best
way to store this information is; one possibility is a new file `.git/sparse`,
similar to `.git/shallow`, containing the path(s).

4. Bitmap indices cannot be used, because they do not contain the paths of the
objects. So the whole DAG has to be walked when creating packs.

5. Fsck complains about missing blobs. This should be fairly easy to fix.

6. Tests and documentation are missing.
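
To make problems 1 and 2 above more concrete, here is a minimal sketch of
what a single-prefix check and a "is this tree still needed" check could look
like. This is not the code from this series: the function names
(skip_slash, path_in_sparse_prefix, tree_relevant_to_prefix) are hypothetical,
and the sketch assumes that paths use '/' as separator and that a leading '/'
in the prefix (as in --sparse-prefix=/fs/btrfs) is optional.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: ignore one leading '/', so "/fs/btrfs" == "fs/btrfs". */
static const char *skip_slash(const char *p)
{
	return *p == '/' ? p + 1 : p;
}

/*
 * Return true if `path` is the prefix itself or lies below it.
 * The match has to end on a path component boundary, so the prefix
 * "fs/btrfs" does not accidentally select "fs/btrfs-old/x.c".
 */
static bool path_in_sparse_prefix(const char *path, const char *prefix)
{
	size_t len;

	assert(path && prefix);
	path = skip_slash(path);
	prefix = skip_slash(prefix);

	/* Tolerate trailing slashes in the prefix ("fs/btrfs/"). */
	len = strlen(prefix);
	while (len > 0 && prefix[len - 1] == '/')
		len--;

	if (len == 0)
		return true; /* an empty prefix selects everything */
	if (strncmp(path, prefix, len) != 0)
		return false;
	return path[len] == '\0' || path[len] == '/';
}

/*
 * Return true if the tree at directory `dir` would still be needed:
 * either it lies inside the prefix, or it is an ancestor of the prefix
 * and is required to reach it from the root tree ("" is the root).
 */
static bool tree_relevant_to_prefix(const char *dir, const char *prefix)
{
	return path_in_sparse_prefix(dir, prefix) ||
	       path_in_sparse_prefix(prefix, dir);
}
```

Under these assumptions, with the prefix /fs/btrfs the trees for the root,
fs and everything below fs/btrfs would be kept, while a tree such as fs/ext4
could be left out of the pack. A pattern list in the exclude_list format
would replace the first function with a per-pattern match.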

[1]: https://git.archlinux.org/svntogit/packages.git/
[2]: https://github.com/rust-lang/crates.io-index
[3]: https://github.com/DefinitelyTyped/DefinitelyTyped
[4]: https://stackoverflow.com/questions/600079/is-there-any-way-to-clone-a-git-repositorys-sub-directory-only
[5]: https://stackoverflow.com/questions/11834386/cloning-only-a-subdirectory-with-git
[6]: https://askubuntu.com/questions/460885/how-to-clone-git-repository-only-some-directories
[7]: https://coderwall.com/p/o2fasg/how-to-download-a-project-subdirectory-from-github

Robin Ruede (7):
  list-objects: add sparse-prefix option to rev_info
  pack-objects: add sparse-prefix
  Skip checking integrity of files ignored by sparse
  fetch-pack: add sparse prefix to smart protocol
  fetch: add sparse-prefix option
  clone: add sparse-prefix option
  remote-curl: add sparse prefix

 builtin/clone.c        | 27 ++++++++++++++++++++++++---
 builtin/fetch-pack.c   |  6 ++++++
 builtin/fetch.c        | 19 ++++++++++++++-----
 builtin/pack-objects.c | 11 +++++++++++
 cache-tree.c           |  3 ++-
 connected.c            |  7 ++++++-
 fetch-pack.c           |  4 ++++
 fetch-pack.h           |  1 +
 list-objects.c         |  4 +++-
 remote-curl.c          | 17 ++++++++++++++++-
 revision.c             |  4 ++++
 revision.h             |  1 +
 transport.c            |  4 ++++
 transport.h            |  4 ++++
 upload-pack.c          | 15 ++++++++++++++-
 15 files changed, 114 insertions(+), 13 deletions(-)

-- 
2.9.1.283.g3ca5b4c.dirty
