Re: Git commit causes data download in partial clone

Jeff King <peff@xxxxxxxx> · Mon, 19 Feb 2024 21:43:10 -0500

On Sat, Feb 17, 2024 at 08:38:08PM +0000, charmocc wrote:

> I was recently exploring git partial clone feature because I wanted to
> contribute to repository which has a lot of binary files. My intent was to only
> add new files without modifying any existing ones and to download as few data
> as possible in the process. Here are the steps I followed:
> 
> $ git clone --no-checkout --filter=blob:none https://github.com/libretro-thumbnails/Nintendo_-_Nintendo_Entertainment_System.git nes
> $ cd nes
> $ echo foo > bar
> $ git add bar
> $ git commit bar # causes git fetch behind the scene and download of a lot of objects!
> 
> Now for reasons I don't understand the last command cause download of a lot of
> objects from remote (blobs) which is what I was trying to avoid. By enabling
> tracing options I can see that it runs fetch operation in the background:

I think what is happening is something like:

  1. You clone with --no-checkout, so you do not fetch any of the blobs.
     But you also have an empty index, with no entries at all.

  2. Running "git commit" is going to need all of those entries in the
     index (to compute the hash of the new tree). So it will read them
     from the tree of the current HEAD.

  3. When we load entries into the index, the usual next thing to do is
     to check them out. So rather than fetch them one by one as we do
     the actual checkout, the index-reading code collects all of the
     entries we don't have and then does a single fetch for them. This
     is prefetch_cache_entries() in read-cache.c.

Now obviously in your example, the "usual" thing is not happening; we do
not intend to write those entries into the working tree, so fetching
them is pointless.
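
If you want to see the starting state from steps 1 and 2 for yourself,
something like the sketch below may help. It is only an illustration under
my own assumptions: it builds a throwaway stand-in for the remote (the
temporary paths and the file names are made up), clones it over file://
with the same options you used, and then counts the objects that are
missing locally and the entries in the index.

```shell
#!/bin/sh
set -e

# Hypothetical scratch area; nothing here touches a real repository.
work=$(mktemp -d)

# A tiny stand-in for the remote, with two blobs in a sub-directory.
git init -q "$work/server"
cd "$work/server"
git config user.name "Example"
git config user.email "example@example.com"
git config uploadpack.allowFilter true   # let clients request --filter
mkdir Named_Titles
echo one >Named_Titles/one.png
echo two >Named_Titles/two.png
git add .
git commit -q -m initial

# Same clone options as in the report, but against the local stand-in.
git clone -q --no-checkout --filter=blob:none "file://$work/server" "$work/nes"
cd "$work/nes"

# Objects reachable from HEAD that we do not have locally are printed
# with a "?" prefix; count them.
missing=$(git rev-list --objects --missing=print HEAD | grep -c '^?')

# With --no-checkout the index has no entries at all.
entries=$(git ls-files | wc -l | tr -d ' ')

echo "missing objects: $missing, index entries: $entries"
```

Running the same rev-list command again after your "git commit" should
show whether the blobs were pulled in.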

There may be some room for improvement here. E.g., teaching the
index-reading code a flag that says "don't bother prefetching", and
using it in this call chain. I'm not sure if there would be other
gotchas, though.

But here are a few alternatives that you can try without making any code
changes:

  a. Your --no-checkout skips the checkout, but it does not tell Git
     that you are fundamentally uninterested in those other paths. To do
     that, you can try the sparse-checkout mechanism. I'm not super
     familiar with the feature myself, but doing:

       git clone --sparse --filter=blob:none $url nes

     ends up with an empty checkout to which you can add things (the
     trick is that we do have all of those index entries, but they are
     marked as "not interesting").

     Do note that --sparse checks out the contents of the top-level tree
     by default. That's OK for your repo (all of the files are in the
     Named_Titles directory), but it might not be true for some other
     repos (it may also not work if your intent is to put another entry
     into Named_Titles, though it looks like you might just need to say
     "git add --sparse").

  b. Skip the index entirely and just construct your own tree/commit.
     E.g., doing:

       blob=$(git hash-object -w some-file)
       tree=$({
                git ls-tree HEAD &&
                printf "100644 blob $blob\t%s\n" some-file
              } | git mktree --missing)
       commit=$(echo my commit message | git commit-tree -p HEAD $tree)
       git update-ref HEAD $commit

     It gets a little trickier if you want to add to a sub-directory
     (you have to recursively generate each tree).
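
For a single level of nesting, the sub-directory case from (b) could look
roughly like the sketch below. Treat it as an illustration rather than a
recipe: it runs in a scratch repository that it creates itself, and the
names (Named_Titles, new.png, the commit message) are placeholders I made
up. The idea is to rebuild the sub-directory tree with the extra entry,
then rebuild the top-level tree with the old entry for that directory
swapped for the new one.

```shell
#!/bin/sh
set -e

# Hypothetical scratch repository standing in for your clone.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"
mkdir Named_Titles
echo one >Named_Titles/one.png
echo readme >README
git add .
git commit -q -m initial

dir=Named_Titles   # sub-directory to add to (placeholder)
name=new.png       # name of the new file inside it (placeholder)

# Write the new blob straight into the object database.
blob=$(echo "new content" | git hash-object -w --stdin)

# 1. Sub-directory tree: the existing entries plus the new blob.
subtree=$({
    git ls-tree "HEAD:$dir" &&
    printf '100644 blob %s\t%s\n' "$blob" "$name"
} | git mktree --missing)

# 2. Top-level tree: every entry except the old $dir, plus the new subtree.
#    ls-tree separates the path from the rest with a tab, so awk can
#    match on the second tab-delimited field.
tree=$({
    git ls-tree HEAD | awk -F'\t' -v d="$dir" '$2 != d' &&
    printf '040000 tree %s\t%s\n' "$subtree" "$dir"
} | git mktree --missing)

# 3. Commit the tree on top of HEAD and move HEAD to the result.
commit=$(echo "add $dir/$name" | git commit-tree -p HEAD "$tree")
git update-ref HEAD "$commit"

git ls-tree -r --name-only HEAD
```

mktree sorts the entries itself, so the order in which they are fed in
does not matter. Deeper paths need the same replace-one-entry step
repeated once per directory level, from the innermost tree outwards.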

In both cases you might also want to clone with "--depth 1", so you do
not bother grabbing old commits and trees, either.
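
Going back to option (a), here is a small sketch of the sparse workflow
that you can run without touching the network. It rests on my own
assumptions: the "server" is a local stand-in repository with made-up
paths and file names, and the two uploadpack settings are only there so
that a file:// clone accepts the filter and can serve blobs on demand. It
prints the number of locally missing objects after the commit instead of
promising a particular value, since that may depend on your Git version.

```shell
#!/bin/sh
set -e

# Hypothetical scratch area with a local stand-in for the remote.
work=$(mktemp -d)
git init -q "$work/server"
cd "$work/server"
git config user.name "Example"
git config user.email "example@example.com"
git config uploadpack.allowFilter true          # accept --filter from clients
git config uploadpack.allowAnySHA1InWant true   # serve blobs requested by ID
mkdir Named_Titles
echo one >Named_Titles/one.png
echo two >Named_Titles/two.png
echo readme >README
git add .
git commit -q -m initial

# Sparse partial clone: only top-level files are checked out.
git clone -q --sparse --filter=blob:none "file://$work/server" "$work/nes"
cd "$work/nes"
git config user.name "Example"
git config user.email "example@example.com"

# The index lists every path, including the ones not in the working tree.
git ls-files

# Add a new top-level file and commit it the usual way.
echo foo >bar
git add bar
git commit -q -m "add bar"

# Count objects reachable from HEAD that are still missing locally
# (grep exits non-zero when the count is 0, hence the "|| true").
git rev-list --objects --missing=print HEAD | grep -c '^?' || true
```

If the final count is still non-zero, the blobs under Named_Titles were
never downloaded.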

> git version 2.34.1 (Ubuntu 22.04)

The sparse-checkout feature is new-ish and has been actively worked on
in the past few years. What I showed above works with the latest release
of Git, but you may or may not need to upgrade (I didn't dig into the
details).

-Peff