On 05/03/2017 09:38 AM, Jeff Hostetler wrote:
> On 3/8/2017 1:50 PM, git@xxxxxxxxxxxxxxxxx wrote:
>> From: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>
>>
>> [RFC] Partial Clone and Fetch
>> =============================
>> [...]
>>
>> E. Unresolved Thoughts
>> ======================
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the sizes
>> or sha1_object_info() data for them) during the fetch-pack/upload-pack
>> operation. This would allow the client to distinguish between invalid
>> SHAs and missing ones. Size information might also allow the client
>> to choose between various servers.
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
>
> https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@xxxxxxxxxxxxxx/T/
>
> I want to respond to both of these threads here.
> -------------------------------------------------
Thanks for your input. I see that you have explained both "storing
'positive' information about missing blobs" and "what to store with
that positive information"; I'll just comment on the former for now.
> Missing-Blob Support
> ====================
>
> Let me offer up an alternative idea for representing
> missing blobs. This differs from both of our previous
> proposals. (I don't have any code for this new proposal;
> I just want to think out loud a bit and see if this is a
> direction worth pursuing -- or a complete non-starter.)
>
> Both proposals talk about detecting and adapting to a missing
> blob and ways to recover when we fail to find a blob.
>
> Comments on the thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () the expense of blob searches.
>
> Suppose we store "positive" information about missing blobs?
> This would let us know that a blob is intentionally missing,
> and possibly some meta-data about it.
I thought about this (see "Some alternative designs" in [1]), listing
some similar benefits, but concluded that "it is difficult to scale to
large repos".
Firstly, to be clear, by large repos I meant (and mean) the svn-style
"monorepos" that Jonathan Nieder mentions as use case "A" [2].
My concern is that such lists (whether in separate file(s) or in .idx
files) would be too unwieldy to manipulate. Even if we design things to
avoid modifying such lists (for example, by adding a new list whenever
we fetch instead of trying to modify an existing one), we would at least
need to sort their contents (for example, when generating an .idx in the
first place). For a repo with 10M-100M blobs [3], this might be doable
on today's computers, but I would be concerned if a repo would exceed
such numbers.
[1] <20170426221346.25337-1-jonathantanmy@xxxxxxxxxx>
[2] <20170503182725.GC28740@xxxxxxxxxxxxxxxxxxxxxxxxx>
[3] In Microsoft's announcement of Git Virtual File System [4], they
mentioned "over 3.5 million files" in the Windows codebase. I'm not sure
if this refers to files in a snapshot (that is, working copy) or all
historical versions.
[4] https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/
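To put rough numbers on the concern above, here is a quick back-of-the-envelope calculation (the blob counts are the hypothetical 10M-100M figures from [3]; the arithmetic is mine):

```python
# Raw storage needed for a sorted list of 20-byte SHA-1 hashes at the
# repo scales discussed above (illustrative arithmetic only).
SHA1_BYTES = 20

for blobs in (10_000_000, 100_000_000):
    mib = blobs * SHA1_BYTES / (1024 * 1024)
    print(f"{blobs:>11,} blobs -> about {mib:,.0f} MiB of raw hashes")
```

Sorting and merging files of this size is feasible on today's machines, but no longer trivial, which is the core of the scalability worry.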
> 1. Suppose we update the .pack file format slightly.
>    () We use the value 5 in "enum object_type" to mean a
>       "missing-blob".
>    () We update git-pack-objects as I did in my RFC, but have it
>       create type 5 entries for the blobs that are omitted,
>       rather than nothing.
>    () Hopefully, the same logic that currently keeps pack-objects
>       from sending unnecessary blobs on subsequent fetches can
>       also be used to keep it from sending unnecessary missing-blob
>       entries.
>    () The type 5 missing-blob entry would contain the SHA-1 of the
>       blob and some meta-data to be explained later.
My original idea was to have sorted list(s) of hashes in separate
file(s) much like the currently existing shallow file; it would have the
semantics of "a hash here might be present or absent; if it is absent,
use the hook". (Initially I thought that one list would be sufficient,
but after reading your idea and considering it some more, multiple lists
might be better.)
Your idea of storing them in an .idx (and possibly corresponding .pack
file) is similar, I think. Although mine is probably simpler - at least,
we wouldn't need a new object_type.
As described above, I don't think this list-of-hashes idea will work
(because of the large numbers of blobs involved), but I'll compare it to
yours anyway just in case we end up being convinced that this general
idea works.
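For concreteness, a minimal sketch of the list-of-hashes idea as described above (all names are made up for illustration; the real lists would live in sorted on-disk files alongside the shallow file):

```python
import bisect

class MissingBlobList:
    """A sorted list of hashes known to be intentionally omitted."""

    def __init__(self, hashes):
        # On disk this would already be sorted; sorting here keeps the
        # sketch self-contained.
        self.hashes = sorted(hashes)

    def __contains__(self, sha1):
        # Binary search: membership in O(log n) comparisons.
        i = bisect.bisect_left(self.hashes, sha1)
        return i < len(self.hashes) and self.hashes[i] == sha1

# Each fetch could add a new list rather than rewriting an old one,
# so a repo would consult several such lists in turn.
```

A hash found in such a list carries the semantics given earlier: it might be present or absent, and if absent, the hook is used.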
> 2. Make a similar change in the .idx format and git-index-pack
>    to include them there. Then blob lookup operations could
>    definitively determine that a blob exists and is just not
>    present locally.
>
> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>       packfiles (because we don't have to assume that the blob
>       was just misplaced in another packfile).
>    () Lookup can still look for the corresponding loose blob
>       (in case a previous lookup already "faulted it in").
The binary search to lookup a packfile offset from a .idx file (which
involves disk reads) would take longer for all lookups (not just lookups
for missing blobs) - I think I prefer keeping the lists separate, to
avoid pessimizing the (likely) usual case where the relevant blobs are
all already in local repo storage.
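To illustrate the trade-off (my own purely illustrative figures): a binary search probes roughly log2(N) entries, so folding missing-blob entries into the same .idx lengthens every lookup, not just the lookups that end in a miss:

```python
import math

def probes(n_entries):
    # Approximate number of comparisons for a binary search
    # over n_entries sorted keys.
    return math.ceil(math.log2(n_entries))

packed = 1_000_000            # objects actually present in the pack
missing = 50_000_000          # hypothetical missing-blob entries
print(probes(packed))             # search over present objects only
print(probes(packed + missing))   # search if both share one .idx
```

With separate lists, the common all-blobs-present case pays only the first, smaller cost.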
> 4. We can then think about dynamically fetching it.
>    () Several techniques for this are currently being
>       discussed on the mailing list in other threads,
>       so I won't go into this here.
>    () There has also been debate about whether this should
>       yield a loose blob or a new packfile. I think both
>       forms have merit; the choice depends on whether we are
>       limited to asking for a single blob or can make a batch
>       request.
>    () A dynamically-fetched loose blob is placed in the normal
>       loose blob directory hierarchy so that subsequent
>       lookups can find it as mentioned above.
>    () A dynamically-fetched packfile (with one or more blobs)
>       is written to the ODB and then the lookup operation
>       completes.
>       {} I want to isolate these packfiles from the main
>          packfiles, so that they behave like a second-stage
>          lookup and don't affect the caching/LRU nature of
>          the existing first-stage packfile lookup.
>       {} I also don't want the ambiguity of having 2 primary
>          packfiles with a blob marked as missing in 1 and
>          present in the other.
With my idea, the second-stage lookup is done on the list of missing
hashes; there is no division between packfiles.
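A sketch of the lookup order under this separate-list scheme (hypothetical helper names, not git's actual code paths):

```python
def lookup(sha1, packs, loose, missing_lists, fetch_hook):
    # First stage: the normal object stores, unchanged and unpenalized.
    for pack in packs:
        obj = pack.get(sha1)
        if obj is not None:
            return obj
    if sha1 in loose:
        return loose[sha1]
    # Second stage: only on a miss do we consult the missing-hash lists.
    for mlist in missing_lists:
        if sha1 in mlist:
            return fetch_hook(sha1)   # dynamically fetch, e.g. via a hook
    raise KeyError("object not found; repository may be corrupt")
```

A hit in the second stage tells us the blob is intentionally absent rather than a sign of corruption.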
> 5. git-repack should be updated to "do the right thing" and
>    squash missing-blob entries.
>
> 6. And etc.